AWS Data Wrangler Versions

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

3.4.0

7 months ago

Features/Enhancements 🚀

  • Geospatial - parse Athena geospatial types via geopandas by @kukushking in #2346
  • Allow group identifiers to be used in wr.cloudwatch queries by @LeonLuttenberger in #2430
  • Add ignore null store parquet metadata by @raaidarshad in #2450

Bug fixes 🐛

  • Add missing boto3 session in athena.to_iceberg wait_query by @jaidisido in #2428
  • Add catalog ID in athena.to_iceberg by @jaidisido in #2446
  • Return None for missing column and partition key comment by @robert-schmidtke in #2449
  • Fix urllib3 error when building AWS Lambda Layers by @LeonLuttenberger in #2447
  • Duplicate schema argument in wr.s3.to_parquet by @kukushking in #2455

Tests 🧪

  • Test dependabot groups feature by @jaidisido in #2426

New Contributors

Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.3.0...3.4.0

3.3.0

9 months ago

Features/Enhancements 🚀

Bug fixes 🐛

Tests 🧪

New Contributors

Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.2.1...3.3.0

3.2.1

10 months ago

Fixes 🛠️

  • Fix error where library could not be imported on Windows due to No module named 'pyarrow._orc' by @LeonLuttenberger in #2341 #2337
  • Lower packaging version requirement by @LeonLuttenberger in #2340
  • Allow Ray 2.5 & downgrade tox by @kukushking in #2338

Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.2.0...3.2.1

3.2.0

10 months ago

Features/Enhancements 🚀

Bug fixes 🐛

Documentation 📚

Tests 🧪

Refactoring 🛠️

New Contributors

Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.1.1...3.2.0

3.1.1

11 months ago

What's Changed

Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.1.0...3.1.1

3.1.0

11 months ago

Features/Enhancements 🚀

  • Add neptune.bulk_load for bulk loading data into Neptune by @LeonLuttenberger in #2238 #2267
  • Add s3.to_deltalake function by @LeonLuttenberger in #2228
  • Add Timestream Batch Load support by @jaidisido in #2214
  • Add Iceberg insert by @kukushking in #2233
  • Support upsert mode for OracleDB by @LeonLuttenberger in #2265
  • Add chunked parameter to DynamoDB read functions by @LeonLuttenberger in #2227
  • Upgrade Modin to 0.20.1 & allow Ray 2.4 by @kukushking in #2234
  • Support Glue Connection SSM credential type by @kukushking in #2232
  • Add ability to pass schema to S3 Select by @kukushking in #2237
  • Add dynamic classification EMR config by @LLejoly in #2250
  • Add support for server-side cursors in PostgreSQL module by @kukushking in #2262
  • Add time unit to Timestream write API by @jaidisido in #2263
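The idea behind a chunked read can be sketched without any AWS dependency. The function below is hypothetical (it is not the awswrangler signature) and stands in for a table with an in-memory list; it only shows how a reader can hand back results in bounded batches instead of one large object.

```python
from typing import Dict, Iterator, List

Item = Dict[str, object]


def read_items_sketch(items: List[Item], chunk_size: int) -> Iterator[List[Item]]:
    """Yield successive slices of `items`, each with at most `chunk_size` entries."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(items), chunk_size):
        yield items[start : start + chunk_size]


# Hypothetical stand-in for rows returned by a table read.
table = [{"id": i} for i in range(5)]
chunks = list(read_items_sketch(table, chunk_size=2))
print([len(chunk) for chunk in chunks])  # [2, 2, 1]
```

Because the reader is a generator, a caller can process each batch and discard it before the next one is produced, which keeps peak memory bounded by the chunk size.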

Fixes 🛠️

  • Set ignore_metadata to False by default by @jaidisido in #2206
  • Fix conflicting types for path_ignore_suffix by @LeonLuttenberger in #2240
  • Athena workgroup query engine v3 upgrade artifacts by @kukushking in #2243
  • Fixing test_spectrum_decimal_cast test by @LeonLuttenberger in #2244
  • emr.create_cluster was not passing security configuration to internal method by @malachi-constant in #2246
  • Fix pagination in timestream.list_tables by @SukruHan in #2275

Documentation 📚

  • Include our ADRs in GitHub by @LeonLuttenberger in #2215 #2259
  • Fixes in the Athena Cache tutorial by @patrick-muller in #2201
  • Write ADR for the switching between PyArrow and Pandas I/O functions by @LeonLuttenberger in #2245
  • Fix "about" URL in README by @CGarces in #2207
  • Update layers.rst with Python 3.10 layers by @LeonLuttenberger in #2219
  • Fix links to 'Who uses library' section by @LeonLuttenberger in #2241
  • Declutter function overloads by extracting overloads to pyi files by @LeonLuttenberger in #2229 #2255 #2256

Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.0.0...3.1.0

3.0.0

1 year ago

Breaking changes 💥

  • Move dependencies to optional by @jaidisido in #1992 🔓
    • Dependencies required by the following modules have been moved to optional: redshift, mysql, postgres, sqlserver, oracle, gremlin, sparql, deltalake
    • The required dependencies can be easily installed with pip install awswrangler[<MODULE_NAME>], for example pip install awswrangler[redshift]
  • Change SQL formatters for Athena and LakeFormation so that they properly format types by @Taragolis and @LeonLuttenberger in #1416 #1543 #1684 💾
    • For example a parameter of type dt.datetime is parsed into DATETIME xxxx-xx-xx xx:xx:xx, while a parameter of type str is formatted into "x"
  • Refactor function signatures so that closely related parameters are grouped into a single parameter defined as a TypedDict by @LeonLuttenberger and @kukushking in #1855 #1996 #2016 #2055 #2081 💼
    • Glue catalog parameters are grouped together in to_parquet, to_csv and to_json
    • Athena UNLOAD and CTAS parameters are grouped together
  • Deprecate wr.s3.merge_upsert_table by @kukushking in #2076 ⚠️
  • Deprecate updated_name parameter in update_ruleset by @jaidisido in #2122 ⚠️
  • Stop support for Python 3.7 ⚠️
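The parameter grouping described in the list above can be pictured with a small, library-independent sketch. The names used here (`GlueTableSettingsSketch`, `write_dataset`) are hypothetical and are not the awswrangler API; they only illustrate how several related keyword arguments can collapse into a single `TypedDict` argument.

```python
from typing import Dict, TypedDict


class GlueTableSettingsSketch(TypedDict, total=False):
    """Hypothetical grouping of catalog-related options (not the real awswrangler type)."""

    description: str
    parameters: Dict[str, str]
    columns_comments: Dict[str, str]


def write_dataset(path: str, table_settings: GlueTableSettingsSketch) -> Dict[str, object]:
    """Pretend writer: returns the resolved options instead of touching S3 or Glue."""
    return {
        "path": path,
        "description": table_settings.get("description", ""),
        "parameters": table_settings.get("parameters", {}),
        "columns_comments": table_settings.get("columns_comments", {}),
    }


# All keys are optional (total=False), so callers pass only what they need.
settings: GlueTableSettingsSketch = {
    "description": "Daily sales",
    "columns_comments": {"amount": "Order total in USD"},
}
resolved = write_dataset("s3://example-bucket/sales/", settings)
print(resolved["description"])
```

Grouping options this way keeps the function signature short and lets a type checker validate the keys of the settings dictionary.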

New functionalities 🚀

AWS SDK for pandas can now run at scale 🚀💻🚀

Tutorials

AWS Blogs

Features/Enhancements 🚀

  • Thread-safety improvements by @kukushking in #2186
  • Allow Python 3.11 by @kukushking in #2101 🐍
  • Add use_threads parameter to dynamodb.read_items by @LeonLuttenberger in #2113 📈
  • Distribute wr.dynamodb.put_df with executor task by @LeonLuttenberger in #2118 📈
  • Add additional arg for glue database DatabaseInput by @malachi-constant in #2067 🔧
  • Add overloads for functions which can have multiple return value types by @LeonLuttenberger in #1855
  • Add support for boto3 kwargs to timestream.create_table by @cnfait in #1819
  • Upgrade Ray to 2.2.x and PyArrow to 7+ by @LeonLuttenberger in #1865
  • Upgrade to Ray 2.0 by @kukushking in #1635
  • Add partitioning on block level by @kukushking in #1653
  • Use fast file metadata provider by @kukushking in #1997
  • Distribute DynamoDB Parallel Scan by @jaidisido in #1981
  • Add faster Pyarrow S3fs listing in distributed mode by @jaidisido in #2030
  • Add distributed variant of the _read_parquet_metadata_file function based on the PyArrow file system by @LeonLuttenberger in #2050
  • Validate distributed kwargs by @kukushking in #2051
  • Add @Experimental and @Deprecated annotations by @kukushking in #2062
  • Distribute S3 describe_objects by @jaidisido in #2069
  • Distributed S3 copy/merge by @kukushking in #2070
  • Add bulk_read option for reading large amounts of Parquet files quickly by @LeonLuttenberger in #2033
  • Deprecate boto3 resources by @kukushking in #2097
  • Add retries for s3 select by @kukushking in #1780
  • Make tqdm progress reporting opt-in by @kukushking in #1741
  • Distribute data types inference by @jaidisido in #1692
  • Change to singledispatch, add repartitioning utility, fix distributed write text regression by @kukushking in #1611
  • Optimize distributed CSV I/O by adding PyArrow-based datasource by @LeonLuttenberger in #1699
  • Configure scheduling options, remove dependencies on internal ray impl by @kukushking in #1734
  • Validate partitions along row axis, add warning by @kukushking in #1700
  • Refactor executor module by @kukushking in #2120
  • Distribute parquet datasource and add missing features, enable all tests by @kukushking in #1711
  • Distribute Timestream write with executor by @jaidisido in #1715
  • Distribute s3.to_json and s3.to_csv by @LeonLuttenberger in #1631
  • Distribute s3.read_csv, s3.read_json and s3.read_fwf by @LeonLuttenberger in #1567 #1607
  • Distribute s3.wait_objects by @LeonLuttenberger in #1539
  • Distribute s3.to_parquet by @kukushking in #1526
  • Distribute s3.delete_objects by @malachi-constant in #1474
  • Distribute s3.read_parquet by @jaidisido in #1513
  • Add ThreadPoolExecutor and RayExecutor; refactor threading/ray; add single-path distributed s3.select_query by @kukushking in #1446
  • Add distributed Lake Formation read by @jaidisido in #1397
  • Refactor ray datasources by @kukushking in #1687
  • Distribute S3 select over multiple paths and scan ranges by @jaidisido in #1445
  • Add Literal typing for mode and projection_types by @LeonLuttenberger in #2191

Fixes 🛠️

  • Sanitize bucketing col names by @kukushking in #2155
  • Allow writing files from an empty dataframe by @malachi-constant in #2045
  • Athena out of bound dates by @kukushking in #2180
  • Fix partition block overwriting by @kukushking in #1695
  • Distrib S3 Select - check row count before creating the Ray dataset by @kukushking in #1808
  • Allow to pass pandas dfs to Ray/Modin calls by @kukushking in #1812
  • Add retries to read_parquet_metadata_distributed by @jaidisido in #2196
  • Fix default utcnow argument in start_query by @LeonLuttenberger in #2193

Documentation 📚

  • Athena Iceberg tutorial by @kukushking in #2117
  • Add at scale section by @kukushking in #2119
  • Documentation spell-checking improvements by @LeonLuttenberger in #2165
  • Add AWS Glue on Ray docs by @jaidisido in #1810
  • Update config tutorial to include new configuration values by @LeonLuttenberger in #1696
  • Improve documentation on running SDK for pandas at scale by @jaidisido in #1697
  • Add "Introduction to Ray" Tutorials by @LeonLuttenberger in #1661
  • Add SDK for pandas job on ray cluster tutorial by @malachi-constant in #1616
  • Add typeddicts to docs by @LeonLuttenberger in #2167

Tests 🧪

  • Add PR linter Github action by @jaidisido in #2106
  • Replace load tests bucket with SSM parameter by @jaidisido in #2121
  • opensearch index cleanup / skip by @kukushking in #2149
  • Add benchmark tests by @jaidisido in #2143
  • Add tests for Glue Ray jobs by @LeonLuttenberger in #1832
  • Remove awswrangler.distributed from coverage report by @LeonLuttenberger in #1884
  • Consolidate unit and load tests by @jaidisido in #1525
  • Distribute tests in tox config by @malachi-constant in #1469

Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/2.20.1...3.0.0

2.20.1

1 year ago

What's Changed

Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/2.20.0...2.20.1

3.0.0rc3

1 year ago

What's Changed

Breaking changes:

Features/Enhancements:

Fixes:

Documentation:

Tests:

Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.0.0rc2...3.0.0rc3

2.20.0

1 year ago

Breaking changes

  • dynamodb.read_partiql no longer performs a Scan operation under the hood; it uses the ExecuteStatement API instead. As a result, the PartiQL* IAM permissions are required instead of Scan
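As a rough illustration of the permission change, the sketch below builds an IAM policy document as a Python dictionary. It assumes read_partiql is used for SELECT statements only, so it grants `dynamodb:PartiQLSelect`; the table ARN is a made-up placeholder, and other statement types would need the corresponding PartiQL actions.

```python
import json

# Hypothetical table ARN; replace region, account and table name with your own.
TABLE_ARN = "arn:aws:dynamodb:us-east-1:123456789012:table/example-table"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            # PartiQL SELECT statements are authorized by this action rather than dynamodb:Scan.
            "Action": ["dynamodb:PartiQLSelect"],
            "Resource": [TABLE_ARN],
        }
    ],
}

# Serialize to JSON, the format IAM expects for policy documents.
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```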

Noteworthy

What's Changed

Documentation

Tests

New Contributors

Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/2.19.0...2.20