AWS Data Wrangler Versions

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

2.19.0

1 year ago

Noteworthy

Glue Data Quality is now supported; check out the tutorial 🔥
Delta lake support by @fvaleye
New DynamoDB read_items method by @a-slice-of-py

Features & enhancements

feat: add read_items to dynamodb module by @a-slice-of-py in https://github.com/aws/aws-sdk-pandas/pull/1877
Add deltalake support in AWS S3 with Pandas by @fvaleye in https://github.com/aws/aws-sdk-pandas/pull/1834
Support pagination for timestream.list_databases and list_tables by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1846
(feat) glue data quality by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1861
Add unit test for evaluating two rulesets at once by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1871
(enhancement) Minor - wr.redshift.copy - pass through commit_transaction by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1878
(enhancement): Extend get and update ruleset DQ methods by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1882
enhancement: Adding filter to quicksight delete_all methods by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1913
enhancement: Support optional measure_name in wr.timestream.write() by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1925

Bug fixes

(fix) Check if timezone is present in column metadata by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1840
(fix) Include numpy==1.23.4 && poetry update by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1850
Fix apply_configs decorator causing function signature to be lost by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1858
forward use_threads to _validate_schemas_from_files by @robert-schmidtke in https://github.com/aws/aws-sdk-pandas/pull/1869
(fix) Minor - KeyError in wr.opensearch.search && cleanup tests by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1879
(fix): missing timestamp data type in Timestream by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1881
Fix the Athena cache unit test errors by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1883
(fix): Handle None in databases data types by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1892

Documentation

Document the create_csv_table function's sensitivity to column order by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1923
(docs) Add extension for ipython console highlighting by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1841
(feat) Minor - add sphinx copy button for code blocks by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1854

Tests

Test infra: Add NAT gateway IP addresses to base stack SSM parameters by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1847
Testing: Update Opensearch test output and fixture by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1848
(test-infra) Enable SSE, enforce HTTPS, enable node-to-node encryption by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1851
(tests) add workaround to enable deltalake to use AWS profile creds by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1934
Enable warn_unused_ignores for MyPy by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1860
Increase coverage for dynamodb write by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1893
Add tests for S3 wait functions by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1896
Increase coverage for s3.delete* by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1897
Increase S3 tests coverage by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1909
Add coverage report to tox by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1874
Add coverage section to pyproject by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1911
Deps: Update wheel 0.37.1 -> 0.38.1 by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1904
Add minimum coverage by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1927
refactor: quicksight test resources as fixtures by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1928

New Contributors

@fvaleye made their first contribution in https://github.com/aws/aws-sdk-pandas/pull/1834
@robert-schmidtke made their first contribution in https://github.com/aws/aws-sdk-pandas/pull/1869
@a-slice-of-py made their first contribution in https://github.com/aws/aws-sdk-pandas/pull/1877

Thanks

We thank the following contributors/users for their work on this release: @jaidisido, @kukushking, @LeonLuttenberger, @cnfait, @malachi-constant, @mdavis-xyz, @dydc, @enricomarchesin

Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/2.18.0...2.19.0

2.18.0

1 year ago

Noteworthy

Pyarrow 10 support 🔥 by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1731
Lambda layers now available in af-south-1 (Cape Town) 🌍 by @malachi-constant

Features & enhancements

Add unload_approach to athena.read_sql_table by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1634
Pass additional partition projection params to wr.s3.to_parquet & cat… by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1627
Regenerate poetry.lock with no update by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1663
Upgrading poetry installed in workflow by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1677
Improve bucketing series generation by casting only the required columns by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1664
Add get_query_executions generating DataFrames from Athena query executions detail by @KhueNgocDang in https://github.com/aws/aws-sdk-pandas/pull/1676
Dependency: Set Pandas Version != 1.5.0 due to memory leak by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1688
read_csv: read file as binary when encoding_errors is set to ignore by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1723
Deps: Remove upper bound limit on 'python' version by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1720
(enhancement) Redshift: Adding 'primary_keys' to parameter validation by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1728
Add describe_log_streams and filter_log_events to the CloudWatch module by @KhueNgocDang in https://github.com/aws/aws-sdk-pandas/pull/1785
Update lambda layers with pyarrow 10 by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1758
Add ctas_write_compression argument to athena.read_sql_query by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1795
Add auto termination policy to EMR by @vikramsg in https://github.com/aws/aws-sdk-pandas/pull/1818
timestream.query: add QueryId and NextToken to df attributes by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1821
Add support for boto3 kwargs to timestream.create_table by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1819
Adding args to submit spark step by @vikramsg in https://github.com/aws/aws-sdk-pandas/pull/1826

Bug fixes

Fix athena.read_sql_query for empty table and chunk size not returning an empty frame generator by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1685
Fixing index column validation in s3.read_parquet() validate schema by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1735
Bug: Replace extra_registries with extra_public_registries by @vikramsg in https://github.com/aws/aws-sdk-pandas/pull/1757
Fix: map datatype issue of athena by @pal0064 in https://github.com/aws/aws-sdk-pandas/pull/1753
Fix Redshift commands breaking with hyphenated table names by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1762
Add correct service names for timestream boto3 clients by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1716
Allow read partitions with extra = in the value by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1779
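
The last fix above concerns Hive-style partition paths whose values themselves contain an `=`. The following is a minimal sketch of the general idea, not the library's implementation: the function name `parse_partitions` and the example bucket path are hypothetical, chosen only for illustration.

```python
from typing import Dict


def parse_partitions(path: str) -> Dict[str, str]:
    """Extract Hive-style key=value partitions from a path (illustrative only)."""
    partitions: Dict[str, str] = {}
    for segment in path.strip("/").split("/"):
        if "=" not in segment:
            continue  # not a partition directory
        # Split on the first "=" only, so any later "=" stays part of the value.
        key, value = segment.split("=", 1)
        partitions[key] = value
    return partitions


# Hypothetical path: the value of "expr" contains an extra "=".
example = parse_partitions("s3://bucket/table/expr=a=b/year=2022/file.parquet")
```

Splitting with a limit of one keeps `a=b` intact as the value of `expr`, whereas an unlimited split would yield three parts for that segment.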

Documentation

Update install page in docs with screenshot of new managed layer name by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1636
Remove semicolon from python code eol in s3 tutorial by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1673
Consistent kernel for jupyter notebooks by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1674
Correct a few typos in our ipynb tutorials by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1694
Fix broken links in readme by @lucasasmith in https://github.com/aws/aws-sdk-pandas/pull/1702
Typos in comments and docs by @mycaule in https://github.com/aws/aws-sdk-pandas/pull/1761

Tests

Support for test infrastructure in private subnets by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1698
Upgrade engine versions to match defaults from aws console by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1709
Set redshift and Neptune clusters removal policy to destroy by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1675
Upgrade pytest-xdist by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1760
Fix timestream endpoint tests by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1781

New Contributors

@lucasasmith made their first contribution in https://github.com/aws/aws-sdk-pandas/pull/1702
@vikramsg made their first contribution in https://github.com/aws/aws-sdk-pandas/pull/1757
@mycaule made their first contribution in https://github.com/aws/aws-sdk-pandas/pull/1761
@pal0064 made their first contribution in https://github.com/aws/aws-sdk-pandas/pull/1753

Thanks

We thank the following contributors/users for their work on this release: @lucasasmith, @vikramsg, @mycaule, @pal0064, @LeonLuttenberger, @cnfait, @malachi-constant, @kukushking, @jaidisido

Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/2.17.0...2.18.0

3.0.0rc2

1 year ago

What's Changed

(enhancement): Enable missing unit tests and Redshift, Athena, LF load tests by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1736
(enhancement): configure scheduling options, remove dependencies on internal ray impl by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1734
(testing): Enable Athena and Redshift tests, and address errors by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1721
(feat): Make tqdm progress reporting opt-in by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1741

Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.0.0rc1...3.0.0rc2

3.0.0rc1

1 year ago

What's Changed

(enhancement): Move RayLogger out of non-distributed modules by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1686
(perf): Distribute data types inference by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1692
(docs): Update config tutorial to include new configuration values by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1696
(fix): partition block overwriting by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1695
(refactor): Optimize distributed CSV I/O by adding PyArrow-based datasource by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1699
(docs): Improve documentation on running SDK for pandas at scale by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1697
(enhancement): Apply modin repartitioning where required only by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1701
(enhancement): Remove local from ray.init call by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1708
(feat): Validate partitions along row axis, add warning by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1700
(feat): Expand SQL formatter to LakeFormation by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1684
(feat): Distribute parquet datasource and add missing features, enable all tests by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1711
(convention): Add Arrow prefix to parquet datasource for consistency by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1724
(perf): Distribute Timestream write with executor by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1715

Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.0.0b3...3.0.0rc1

3.0.0b3

1 year ago

What's Changed

(feat): Add partitioning on block level by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1653
(refactor): Make room for additional distributed engines by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1646
(feat): Distribute s3 write text by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1631
(docs): Add "Introduction to Ray" Tutorial by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1661
(fix): Return address config param by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1660
(refactor): Enable new engines with custom dispatching and other constructs by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1666
(deps): Uptick modin to 0.16 by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1659

Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.0.0b2...3.0.0b3

3.0.0b2

1 year ago

What's Changed

(feat) Update to Ray 2.0 by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1635
(feat) Ray logging by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1623
(enhancement): Reduce LOC in S3 write methods create_table by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1626
(docs) Tutorial: Run SDK for pandas job on ray cluster by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1616

Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.0.0b1...3.0.0b2

3.0.0b1

1 year ago

What's Changed

(test) Consolidate unit and load tests by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1525
(feat) Distribute S3 read text by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1567
(feat) Distribute s3 wait_objects by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1539
(test) Ray Load Tests CDK Stack and Instructions for Load Testing by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1583
(fix) Fix S3 read text not working with version ID by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1587
(feat) Add distributed s3 write parquet by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1526
(fix) Distribute write text regression, change to singledispatch, add repartitioning utility by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1611
(enhancement) Optimise distributed s3.read_text to load data in chunks by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1607

Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.0.0a2...3.0.0b1

2.17.0

1 year ago

New Functionalities

RedshiftDataAPI serverless support 🔥 #1530
- Check out the tutorial
Add get_query_results to the Athena module #1496
- Check out the function documentation
Add generate_create_query to the Athena module #1514
- Check out the function documentation

Enhancements

Returning empty DataFrame for empty Timestream query #1430
Added support for INSERT IGNORE for mysql.to_sql #1429
Added use_column_names to redshift.copy akin to redshift.to_sql #1437
Enable passing kwargs to redshift.connect #1467
Add timestream_endpoint_url property to the config #1483
Add support for upserting to an empty Glue table #1579

Documentation

Fix typos in documentation #1434

Bug Fix

validate_schema=True for wr.s3.read_parquet breaks with partition columns and dataset=True #1426
wr.neptune.to_property_graph failing for Neptune version 1.1.1.0 #1407
ValueError when using opensearch.index_df with documents with an array field #1444
Missing catalog_id in wr.catalog.create_database #1480
Check for pair of brackets in query preparation for Athena cache #1529
Fix wrong type hint for TagColumnOperation in quicksight.create_athena_dataset #1570
s3.to_json compression parameter is passed twice when dataset=True #1585
Cast Athena array, map & struct types to pandas object #1581
In the OpenSearch module, use SSL only for HTTPS (port 443) #1603
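
The Athena cache fix above (#1529) mentions checking for a pair of brackets during query preparation. As a rough sketch of why that matters when comparing query strings, and assuming a simple whitespace-and-case normalization that is not taken from the library's code, the hypothetical helpers below strip outer brackets only when they really enclose the whole query.

```python
def _outer_pair_encloses(text: str) -> bool:
    """Return True if the first "(" is closed by the very last character."""
    depth = 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return i == len(text) - 1
    return False


def normalize_query(query: str) -> str:
    """Canonical form of a SQL string for comparison (illustrative only)."""
    # Remove all whitespace and lowercase so formatting differences are ignored.
    q = "".join(query.split()).lower()
    # Drop a single trailing semicolon, if present.
    if q.endswith(";"):
        q = q[:-1]
    # Strip outer brackets only when they form one enclosing pair.
    while q.startswith("(") and q.endswith(")") and _outer_pair_encloses(q):
        q = q[1:-1]
    return q
```

A blind strip of leading and trailing brackets would damage a query such as `(select a) union (select b)`, whose first and last characters are brackets that do not belong to the same pair.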

Noteworthy

AWS Lambda Managed Layers

Since the last release, the library has been accepted as an official SDK for AWS and rebranded as AWS SDK for pandas 🚀. The Python module names remain the same. One noteworthy change, however, is that the AWS Lambda Managed Layer has been renamed from AWSDataWrangler to AWSSDKPandas.

You can view the ARN value for the layers here.

PyArrow 7 Support

⚠️ For platforms without PyArrow 7 support (e.g. MWAA, EMR, Glue PySpark Job):

pip install pyarrow==2 awswrangler

Thanks

We thank the following contributors/users for their work on this release:

@bechbd, @maxispeicher, @timgates42, @aeeladawy, @KhueNgocDang, @szemek, @malachi-constant, @cnfait, @jaidisido, @LeonLuttenberger, @kukushking

3.0.0a2

1 year ago

This is a pre-release for the Wrangler@Scale project

What's Changed

(feat): Add directory for Distributed Wrangler Load Tests by @malachi-constant in https://github.com/awslabs/aws-data-wrangler/pull/1464
(CI): Distribute tests in tox config by @malachi-constant in https://github.com/awslabs/aws-data-wrangler/pull/1469
(feat): Distribute s3 delete objects by @malachi-constant in https://github.com/awslabs/aws-data-wrangler/pull/1474
(CI): Enable new CI pipeline for standard & distributed tests by @malachi-constant in https://github.com/awslabs/aws-data-wrangler/pull/1481
(feat): Refactor to distribute s3.read_parquet by @jaidisido in https://github.com/awslabs/aws-data-wrangler/pull/1513
(bug): s3 delete tests failing in distributed codebase by @malachi-constant in https://github.com/awslabs/aws-data-wrangler/pull/1517

Full Changelog: https://github.com/awslabs/aws-data-wrangler/compare/3.0.0a1...3.0.0a2

3.0.0a1

1 year ago

This is a pre-release for the Wrangler@Scale project

What's Changed

(feat): Add distributed config flag and initialise method by @jaidisido in https://github.com/awslabs/aws-data-wrangler/pull/1389
(feat): Add distributed Lake Formation read by @jaidisido in https://github.com/awslabs/aws-data-wrangler/pull/1397
(feat): Distribute S3 select over multiple paths and scan ranges by @jaidisido in https://github.com/awslabs/aws-data-wrangler/pull/1445
(refactor): Refactor threading/ray; add single-path distributed s3 select impl by @kukushking in https://github.com/awslabs/aws-data-wrangler/pull/1446

Full Changelog: https://github.com/awslabs/aws-data-wrangler/compare/2.16.1...3.0.0a1