

AWS Data Wrangler

Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).


⚠️ For platforms without PyArrow 3 support (e.g. EMR, Glue PySpark Job):
➡️ pip install pyarrow==2 awswrangler

Table of contents

Quick Start


⚠️ For platforms without PyArrow 3 support (e.g. EMR, Glue PySpark Job):
➡️ pip install pyarrow==2 awswrangler

Read The Docs

Community Resources

Please send a Pull Request with your resource reference and @githubhandle.

Logging

Who uses AWS Data Wrangler?

Knowing which companies are using this library is important to help prioritize the project internally.

Please send a Pull Request with your company name and @githubhandle if you are able to share it.

What is Amazon SageMaker Data Wrangler?

Amazon SageMaker Data Wrangler is a new SageMaker Studio feature that has a similar name but a different purpose from the AWS Data Wrangler open source project.

  • AWS Data Wrangler is open source, runs anywhere, and is focused on code.

  • Amazon SageMaker Data Wrangler is specific for the SageMaker Studio environment and is focused on a visual interface.

  • When reading timestamp values that are out of bounds for Timestamp[ns], the values just get wrapped around to different timestamps (as far as I can tell). This only seems to be a problem when you set ctas_approach=True in the athena.read_sql_query function; if you set it to False, the data is read in correctly. So I guess there is a mismatch between how Athena writes the data to Parquet and how wrangler reads the Parquet file back?

    Sorry this is fairly long, as I tried to make something reproducible. If you already have tables that have columns with out-of-bounds (for ns) timestamps, you can skip to my # post-setup comment in the code.

    Some further notes

    In the example above, if you set ctas_approach = False you will get a Pandas error from trying to apply the dt accessor to something that isn't a datetime. However, on our tables where the data is written to Parquet via other means (Spark or PyArrow), querying those values with ctas_approach=False does not produce the timestamp-wrapping error. If you need it, I can try to create a reproducible example that writes Parquet files to S3 using Arrow (or something), but this is hopefully enough?

    For our ETL pipelines we now use Arrow to read in data and then strictly convert the timestamps to pandas object arrays to avoid Timestamp[ns] issues. I don't know if you have an option in this function or package to do the same (or could you point me in the right direction to PR it)? I am not 100% sure whether this is actually a pandas read issue rather than something like a mismatch in the Parquet metadata between the Athena engine and wrangler. Finally, dates are unaffected; the resulting output for the d column was correct.

    Also worth noting we are still using the Athena v1 engine.
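
    A minimal sketch of the two read paths being compared (database, table and column names are placeholders):

    import awswrangler as wr

    # ctas_approach=True materializes the query as Parquet via CTAS and reads
    # that Parquet back; this is the path where out-of-bounds timestamps
    # appear to wrap.
    df_ctas = wr.athena.read_sql_query(
        "SELECT my_ts_col FROM my_table",
        database="my_db",
        ctas_approach=True,
    )

    # ctas_approach=False reads the regular Athena result set instead, and the
    # timestamps come back as expected.
    df_plain = wr.athena.read_sql_query(
        "SELECT my_ts_col FROM my_table",
        database="my_db",
        ctas_approach=False,
    )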

  • Hi – wr.s3.store_parquet_metadata is not correctly converting a category column to string.

    The Glue table that gets written using wr.s3.store_parquet_metadata is created, but the 'ticker' field gets converted to string and comes back null.

    I tried setting dtype={'ticker': 'string'} but it is still null. Any thoughts?
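
    A hedged workaround sketch (bucket, database and table names are made up): cast the pandas category column to string before writing, so the Parquet schema and the Glue table created by store_parquet_metadata agree.

    import pandas as pd
    import awswrangler as wr

    df = pd.DataFrame({"ticker": pd.Categorical(["AAPL", "MSFT"]), "price": [1.0, 2.0]})

    # Cast the category column to a plain string dtype before writing the data.
    df["ticker"] = df["ticker"].astype("string")
    wr.s3.to_parquet(df, path="s3://my-bucket/prices/", dataset=True)

    # Then infer and store the Glue table metadata from the Parquet files.
    wr.s3.store_parquet_metadata(
        path="s3://my-bucket/prices/",
        database="my_db",
        table="prices",
        dataset=True,
    )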

  • P.S. Don't attach files. Please prefer adding code snippets directly in the message body.

    The error said an integer is required. I have tried to be more specific in the query to identify which column is causing the problem. It turns out to be the datetime column. In MySQL the data type is datetime, and in Glue the data type is timestamp.

    There is no problem with other tables. Not sure why this happens.

  • glueContext.write_dynamic_frame.from_jdbc_conf() as below?
    datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=datasource0,
        catalog_connection="test_red",
        connection_options={
            "preactions": "truncate table target_table;",
            "dbtable": "target_table",
            "database": "redshiftdb",
        },
        redshift_tmp_dir="s3://s3path",
        transformation_ctx="datasink4",
    )

  • Description of changes:
    initialize code: only structure and _write functions signatures

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

  • I am trying to send a DataFrame to S3 by using a previously created boto3 session and get the following error:

    I believe it is trying to get the credentials from the default profile in ~/.aws/credentials.
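
    One way to rule out the default profile (profile, region and bucket names below are placeholders) is to build the boto3 session explicitly and pass it to the call:

    import boto3
    import pandas as pd
    import awswrangler as wr

    df = pd.DataFrame({"col": [1, 2, 3]})

    # Build the session explicitly and pass it along, so awswrangler does not
    # fall back to the default profile in ~/.aws/credentials.
    session = boto3.Session(profile_name="my_profile", region_name="us-east-1")
    wr.s3.to_parquet(
        df=df,
        path="s3://my-bucket/my-prefix/data.parquet",
        boto3_session=session,
    )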

  • Description of changes:
    I’ve tried to add a first draft for generating HTML docs out of the existing notebooks. However this approach has two major downsides:

    Nevertheless, the generation works more or less (some CSS fine-tuning is probably required) and the notebooks are included in the generated docs. So maybe this state can serve as a starting point for discussion.

    PS: @igorborgest There were some leading-whitespace copy-paste "errors" in #562. These are also fixed in this PR, but I can also create a separate one for that.

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

  • I would like to open an issue as we have seen quite unsatisfactory performance using the read_parquet function. Our setup and data are described below:

    I’ve run a couple of tests to verify whether there would be any speed improvement if I passed a list of prefixes for the function to combine instead of using the partition_filter but the gain was marginal. Enabling use_threads=True gave no improvement. Overall it takes around 13 minutes to collect all files… this is just too long. Downloading them with aws sync takes a few seconds.

    Our main use case for operating on streams is in AWS Batch. We have some data loaders that use the data wrangler when we train our ML model in AWS Batch. We realized after some time that the main contributor to the extended training time is the part where the data is collected from AWS using the data wrangler (primarily wr.s3.read_parquet). Please also note that we're not talking about big data here; most of our use cases are like the one described above.

    At the moment we’re wondering whether this can be optimized or if we should move away from the streaming approach, and simply download the data on the container for model training. Could you give some advice? What’s your take on that?
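
    For reference, this is roughly the access pattern in question (bucket, partition names and the filter value are illustrative):

    import awswrangler as wr

    # Read a partitioned Parquet dataset, pruning partitions with a filter
    # callable; use_threads=True parallelizes the S3 downloads.
    df = wr.s3.read_parquet(
        path="s3://my-bucket/my-dataset/",
        dataset=True,
        partition_filter=lambda x: x["date"] == "2021-09-01",
        use_threads=True,
    )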

  • I am attempting to access data in our s3 datalake. Since production systems are writing the data I want along with other files that I don’t want, simply using a prefix is insufficient to get the data I need.

    Using awswrangler for the above task isn’t viable. I know that boto3 doesn’t allow for wildcard filtering, but surely it must be doable if dask is able to implement that functionality?
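
    A possible client-side workaround (prefix and pattern are made up): list the keys first, filter them with a wildcard, and pass the matching keys to the reader, which accepts a list of paths.

    import fnmatch
    import awswrangler as wr

    keys = wr.s3.list_objects("s3://my-datalake/raw/")
    wanted = [key for key in keys if fnmatch.fnmatch(key, "*/events_*.parquet")]

    # read_parquet accepts either a prefix or an explicit list of object paths.
    df = wr.s3.read_parquet(path=wanted)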

  • I’m not sure the s3.read_csv function really reads a csv in chunks. I noticed that for relatively big dataframes, running the following instruction takes an abnormally large amount of time:

    I’m running awswrangler==1.1.2 (installed with poetry) but I quickly tested 1.6.3 and it seems the issue is there too.

    I compared two different ways to load the first 100 lines of a “big” (1.2 GB) dataframe from S3:

    The timings are more or less reproducible. After comparing the last two timings, I suspect that the chunksize parameter is ignored: it takes more or less the same amount of time to load 100 lines of the file as to read the full file.
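
    For context, this is the chunked call being timed (the path is illustrative); with chunksize set, read_csv is supposed to return an iterator of DataFrames rather than loading the whole file:

    import awswrangler as wr

    # Expectation: only the first chunk should need to be parsed to get 100 rows.
    chunks = wr.s3.read_csv("s3://my-bucket/big-file.csv", chunksize=100)
    first_100_rows = next(iter(chunks))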

  • Description of changes: in awswrangler/s3/_write_text.py in the to_json function I changed ‘filename.csv’ to ‘filename.json’.

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

  • When adding a CSV partition to a Glue Catalog, table names weren’t being sanitized.

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

  • Issue:
    You can pass either a boolean or an integer to use_threads. This was documented in some functions, but not all.

    Description of changes:
    I’ve updated all functions that use use_threads to:

    Note that I have not manually checked every single function to confirm this behavior is correct. From a quick check, it seems to be the case. I hope one of the maintainers can confirm.

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
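
    A quick illustration of the two accepted forms described above (bucket and prefix are placeholders):

    import awswrangler as wr

    # use_threads=True lets the library pick the thread count (os.cpu_count()),
    # while an integer pins the number of threads explicitly.
    df = wr.s3.read_parquet("s3://my-bucket/data/", dataset=True, use_threads=True)
    df = wr.s3.read_parquet("s3://my-bucket/data/", dataset=True, use_threads=8)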

  • It's not possible to limit the number of keys returned by wr.s3.list_objects using the MaxKeys keyword from list_objects_v2 or the MaxItems keyword from the ListObjectsV2 paginator. It's getting passed through, but I think it isn't compatible with the paginator implementation:

    https://github.com/awslabs/aws-data-wrangler/blob/066b81a5778cca7ce5ea8d889a2fec9824d8996a/awswrangler/s3/_list.py#L97

    You might consider making "PaginationConfig" configurable to the caller so that MaxItems can be set there, and/or maybe adding a flag to skip the paginator entirely if s3_additional_kwargs["MaxKeys"] < args["PaginationConfig"]["PageSize"].

    Using wrangler version 2.11.0 in a Python 3.8 Lambda.

    P.S. Please do not attach files as it's considered a security risk. Add code snippets directly in the message body as much as possible.
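
    Until that is supported, a workaround sketch that goes straight to the ListObjectsV2 paginator (bucket and prefix are placeholders):

    import boto3

    client = boto3.client("s3")
    paginator = client.get_paginator("list_objects_v2")

    # PaginationConfig["MaxItems"] caps the total number of keys returned.
    pages = paginator.paginate(
        Bucket="my-bucket",
        Prefix="my/prefix/",
        PaginationConfig={"MaxItems": 10},
    )
    keys = [obj["Key"] for page in pages for obj in page.get("Contents", [])]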

  • I'm not sure if this is expected behavior or a bug:

    It looks like the guilty party might be this line: the input dtype gets overwritten by the current one.

  • P.S. Don't attach files. Please prefer adding code snippets directly in the message body.
    My customer is using AWS Data Wrangler. They told me that if a column name contains special characters like a space or \n, it is replaced by an underscore ( _ ) when converting to Parquet format. They claim this was not the case earlier, that it started happening in the last 3 weeks, and that it is breaking their code. I am not able to confirm or deny it. I ran the code against version 2.3 and then against 2.11 and got the same result of special characters being replaced with underscores. I passed this feedback to the customer, but they still insist on their claim.

    Also, please let me know whether this is the default behavior for all special characters in column names, i.e. whether all special characters are replaced by underscores only when dealing with the Parquet format.
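
    For reference, the sanitization can be reproduced directly with the catalog helper (the sample column name is made up), assuming wr.catalog.sanitize_column_name is available in the installed version:

    import awswrangler as wr

    # Column names written to Glue/Parquet datasets are normalized; special
    # characters such as spaces and newlines end up as underscores.
    print(wr.catalog.sanitize_column_name("my col\nname"))  # e.g. my_col_name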

  • This is mostly WIP - I wasn't sure how to execute tests in this codebase, nor what the general guidelines were, so this just gets the ball rolling.

    Allows the user to specify a desired cursor when querying MySQL data;
    defaults to pymysql's default: https://pymysql.readthedocs.io/en/latest/modules/cursors.html#pymysql.cursors.Cursor

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
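
    For illustration (connection details are placeholders), this is the kind of pymysql cursor selection the change exposes; DictCursor returns rows as dictionaries instead of tuples:

    import pymysql
    import pymysql.cursors

    con = pymysql.connect(
        host="my-host",
        user="my-user",
        password="my-password",
        database="my_db",
        cursorclass=pymysql.cursors.DictCursor,
    )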

  • I'm trying to put a single numpy value into a DynamoDB table using aws-wrangler. To do this I'm using a Lambda function with wr.dynamodb.put_items(). When I try to insert the single value I get the following error: "Unsupported type class 'numpy.int64'".
    This is the Lambda code:
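
    A possible workaround (table and attribute names below are placeholders) is to cast the numpy scalar to a native Python type before calling put_items:

    import numpy as np
    import awswrangler as wr

    value = np.int64(42)

    # DynamoDB's serializer does not understand numpy scalar types, so convert
    # them to built-in Python types first.
    wr.dynamodb.put_items(
        items=[{"id": "item-1", "count": int(value)}],
        table_name="my_table",
    )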

  • Hi,
    I am using the postgresql.to_sql() API to load a pandas DataFrame into an AWS Aurora PostgreSQL database.
    I have added a dtype dict that converts all int columns to the bigint data type, but I still keep getting the
    'h' format requires -32768 <= number <= 32767 error message. What does this message mean? And is there better logging that can provide a detailed error message specifying the problematic column name?
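
    For what it's worth, 'h' is the struct format code for a signed 16-bit integer, so the error suggests a value is being bound to a SMALLINT column. A small diagnostic sketch (the sample frame stands in for the real one) to find the offending column before calling to_sql:

    import pandas as pd

    # Look for integer values outside the SMALLINT range, which is what the
    # "'h' format requires -32768 <= number <= 32767" error points at.
    df = pd.DataFrame({"small_col": [1, 2], "big_col": [100000, 200000]})
    for col in df.select_dtypes(include="number").columns:
        out_of_range = df[(df[col] < -32768) | (df[col] > 32767)]
        if not out_of_range.empty:
            print(col, "has values outside the SMALLINT range")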

  • When using Wrangler to write to S3 as a partitioned table, Wrangler removes records with None/NaN as the partition column value instead of using HIVE_DEFAULT_PARTITION (used by Spark/Athena/Hive for null values).

    Are there any workarounds to overcome this, so that records with null partition values aren't discarded while writing to a partitioned table?

    df = pd.DataFrame({'timestamp': ["1632524081000", 1632511081111, '1539826200000', '1539824400000', None, "junk"]})
    df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms', utc=True, errors='coerce').dt.tz_convert('Europe/Brussels').dt.tz_localize(None)
    df['date'] = df['timestamp'].dt.strftime('%Y-%m-%d')
    df['hour'] = df['timestamp'].dt.strftime('%H')

    wr.s3.to_parquet(
        df,
        path='s3://test-bucket/test_part_tbl',
        dataset=True,
        schema_evolution=True,
        compression='snappy',
        partition_cols=['date', 'hour'],
        database='test_db',
        table='test_part_tbl',
        use_threads=True
    )
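
    One possible workaround (the placeholder value below mirrors what Spark/Hive use) is to fill the null partition values before the to_parquet call above, so those rows are not dropped:

    # Run this before wr.s3.to_parquet: give null partition values an explicit
    # placeholder so the rows are kept instead of discarded.
    df['date'] = df['date'].fillna('__HIVE_DEFAULT_PARTITION__')
    df['hour'] = df['hour'].fillna('__HIVE_DEFAULT_PARTITION__')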

  • Caveats

    ⚠️ For platforms without PyArrow 5 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    New Functionalities

    Enhancements

    Documentation

    Bug Fix

    Thanks

    We thank the following contributors/users for their work on this release:


    P.S.
    The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

  • Caveats

    ⚠️ For platforms without PyArrow 5 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    New Functionalities

    Enhancements

    Documentation

    Bug Fix

    Thanks

    We thank the following contributors/users for their work on this release:


    P.S.
    The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

  • Caveats

    ⚠️ For platforms without PyArrow 4 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Enhancements

    Bug Fix

    Thanks

    We thank the following contributors/users for their work on this release:


    P.S.
    The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

  • Caveats

    ⚠️ For platforms without PyArrow 4 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Documentation

    Enhancements

    Bug Fix

    Thanks

    We thank the following contributors/users for their work on this release:


    P.S.
    The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

  • Caveats

    ⚠️ For platforms without PyArrow 4 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Documentation

    Enhancements

    Bug Fix

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @kukushking, @igorborgest, @gballardin, @eferm, @jaklan, @Falydoor, @chariottrider, @chriscugliotta, @konradsemsch, @gvermillion, @russellbrooks, @mshober.


    P.S.
    The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

  • Caveats

    ⚠️ For platforms without PyArrow 3 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Documentation

    New Functionalities

    Bug Fix

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @igorborgest, @mattboyd-aws, @vlieven, @bentkibler, @adarsh-chauhan, @impredicative, @nmduarteus, @JoshCrosby, @TakumiHaruta, @zdk123, @tuannguyen0901, @jiteshsoni, @luminita.


    P.S.
    The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run!

  • Caveats

    ⚠️ For platforms without PyArrow 3 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Enhancements

    Thanks

    We thank the following contributors/users for their work on this release:


    P.S.
    The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run!

  • Caveats

    ⚠️ For platforms without PyArrow 3 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Documentation

    Enhancements

    Thanks

    We thank the following contributors/users for their work on this release:


    P.S.
    The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run!

  • Caveats

    ⚠️ For platforms without PyArrow 3 support (e.g. EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Documentation

    New Functionalities

    Enhancements

    Bug Fix

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @danielwo, @jiteshsoni, @igorborgest, @njdanielsen, @eric-valente, @gvermillion, @zseder, @gdbassett, @orenmazor, @senorkrabs, @Natalie-Caruana, @dragonH, @nikwerhypoport, @hwangji.


    P.S.
    The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run!

  • New Functionalities

    Enhancements

    Bug Fix

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @danielwo, @jiteshsoni, @igorborgest, @njdanielsen, @eric-valente, @gvermillion, @zseder, @gdbassett, @orenmazor, @senorkrabs, @Natalie-Caruana.


    P.S.
    The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run!

  • New Functionalities

    Enhancements

    Bug Fix

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @danielwo, @jiteshsoni, @gvermillion, @rodalarcon, @imanebosch, @dwbelliston, @tochandrashekhar, @kylepierce, @njdanielsen, @jasadams, @gtossou, @JasonSanchez, @kokes, @hanan-vian, @igorborgest.


    P.S.
    The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run!

  • New Functionalities

    Bug Fix

    Thanks

    We thank the following contributors/users for their work on this release:


    P.S.
    Lambda Layer zip file and Glue wheel/egg files are available below. Just upload it and run!

  • New Functionalities

    Bug Fix

    Thanks

    We thank the following contributors/users for their work on this release:


    P.S.
    Lambda Layer zip file and Glue wheel/egg files are available below. Just upload it and run!

  • New Functionalities

    Enhancements

    Thanks

    We thank the following contributors/users for their work on this release:


    P.S.
    Lambda Layer zip file and Glue wheel/egg files are available below. Just upload it and run!

  • Breaking changes

    New Functionalities

    Enhancements

    Docs

    AWS re:Invent related news

    Thanks

    We thank the following contributors/users for their work on this release:


    P.S.
    Lambda Layer zip file and Glue wheel/egg files are available below. Just upload it and run!

  • New Functionalities

    Enhancements

    Bug Fix

    Docs

    Thanks

    We thank the following contributors/users for their work on this release:


    P.S.
    Lambda Layer zip file and Glue wheel/egg files are available below. Just upload it and run!

  • New Functionalities

    Enhancements

    Bug Fix

    Thanks

    We thank the following contributors/users for their work on this release:

    @martinSpears-ECS, @imanebosch, @Eric-He-98, @brombach, @Thomas-Hirsch, @vuchetichbalint, @igorborgest.


    P.S.
    Lambda Layer zip file and Glue wheel/egg files are available below. Just upload it and run!

  • Enhancements

    Bug Fix

    Thanks

    We thank the following contributors/users for their work on this release:


    P.S.
    Lambda Layer zip file and Glue wheel/egg files are available below. Just upload it and run!

  • Enhancements

    Bug Fix

    Docs

    Thanks

    We thank the following contributors/users for their work on this release:


    P.S.
    Lambda Layer zip file and Glue wheel/egg files are available below. Just upload it and run!

  • Enhancements

    Bug Fix

    Docs

    Thanks

    We thank the following contributors/users for their work on this release:

    @timgates42, @bvsubhash, @DonghanYang, @sl-antoinelaborde, @Xiangyu-C, @tuannguyen0901, @JPFrancoia, @sapientderek, @igorborgest.


    P.S.
    Lambda Layer zip file and Glue wheel/egg files are available below. Just upload it and run!

  • Bug Fix

    Thanks

    We thank the following contributors/users for their work on this release:


    P.S.
    Lambda Layer zip file and Glue wheel/egg files are available below. Just upload it and run!

  • Bug Fix

    Thanks

    We thank the following contributors/users for their work on this release:


    P.S.
    Lambda Layer zip file and Glue wheel/egg files are available below. Just upload it and run!

  • Enhancements

    Bug Fix

    Docs

    Thanks

    We thank the following contributors/users for their work on this release:


    P.S.
    Lambda Layer zip file and Glue wheel/egg files are available below. Just upload it and run!

  • Breaking changes

    New Functionalities

    Enhancements

    Bug Fix

    Docs

    Thanks

    We thank the following contributors/users for their work on this release:

    @isrsal, @bppont, @weishao-aws, @alexifm, @Digma, @samcon, @TerrellV, @msantino, @alvaropc, @luigift, @igorborgest.


    P.S.
    Lambda Layer zip file and Glue wheel/egg files are available below. Just upload it and run!

  • Bug Fix

    Docs

    Thanks

    We thank the following contributors/users for their work on this release:


    P.S.
    Lambda Layer zip file and Glue wheel file are available below. Just upload it and run!

  • New Functionalities

    Enhancements

    Bug Fix

    Docs

    Thanks

    We thank the following contributors/users for their work on this release:

    @Thiago-Dantas, @andre-marcos-perez, @ericct, @marcelo-vilela, @edvorkin, @nicholas-miles, @chrispruitt, @rparthas, @igorborgest.


    P.S.
    Lambda Layer zip file and Glue wheel file are available below. Just upload it and run!

  • Breaking changes

    New Functionalities

    Enhancements

    Bug Fix

    Docs

    Thanks

    We thank the following contributors/users for their work on this release:

    @kylepierce, @davidszotten, @meganburger, @erikcw, @JPFrancoia, @zacharycarter, @DavideBossoli88, @c-line, @anand086, @jasadams, @mrtns, @schot, @koiker, @flaviomax, @bryanyang0528, @igorborgest.


    P.S.
    Lambda Layer zip file and Glue wheel file are available below. Just upload it and run!

  • New Functionalities

    Enhancements

    Bug Fix

    Docs

    Thanks

    We thank the following contributors/users for their work on this release:


    P.S.
    Lambda Layer zip file and Glue wheel file are available below. Just upload it and run!

  • Enhancements


    P.S.
    Lambda Layer's zip-file and Glue's wheel/egg are available below. Just upload it and run!


    P.P.S.
    AWS Data Wrangler relies on compiled dependencies (C/C++), so there is no support for Glue PySpark for now (only Glue Python Shell).

  • Enhancements

    Bug Fix

    Docs


    P.S.
    Lambda Layer's zip-file and Glue's wheel/egg are available below. Just upload it and run!


    P.P.S.
    AWS Data Wrangler relies on compiled dependencies (C/C++), so there is no support for Glue PySpark for now (only Glue Python Shell).