Amazon AWS Glue ETL Now Generally Available

Amazon AWS announced the general availability of AWS Glue. Glue is a fully managed, serverless, and cloud-optimized extract, transform, and load (ETL) service. It differs from other ETL services and platforms in four important ways.

First, Glue is “serverless”: you don’t need to provision or manage any resources, and you pay only while Glue is actively running. Second, Glue provides crawlers that can automatically detect and infer schemas from many data sources and data types, and across various types of partitions. It stores these generated schemas in a centralized Data Catalog for editing, versioning, querying, and analysis. Third, Glue can automatically generate ETL scripts (in Python!) to translate your data from your source formats to your target formats. Finally, Glue lets you create development endpoints so that your developers can use their favorite toolchains to construct their ETL scripts.

Crawlers

Glue first uses a Crawler that detects data partitions and derives schema from S3 folders to build a table.
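
To make the idea of deriving partitions from S3 folders concrete, here is a minimal pure-Python sketch. It is not Glue's actual crawler logic; it only assumes Hive-style `name=value` folder names, and the function name `infer_partitions` and the sample keys are hypothetical.

```python
def infer_partitions(keys, table_prefix):
    """Collect Hive-style partition columns (name=value folders) under a prefix.

    Illustrative only: a real crawler also samples file contents to infer
    column types, which this sketch does not attempt.
    """
    partitions = {}
    for key in keys:
        if not key.startswith(table_prefix):
            continue
        relative = key[len(table_prefix):].strip("/")
        # Every path segment except the last one (the file name) is a folder.
        for folder in relative.split("/")[:-1]:
            if "=" in folder:
                name, value = folder.split("=", 1)
                partitions.setdefault(name, set()).add(value)
    # Sort the values so the result is deterministic.
    return {name: sorted(values) for name, values in partitions.items()}


# Hypothetical object keys under a "logs/" table folder.
sample_keys = [
    "logs/year=2017/month=07/part-0000.csv",
    "logs/year=2017/month=08/part-0000.csv",
    "logs/year=2017/month=08/part-0001.csv",
]
print(infer_partitions(sample_keys, "logs/"))
```

Each distinct folder name becomes a candidate partition column, and the observed values become the partitions of the resulting table.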

 

ETL Jobs

Tasks can be scheduled via a Jobs console. Automatically created mappings can be adjusted and the PySpark code tweaked.

 

Glue generates PySpark scripts to transform data based on the selections made in the console. On the left, a diagram of the ETL flow is shown. On the top right, a series of buttons lets you add annotated data sources and targets, transforms, and other features.
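
The column mappings in a generated script are typically written as four-item tuples of (source name, source type, target name, target type). The sketch below imitates that idea in plain Python so it can run outside Glue. The `apply_mapping` helper, the `CASTS` table, and the sample record are hypothetical and are not part of the Glue libraries.

```python
# Hypothetical type casts; only the target types used below are covered.
CASTS = {"string": str, "int": int, "long": int, "double": float}


def apply_mapping(record, mappings):
    """Rename and cast the mapped fields of one record, dropping the rest."""
    out = {}
    for source_name, _source_type, target_name, target_type in mappings:
        if source_name in record:
            out[target_name] = CASTS[target_type](record[source_name])
    return out


# Mappings in the (source, source type, target, target type) shape.
mappings = [
    ("id", "string", "customer_id", "long"),
    ("amt", "string", "amount", "double"),
]

row = {"id": "42", "amt": "19.99", "unused": "x"}
mapped = apply_mapping(row, mappings)
print(mapped)
```

Adjusting a mapping then amounts to editing a tuple: change the target name to rename a column, or the target type to cast it.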

Development Endpoints and Notebooks

A Development Endpoint is an environment used to develop and test Glue scripts.

Once an endpoint is provisioned, ETL developers can spin up an Apache Zeppelin notebook server by going to Actions and clicking Create notebook server. They can then SSH into the server or connect to the notebook to test the ETL script interactively.

Pricing and Documentation

AWS Glue pricing information is available here. Glue crawlers, ETL jobs, and development endpoints are all billed in Data Processing Unit hours (DPU-Hours), metered by the minute. Each DPU-Hour costs $0.44 in us-east-1. A single DPU provides 4 vCPUs and 16 GB of memory.
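
As a rough worked example of the billing arithmetic, the sketch below estimates a job's cost from its DPU count and runtime. It assumes only the $0.44 us-east-1 rate and per-minute metering quoted above; the function name `estimate_cost` is hypothetical, and any minimum billing duration or other pricing rules are not modeled.

```python
import math

DPU_HOUR_PRICE_USD = 0.44  # us-east-1 rate quoted in this post


def estimate_cost(dpus, runtime_seconds, price=DPU_HOUR_PRICE_USD):
    """Estimate cost in USD, rounding the runtime up to whole minutes."""
    billed_minutes = math.ceil(runtime_seconds / 60)
    dpu_hours = dpus * billed_minutes / 60
    return round(dpu_hours * price, 4)


# A hypothetical job using 10 DPUs for 15 minutes: 10 * 0.25 h * $0.44.
print(estimate_cost(10, 15 * 60))
```

Under these assumptions, a 10-DPU job that runs for 15 minutes consumes 2.5 DPU-Hours and costs about $1.10.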

Additional information is available in the AWS Glue documentation and service FAQs. The product team also released aws-glue-libs, a set of utilities for connecting to and working with Glue, along with samples at aws-glue-samples.