This tutorial only covers setting up a local development environment for Python. I recommend Anaconda for a cleaner development environment.
AWS Glue is a serverless AWS service for ETL (Extract, Transform, Load).
Install the Anaconda Python Environment
1. Visit the Anaconda downloads page.
Go to the following link: https://www.anaconda.com/products/distribution
2. Select Linux
On the downloads page, select the Linux operating system, right-click on 64-Bit (x86) Installer (581 MB), and choose Copy link address.
3. Use wget to download the bash installer
Now that the link to the bash installer (.sh file) is on the clipboard, use wget to download the installer script. In a terminal, cd into the home directory and make a new directory called setup. cd into setup and use wget to download the installer, then run it with the bash or sh command.
cd ~
mkdir setup
cd setup
wget https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh
bash Anaconda3-2021.11-Linux-x86_64.sh
Follow the installer's prompts. Near the end of the installation, you should see the following prompt:
Do you wish the installer to initialize Anaconda3 by running conda init? [yes|no] [no] >>> yes
Type yes and press Enter to initialize Anaconda.
Next, reload your shell configuration so that the changes made by conda init take effect:
cd ~
source ~/.bashrc
4. Set Up the Glue Python Environment in Anaconda
Our project uses Glue 3.0, which works well with Python 3.7. We will create a conda environment named glue with Python 3.7.3.
conda create --name glue python==3.7.3
After the environment is created successfully, activate it and install the necessary libraries.
conda activate glue
pip install boto3
pip install pytest
To activate the conda environment automatically, add the source activate glue command to the end of .bash_profile or .profile. These files belong to your user, so sudo is not needed to edit them.
cd ~
nano .profile
Add this command:
source activate glue
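Once the environment is active, a quick sanity check can confirm that the interpreter and the two packages installed above are in place. The following is a minimal sketch, not part of the original setup; the function names `is_glue3_python` and `missing_modules` are made up for this example.

```python
import importlib.util
import sys


def is_glue3_python(version_info=sys.version_info):
    """Return True when the interpreter is Python 3.7, the version this guide pairs with Glue 3.0."""
    return tuple(version_info[:2]) == (3, 7)


def missing_modules(names):
    """Return the names from `names` that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # boto3 and pytest are the two packages installed with pip in this step.
    print("Python:", sys.version.split()[0], "| 3.7 as expected:", is_glue3_python())
    print("Missing packages:", missing_modules(["boto3", "pytest"]) or "none")
```

Saved under a name of your choice (for example check_env.py, a hypothetical file name), it can be run with python check_env.py inside the glue environment.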
Install AWS CLI
Install according to the instructions here: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html
Install Glue Local Development With Glue 3.0
Complete some prerequisite steps and then use AWS Glue utilities to test and submit your Python ETL script.
Prerequisites for Local Python Development
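1. Install some packages (the webupd8team/java PPA has been discontinued; if adding it fails, skip that command, since openjdk-8-jdk is normally available from the standard Ubuntu repositories):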
sudo add-apt-repository ppa:webupd8team/java
sudo apt install openjdk-8-jdk
sudo apt install zip
2. Create a glue folder in the home directory and download some libraries:
cd ~
mkdir glue
- Clone the AWS Glue Python repository from GitHub (https://github.com/awslabs/aws-glue-libs) and check out the glue-3.0 branch:
cd ~
cd glue
git clone https://github.com/awslabs/aws-glue-libs
cd aws-glue-libs
git checkout glue-3.0
- Install Apache Maven:
cd ~
cd glue
wget https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-common/apache-maven-3.6.0-bin.tar.gz
tar xzvf apache-maven-3.6.0-bin.tar.gz
rm -rf apache-maven-3.6.0-bin.tar.gz
- Install the Apache Spark distribution:
cd ~
cd glue
wget https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-3.0/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz
tar xzvf spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz
rm -rf spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz
3. Configure Environment Variables
cd into the home directory and add the following commands to .bash_profile or .profile:
cd ~
nano .profile
Add at the end of the file:
GLUE_DEV=$HOME/glue
PATH=$GLUE_DEV/apache-maven-3.6.0/bin:$PATH
PATH=$GLUE_DEV/aws-glue-libs/bin:$PATH
export PATH
export SPARK_HOME=$GLUE_DEV/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3
export SPARK_LOCAL_IP=127.0.0.1
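After opening a new terminal (or sourcing the profile file), it can help to confirm that the variables above were picked up. The sketch below is one possible check, assuming the directory names used in this guide; the function name `glue_env_problems` is made up for this example.

```python
import os


def glue_env_problems(env):
    """Return human-readable problems found in the mapping `env` (for example os.environ)."""
    problems = []
    spark_home = env.get("SPARK_HOME", "")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    elif not os.path.isdir(spark_home):
        problems.append(f"SPARK_HOME does not point to a directory: {spark_home}")
    if not env.get("SPARK_LOCAL_IP"):
        problems.append("SPARK_LOCAL_IP is not set")
    # PATH should contain the Maven and aws-glue-libs bin folders added in the profile.
    path_entries = env.get("PATH", "").split(os.pathsep)
    for fragment in ("apache-maven-3.6.0/bin", "aws-glue-libs/bin"):
        if not any(entry.rstrip("/").endswith(fragment) for entry in path_entries):
            problems.append(f"PATH has no entry ending in {fragment}")
    return problems


if __name__ == "__main__":
    found = glue_env_problems(os.environ)
    print("\n".join(found) if found else "Glue environment variables look fine")
```

The function takes the environment as a parameter rather than reading os.environ directly, which keeps it easy to test with a plain dictionary.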
4. Fix some issues:
- MySQL driver: download mysql-connector-java-8.0.29.jar, then copy the .jar file into spark-3.1.1-amzn-0-bin-3.2.1-amzn-3/jars/
- import imp fails when running tests: open the file /home/<your user>/glue/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/pyspark/cloudpickle.py and change import imp to import importlib. This path names a Spark 2.4.3 directory; if your Spark directory is the one installed above, look for cloudpickle.py under its python/pyspark folder instead.
- netty error: open the file /home/<your user>/glue/aws-glue-libs/bin/glue-setup.sh and add rm -rf $ROOT_DIR/jarsv1/netty-* below line 19:
# Run mvn copy-dependencies target to get the Glue dependencies locally
mvn -f $ROOT_DIR/pom.xml -DoutputDirectory=$ROOT_DIR/jarsv1 dependency:copy-dependencies
rm -rf $ROOT_DIR/jarsv1/netty-*
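To confirm that the netty cleanup took effect after glue-setup.sh has run, the jarsv1 folder can be inspected from Python. This is only a sketch: the helper names `match_netty` and `leftover_netty_jars` are invented here, and the pattern mirrors the netty-* glob used by the rm command.

```python
import fnmatch
import os


def match_netty(file_names):
    """Return the names matching the shell pattern netty-*, sorted alphabetically."""
    return sorted(fnmatch.filter(file_names, "netty-*"))


def leftover_netty_jars(jars_dir):
    """List netty-* entries still present in jars_dir; an empty list means the cleanup worked."""
    return match_netty(os.listdir(jars_dir))


if __name__ == "__main__":
    # Assumes the folder layout from this guide: ~/glue/aws-glue-libs/jarsv1
    jars_dir = os.path.expanduser("~/glue/aws-glue-libs/jarsv1")
    if os.path.isdir(jars_dir):
        print(leftover_netty_jars(jars_dir) or "no netty jars left")
    else:
        print(f"{jars_dir} does not exist yet; run one of the Glue commands first")
```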
Running Your Python ETL Script
- With the AWS Glue jar files available for local development, you can run the AWS Glue Python package locally.
- Use the following utilities and frameworks to test and run your Python script.
Utility | Command | Description |
---|---|---|
AWS Glue Shell | gluepyspark | Enter and run Python scripts in a shell that integrates with AWS Glue ETL libraries. |
AWS Glue Submit | gluesparksubmit | Submit a complete Python script for execution. |
Pytest | gluepytest | Write and run unit tests of your Python code. The pytest module must be installed and available in the PATH. |
- Usage:
gluesparksubmit <script.py>
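As an illustration of what a script passed to gluesparksubmit might look like, here is a minimal sketch. The file name sample_job.py, the function `clean_name`, and the in-memory sample rows are all made up for this example; a real job would read from the Data Catalog or S3. The Glue and Spark imports sit inside `main()` so that the pure-Python logic can be unit tested even where those libraries are not importable.

```python
"""sample_job.py: a hypothetical minimal job for trying out gluesparksubmit."""
import importlib.util


def clean_name(raw):
    """Trim surrounding whitespace and title-case a name (pure Python, easy to unit test)."""
    return raw.strip().title()


def main():
    # Imported here so that clean_name can be tested without pyspark or awsglue.
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    glue_context = GlueContext(SparkContext.getOrCreate())
    spark = glue_context.spark_session

    # Tiny in-memory dataset standing in for real source data.
    rows = [(1, "  alice "), (2, "BOB")]
    cleaned = [(row_id, clean_name(name)) for row_id, name in rows]
    spark.createDataFrame(cleaned, ["id", "name"]).show()


if __name__ == "__main__":
    # Only start Spark when both libraries can be found by the import system.
    if importlib.util.find_spec("awsglue") and importlib.util.find_spec("pyspark"):
        main()
    else:
        print("awsglue or pyspark is not importable; run this file with gluesparksubmit")
```

With the setup above in place, it would be run as gluesparksubmit sample_job.py. Because `clean_name` has no Spark dependency, a pytest test for it can be run with gluepytest or with plain pytest.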