Set Up AWS Glue on Linux and Windows WSL

This tutorial only covers setting up Glue for Python development. I recommend Anaconda for a better development environment.

AWS Glue is a serverless AWS service for ETL (Extract, Transform, Load).

Install the Anaconda Python Environment

1. Visit the Anaconda downloads page.

Go to the following link: https://www.anaconda.com/products/distribution

2. Select Linux

On the downloads page, select the Linux operating system, right-click 64-Bit (x86) Installer (581 MB), and choose Copy link address.

3. Use wget to download the bash installer

Now that the link to the bash installer (.sh file) is on the clipboard, use wget to download the installer script. In a terminal, cd into the home directory and create a new directory called setup. Then cd into setup, download the installer with wget, and run it with bash.

cd ~
mkdir setup
cd setup
wget https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh
bash Anaconda3-2021.11-Linux-x86_64.sh


Follow the installer's instructions. Near the end of the installation, you should see the following prompt:

Do you wish the installer to initialize Anaconda3 by running conda init? [yes|no] [no] >>> yes

Type yes and press Enter to initialize Anaconda.
Next, reload your shell configuration so that the conda command becomes available:

cd ~
source ~/.bashrc

4. Set up the Glue Python environment in Anaconda

Our project uses Glue 3.0, which works well with Python 3.7. We will create a conda environment named glue with Python 3.7.3.

conda create --name glue python==3.7.3

After the env is created successfully, we will install some necessary libraries.

conda activate glue
pip install boto3
pip install pytest
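
As a quick sanity check after these installs, the short Python sketch below prints the interpreter version and reports whether boto3 and pytest can be imported in the active environment. The helper name missing_packages is hypothetical and exists only for this illustration.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # The glue env in this tutorial is created with Python 3.7.3.
    print("Python version:", ".".join(str(part) for part in sys.version_info[:3]))
    missing = missing_packages(["boto3", "pytest"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("boto3 and pytest are importable")
```

Run it with python from inside the activated glue environment; if a package is listed as missing, repeat the matching pip install command.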

To activate the conda env automatically in new shells, add the source activate glue command at the end of .bash_profile or .profile:

cd ~
nano .profile

Add this line at the end of the file:

source activate glue

    Install AWS CLI

    Install according to the instructions here: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html

    Install Glue Local Development With Glue 3.0

    Complete some prerequisite steps and then use AWS Glue utilities to test and submit your Python ETL script.

    Prerequisites for Local Python Development

    1. Install the required packages:

    sudo apt update
    sudo apt install openjdk-8-jdk
    sudo apt install zip

    2. Create a glue folder in the home directory and download the required libraries:

    cd ~
    mkdir glue

    cd ~
    cd glue
    git clone https://github.com/awslabs/aws-glue-libs
    cd aws-glue-libs
    git checkout glue-3.0
    • Install Apache Maven:
    cd ~
    cd glue
    wget https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-common/apache-maven-3.6.0-bin.tar.gz
    tar xzvf apache-maven-3.6.0-bin.tar.gz
    rm -rf apache-maven-3.6.0-bin.tar.gz
    • Install the Apache Spark distribution:
    cd ~
    cd glue
    wget https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-3.0/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz
    tar xzvf spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz
    rm -rf spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz

    3. Configure Environment Variables

    Change to the home directory and open .bash_profile or .profile:
    cd ~
    nano .profile
    Add the following lines at the end of the file:
    GLUE_DEV=$HOME/glue
    PATH=$GLUE_DEV/apache-maven-3.6.0/bin:$PATH
    PATH=$GLUE_DEV/aws-glue-libs/bin:$PATH
    export PATH
    export SPARK_HOME=$GLUE_DEV/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3
    export SPARK_LOCAL_IP=127.0.0.1
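
After reloading the profile (for example with source ~/.profile), the settings above can be checked with a small Python sketch like the one below. It looks only at values a child process can see: the exported SPARK_HOME and SPARK_LOCAL_IP variables, and whether mvn and the Glue utilities resolve on PATH. GLUE_DEV is not exported in the snippet above, so it is not checked. The function name check_glue_env is hypothetical.

```python
import os
import shutil


def check_glue_env(env, which=shutil.which):
    """Return a list of human-readable problems found in the environment `env`."""
    problems = []
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    elif not os.path.isdir(spark_home):
        problems.append("SPARK_HOME does not point to a directory: " + spark_home)
    if env.get("SPARK_LOCAL_IP") != "127.0.0.1":
        problems.append("SPARK_LOCAL_IP is not 127.0.0.1")
    # mvn comes from apache-maven-3.6.0/bin, the glue* tools from aws-glue-libs/bin.
    for tool in ("mvn", "gluepyspark", "gluesparksubmit", "gluepytest"):
        if which(tool, path=env.get("PATH", "")) is None:
            problems.append(tool + " was not found on PATH")
    return problems


if __name__ == "__main__":
    found = check_glue_env(os.environ)
    print("\n".join(found) if found else "Glue environment looks OK")
```

Passing the environment in as a parameter keeps the check easy to try against a plain dictionary before touching the real shell configuration.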

    4. Fix some known issues:

      • MySQL driver: download mysql-connector-java-8.0.29.jar, then copy the .jar file into spark-3.1.1-amzn-0-bin-3.2.1-amzn-3/jars/
      • "import imp" failure when running tests: open the file /home/<your user>/glue/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/pyspark/cloudpickle.py and change import imp to import importlib (this path names a Spark 2.4.3 directory; adjust it to your own Spark directory)
      • netty error: open the file /home/<your user>/glue/aws-glue-libs/bin/glue-setup.sh and add rm -rf $ROOT_DIR/jarsv1/netty-* below line 19, so that the section reads:
        # Run mvn copy-dependencies target to get the Glue dependencies locally
        mvn -f $ROOT_DIR/pom.xml -DoutputDirectory=$ROOT_DIR/jarsv1 dependency:copy-dependencies
        rm -rf $ROOT_DIR/jarsv1/netty-*

      Running Your Python ETL Script

      • With the AWS Glue jar files available for local development, you can run the AWS Glue Python package locally.
      • Use the following utilities and frameworks to test and run your Python script.
      Utility | Command | Description
      AWS Glue Shell | gluepyspark | Enter and run Python scripts in a shell that integrates with AWS Glue ETL libraries.
      AWS Glue Submit | gluesparksubmit | Submit a complete Python script for execution.
      Pytest | gluepytest | Write and run unit tests of your Python code. The pytest module must be installed and available in the PATH.
      • Usage: gluesparksubmit <script.py>
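
To try the utilities above, a minimal script along these lines may help. It is only a sketch: the file name sample_job.py, the function normalize_record, and the sample rows are all made up for illustration. The Glue and Spark imports sit inside main() so that the pure-Python helper can also be imported by unit tests without starting Spark.

```python
def normalize_record(record):
    """Strip and lower-case the keys of one record (pure Python, no Spark needed)."""
    return {key.strip().lower(): value for key, value in record.items()}


def main():
    try:
        # These imports resolve only when the AWS Glue libraries are on the
        # Python path, e.g. when the script is launched with gluesparksubmit.
        from awsglue.context import GlueContext
        from pyspark.context import SparkContext
    except ImportError:
        print("AWS Glue libraries not found; run this script with gluesparksubmit")
        return

    glue_context = GlueContext(SparkContext.getOrCreate())
    spark = glue_context.spark_session

    # Made-up sample rows, used only to show the flow.
    raw = [{" ID ": 1, "Name": "alpha"}, {" ID ": 2, "Name": "beta"}]
    records = [normalize_record(row) for row in raw]
    columns = list(records[0].keys())
    frame = spark.createDataFrame(
        [tuple(rec[col] for col in columns) for rec in records], columns
    )
    frame.show()


if __name__ == "__main__":
    main()
```

Saved as sample_job.py, it would be submitted with gluesparksubmit sample_job.py. A test of normalize_record placed in a separate test file could then be run with gluepytest.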
