Spark Operator container Image for Amazon EKS

This is how to create the necessary docker images to run Spark on Amazon EKS (Elastic Kubernetes Service) using Spark on k8s Operator. This is because the provided images for Hadoop 3 did not work out of the box with the IAM role associated with the Service Account. This is necessary for example for reading and writing to S3.

Create Amazon ECR repositories

I store the base images in Amazon ECR but you can do the same with a different container registry if you wish.

In the AWS Management Console, navigate to Elastic Container Registry and create 2 repositories.

Screenshot 2021-08-11 at 17.46.26.png

The repo names are expected to have specific names

<namespace>/spark
<namespace>/spark-py

I used spark-operator as the namespace.

Screenshot 2021-08-11 at 17.43.22.png

I choose to create public repositories, but you can create private repositories as well.

For public repos, you can log in to ECR with

aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws

for private repos, I login with

aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <account id>.dkr.ecr.<region>.amazonaws.com

Build Spark base image

I clone the Spark repository

git clone git@github.com:apache/spark.git

I checkout a specific version

git checkout v3.1.1

I build the project with a specific Hadoop version (3.3.1 in this case). It is important to build it with Kubernetes support by including the corresponding flag.

./build/mvn -Pkubernetes -Dhadoop.version=3.3.1 -DskipTests clean package

The Hadoop version dictates the version of hadoop-aws, which is also 3.3.1 in this case
This in turn dictates the version of aws-java-sdk-bundle, which is 1.11.901.
A recent version of the AWS SDK is needed so that it supports the com.amazonaws.auth.WebIdentityTokenCredentialsProvider.

I build and tag the docker image (including the python profile)

./bin/docker-image-tool.sh -r public.ecr.aws/z2m5w4m3/spark-operator -t v3.1.1-hadoop3.3.1 -p ./resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/python/Dockerfile build

I push the images

./bin/docker-image-tool.sh -r public.ecr.aws/z2m5w4m3/spark-operator -t v3.1.1-hadoop3.3.1 push

and the image tags appear on ECR

Screenshot 2021-08-11 at 17.48.56.png

Build Spark application image

Finally, I build on top of the Spark base image a new docker image that additionally includes

the correct version of hadoop-aws library
the correct version of aws-java-sdk-bundle
your application code

For a PySpark application, here is an example Dockerfile, where application.py is stored beside the Dockerfile

ARG SPARK_VERSION=3.1.1
ARG HADOOP_VERSION=3.3.1

FROM ubuntu:bionic as downloader

ARG HADOOP_VERSION=3.3.1
ARG AWS_SDK_BUNDLE_VERSION=1.11.901

RUN apt-get update && apt-get install -y \
  wget \
  && rm -rf /var/lib/apt/lists/*

RUN wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar -P /tmp/spark-jars/
RUN wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${AWS_SDK_BUNDLE_VERSION}/aws-java-sdk-bundle-${AWS_SDK_BUNDLE_VERSION}.jar -P /tmp/spark-jars/

FROM public.ecr.aws/z2m5w4m3/spark-operator/spark-py:v${SPARK_VERSION}-hadoop${HADOOP_VERSION}

USER root

COPY --from=downloader /tmp/spark-jars/* $SPARK_HOME/jars/
COPY application.py /opt/spark-job/application.py

Similarly to the base image, you can push this to a private repository in ECR.

Running a Spark Job

I create a manifest file my-spark-app.yaml

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: my-pyspark-app
  namespace: default
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "<URI of private ECR repository with docker image containing the spark application>"
  imagePullPolicy: Always
  mainApplicationFile: "local:///opt/spark-job/application.py"
  sparkVersion: "3.1.1"
  hadoopConf:
    fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
    fs.s3a.aws.credentials.provider: com.amazonaws.auth.WebIdentityTokenCredentialsProvider
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 3.1.1
    serviceAccount: spark
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 3.1.1

where the image should be set to the URI of your private ECR repo that holds the image from the previous section. It is important not to forget to include

spec:
  hadoopConf:
    fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
    fs.s3a.aws.credentials.provider: com.amazonaws.auth.WebIdentityTokenCredentialsProvider

The spark Service Account has an associated IAM role that permits access to S3 or other AWS resources. Creating an EKS cluster and Service Accounts can be easily done with AWS CDK. See this post for more details, or at this github repo.

Finally, I apply the manifest

kubectl apply -f my-spark-app.yaml

Conclusion

It is possible to use IAM roles to write to S3 from a Spark Job running with the Spark Operator. These roles are associated to a Service Account. For this, it is necessary to include a recent version of aws-java-sdk-bundle, which requires to build the Spark docker image from the source, with the necessary version of Hadoop.

Deep Dive - Codiply.com