PySpark projects on k8s
Using the freshly pushed Docker image that contains the Spark Python image as a base, we can build a new Docker image that adds our custom code. Here we build a PySpark app that reads data from MinIO.
PySpark app reading data from MinIO
from pyspark import SparkContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()


def load_config(spark_context: SparkContext):
    # Point the S3A connector at the in-cluster MinIO service.
    hadoop_conf = spark_context._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3a.access.key", "console")
    hadoop_conf.set("fs.s3a.secret.key", "console123")
    hadoop_conf.set("fs.s3a.endpoint", "minio-1641612822.minio.svc.cluster.local:9000")
    # MinIO serves buckets as URL paths, not as subdomains.
    hadoop_conf.set("fs.s3a.path.style.access", "true")
    # The endpoint above is plain HTTP, so SSL is disabled.
    hadoop_conf.set("fs.s3a.connection.ssl.enabled", "false")
    hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")


load_config(spark.sparkContext)

# Read the orders from the bucket and print the average of the "amount" column.
df = spark.read.json('s3a://merobucket/orders.json')
average = df.agg({'amount': 'avg'})
average.show()
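The app above hard-codes the MinIO credentials and sets them through the private `_jsc` handle. As a hedged alternative, the sketch below passes the same S3A settings as Spark properties with the `spark.hadoop.` prefix, which Spark copies into the Hadoop configuration, and leaves the credentials to the caller. The function names `s3a_spark_options` and `build_session` are hypothetical helpers for illustration, not part of PySpark.

```python
def s3a_spark_options(endpoint, access_key, secret_key, use_ssl=False):
    """Return the S3A settings from the app above as Spark properties."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        "spark.hadoop.fs.s3a.path.style.access": "true",
        # Spark properties are strings, so the flag becomes "true" or "false".
        "spark.hadoop.fs.s3a.connection.ssl.enabled": str(use_ssl).lower(),
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }


def build_session(options):
    """Create a SparkSession with the given properties applied."""
    # Imported here so the helper above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

In `main.py` the values could then come from environment variables, for example `build_session(s3a_spark_options(os.environ["MINIO_ENDPOINT"], os.environ["MINIO_ACCESS_KEY"], os.environ["MINIO_SECRET_KEY"]))`. These variable names are assumptions; one way to set them on the driver pod is through `spark.kubernetes.driverEnv.*` properties at submit time.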
Packaging the PySpark app using a Dockerfile
Building & pushing the PySpark app Docker image
docker build -t registry.logpoint.com.np/sparkapp:0.0.1 .
docker push registry.logpoint.com.np/sparkapp:0.0.1
Running the Spark app on the Kubernetes cluster
spark-submit \
--master k8s://https://192.168.2.55:6443 \
--deploy-mode cluster \
--name spark-pi \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.executor.instances=1 \
--conf spark.kubernetes.container.image=registry.logpoint.com.np/sparkapp:0.0.1 \
local:///opt/spark/work-dir/main.py
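/opt/spark/work-dir is the working directory of the PySpark base image, so our Spark application is copied into that directory when the image is built. The local:// scheme tells spark-submit that main.py already exists inside the container, so the application runs directly from this directory.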
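To make the relationship between the image, the API server address, and the in-image application path explicit, the submit command can be assembled programmatically. This is a minimal sketch: `spark_submit_args` is a hypothetical helper, and it assumes the image tag is the one pushed in the build step (`sparkapp:0.0.1`).

```python
import shlex


def spark_submit_args(master, image, app_path, name="spark-pi",
                      service_account="spark", executors=1):
    """Build the spark-submit argument list for a cluster-mode run on Kubernetes."""
    conf = {
        "spark.kubernetes.authenticate.driver.serviceAccountName": service_account,
        "spark.executor.instances": str(executors),
        "spark.kubernetes.container.image": image,
    }
    args = ["spark-submit", "--master", master,
            "--deploy-mode", "cluster", "--name", name]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    # local:// means the file already exists inside the container image.
    args.append(f"local://{app_path}")
    return args


args = spark_submit_args(
    master="k8s://https://192.168.2.55:6443",
    image="registry.logpoint.com.np/sparkapp:0.0.1",
    app_path="/opt/spark/work-dir/main.py",
)
# Print the command as a single shell-quoted line.
print(shlex.join(args))
```

The resulting list can be passed to `subprocess.run(args, check=True)` on a machine where `spark-submit` is on the PATH and has access to the cluster.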