Notebooks, Big Data and Spark Jobs in OCI — Best Practices and Examples

David Allan
9 min read · Dec 9, 2023


In the ever-evolving landscape of technology, the synergy between powerful data processing, collaborative analysis, and cloud computing has become a cornerstone for innovation. This blog explores the integration of notebooks commonly used for data science and big data on Oracle Cloud Infrastructure (OCI), illustrating how this combination unlocks new possibilities for businesses seeking advanced data analytics and efficient application development. The post (also available on Oracle blogs here) will introduce how to execute PySpark and Scala at scale from notebooks in OCI along with best practices to efficiently process that data.

Notebooks have transformed the way data scientists, analysts, and developers collaborate and conduct exploratory data analysis. With their interactive interface, support for multiple programming languages, and real-time visualization capabilities, notebooks provide a versatile platform for experimenting with code, visualizing results, and sharing insights seamlessly. This same functionality can be used whilst designing big data analysis using OCI’s Data Flow service. The exponential growth of data necessitates robust solutions for processing and extracting meaningful insights. Big data technologies, such as Apache Spark, enable the efficient handling of massive datasets. Leveraging the scalability and parallel processing capabilities of these frameworks, organizations can derive valuable insights from their data, paving the way for data-driven decision-making.

Notebooks in OCI Data Science

A Data Science notebook has its own compute, which hosts the kernel that does the work. OCI provides kernels that add the ability to use Spark compute clusters (provisioned via OCI Data Flow), so you can build notebooks that analyze and visualize data on large Spark clusters. Notebooks that run Spark statements (whether Scala or PySpark) need a compute cluster to process the Spark cells. The repository below contains a number of examples for executing PySpark, Scala, Spark SQL, and PySpark with your own Conda environment (useful for Data Science/AI/LLM use cases), plus visualizations, that you can clone within OCI Data Science, configure with your compartment, and run.

https://github.com/davidallan/oci_df_notebooks/tree/main

You can clone this git repo in your Data Science notebook and explore!
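
For example, from a terminal in your notebook session (assuming git is available in your environment), cloning the repository looks like this:

git clone https://github.com/davidallan/oci_df_notebooks.git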

Let's look at how that is done. First we will look at creating a cluster session, then using an existing cluster session, and finally stopping a cluster session.

Create Cluster Session for Interactive Notebook

The following command creates a Spark compute cluster for an interactive Python notebook. The create_session command returns an identifier that you can save and reuse for as long as the session is running; for example, you could work for some time and then reconnect to the cluster using use_session.

# PYSPARK Cluster Session - replace with your values:
# compartment id for creating the compute/application in OCI Data Flow
# bucket for writing OCI Data Flow logs
# namespace - your OCI Object Storage namespace
# driver_shape, e.g. VM.Standard.E4.Flex
# executor_shape, e.g. VM.Standard.E4.Flex
command = {
    "compartmentId": "<your_compartment_id>",
    "displayName": "Data Analysis in Python",
    "language": "PYTHON",
    "sparkVersion": "3.2.1",
    "driverShape": "<driver_shape>",
    "executorShape": "<executor_shape>",
    "driverShapeConfig": {"ocpus": 1, "memoryInGBs": 16},
    "executorShapeConfig": {"ocpus": 1, "memoryInGBs": 16},
    "numExecutors": 1,
    "type": "SESSION",
    # "poolId": "poolId",
    # "privateEndpointId": "privateEndpointId",
    "logsBucketUri": "oci://<bucket>@<namespace>/",
    "configuration": {
        "dataflow.auth": "resource_principal",
        "spark.oracle.datasource.enabled": "true",
        "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
        "spark.executorEnv.OCI_JAVASDK_JERSEY_CLIENT_DEFAULT_CONNECTOR_ENABLED": "true",
        "spark.driverEnv.OCI_JAVASDK_JERSEY_CLIENT_DEFAULT_CONNECTOR_ENABLED": "true"
    }
}
%create_session -l python -c $command

If you want to use PySpark with your own Conda environment, perform the same substitutions as above and additionally add your Conda .tar.gz in the spark.archives property:

# PYSPARK and CONDA Cluster Session - replace with your values:
# compartment id for creating the compute/application in OCI Data Flow
# bucket for writing OCI Data Flow logs
# namespace - your OCI Object Storage namespace
# driver_shape, e.g. VM.Standard.E4.Flex
# executor_shape, e.g. VM.Standard.E4.Flex
# spark.archives value - your Conda env .tar.gz (the value should be suffixed with #conda)
command = {
    "compartmentId": "<your_compartment_id>",
    "displayName": "Data Analysis in Python",
    "language": "PYTHON",
    "sparkVersion": "3.2.1",
    "driverShape": "<driver_shape>",
    "executorShape": "<executor_shape>",
    "driverShapeConfig": {"ocpus": 1, "memoryInGBs": 16},
    "executorShapeConfig": {"ocpus": 1, "memoryInGBs": 16},
    "numExecutors": 1,
    "type": "SESSION",
    # "poolId": "poolId",
    # "privateEndpointId": "privateEndpointId",
    "logsBucketUri": "oci://<bucket>@<namespace>/",
    "configuration": {
        "spark.archives": "oci://bucket@namespace/yourcondapath.tar.gz#conda",
        "dataflow.auth": "resource_principal",
        "spark.oracle.datasource.enabled": "true",
        "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
        "spark.executorEnv.OCI_JAVASDK_JERSEY_CLIENT_DEFAULT_CONNECTOR_ENABLED": "true",
        "spark.driverEnv.OCI_JAVASDK_JERSEY_CLIENT_DEFAULT_CONNECTOR_ENABLED": "true"
    }
}
%create_session -l python -c $command

If you want to use Scala, simply change the language in the command and in the session magic as follows:

# SPARK SCALA Cluster Session - replace with your values:
# compartment id for creating the compute/application in OCI Data Flow
# bucket for writing OCI Data Flow logs
# namespace - your OCI Object Storage namespace
# driver_shape, e.g. VM.Standard.E4.Flex
# executor_shape, e.g. VM.Standard.E4.Flex
command = {
    "compartmentId": "<your_compartment_id>",
    "displayName": "Data Analysis in SCALA",
    "language": "SCALA",
    "sparkVersion": "3.2.1",
    "driverShape": "<driver_shape>",
    "executorShape": "<executor_shape>",
    "driverShapeConfig": {"ocpus": 1, "memoryInGBs": 16},
    "executorShapeConfig": {"ocpus": 1, "memoryInGBs": 16},
    "numExecutors": 1,
    "type": "SESSION",
    # "poolId": "poolId",
    # "privateEndpointId": "privateEndpointId",
    "logsBucketUri": "oci://<bucket>@<namespace>/",
    "configuration": {
        "dataflow.auth": "resource_principal",
        "spark.oracle.datasource.enabled": "true",
        "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
        "spark.executorEnv.OCI_JAVASDK_JERSEY_CLIENT_DEFAULT_CONNECTOR_ENABLED": "true",
        "spark.driverEnv.OCI_JAVASDK_JERSEY_CLIENT_DEFAULT_CONNECTOR_ENABLED": "true"
    }
}
%create_session -l scala -c $command

Attach to existing Cluster Session for Interactive Notebook

To connect to an existing session, use the use_session command within the notebook as shown below (replace your_session_id with your own):

%use_session -s <your_session_id>

Stop Cluster Session for Interactive Notebook

To stop an existing session, use the stop_session command within the notebook as shown below:

%stop_session

Now that you have seen how to start, attach to, and stop your Spark cluster session, let's look at how to execute some Spark from a notebook in Data Science. You can execute Spark Scala or PySpark commands from your notebook on the cluster using the %%spark magic command. It's quite simple.
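
For example, here is a minimal sketch of a PySpark cell run against the cluster session created above; it assumes, as in the repository's sample notebooks, that the remote SparkSession is available as spark inside %%spark cells:

%%spark
# build a tiny DataFrame on the Data Flow cluster and aggregate it
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "value"])
df.groupBy("value").count().show()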

You should also follow these best practices to get optimal performance.

Best Practice 1 — Optimizing Conda Packs / Dependencies

Whether you are running small Spark jobs or large Data Science jobs using LLMs, you should always optimize your projects. This best practice applies to Spark applications as well as to use via notebooks. Workloads often involve complex environments with numerous dependencies. In Data Science, Conda is a popular package manager that simplifies managing these dependencies, but it also makes it easy to depend on far too much and bloat your applications. As your project grows, so does the size of your Conda environment; does that sound like a common theme? A large Conda pack impacts the startup time of Spark jobs in OCI Data Flow that use Conda packs, because time is spent uploading, downloading, and distributing the pack to the driver and all executors; the more executors, the more time you spend on this activity, and for smaller jobs this overhead is an even larger percentage of the total run time. It is recommended to use smaller Conda packs, installing only the necessary dependencies.

How can you do this?

Conda environments are isolated spaces where packages and dependencies can be installed without affecting the system or other environments. Each environment consumes disk space, and larger environments lead to slower deployments and increased resource usage. The same applies when Conda environments are used from OCI Data Flow for your Spark jobs: ensure your dependencies are as lean as possible.

Common approaches

1. Use Miniconda:

- Miniconda is a minimal Conda installer that allows you to install only the packages you need, reducing the overall size of your Conda distribution (see the sketch after this list).

- This lightweight alternative to the full Anaconda distribution is ideal for optimizing data science jobs.

2. Create Lean Environments:

- Only install packages that are essential for your project. Be selective about the libraries and tools you include in your environment.
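
As a quick sketch of the Miniconda route (the walkthrough below uses the full Anaconda installer instead; the URL is Anaconda's standard Miniconda download location):

# download and run the Miniconda installer rather than the full Anaconda distribution
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod u+x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh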

You can create an empty Conda pack (https://conda.github.io/conda-pack) and install only the minimum dependencies required for your Spark Data Science notebooks and jobs. There are also instructions in the OCI Data Flow documentation for creating Conda environments using Docker; you can create the Conda .tar.gz file and upload it to OCI Object Storage to reference in your applications. Below are the instructions I used to create a minimal Conda environment, where I added numpy as an example.

docker run -it --entrypoint /bin/bash --name conda_builder oraclelinux:7-slim

curl -O https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh
chmod u+x Anaconda3-2022.05-Linux-x86_64.sh
./Anaconda3-2022.05-Linux-x86_64.sh
# accept the license, accept the default install location (inside the container) and initialize Anaconda
source ~/.bashrc
conda create -n mypython3.8 python=3.8
conda activate mypython3.8

# install your dependencies - for example numpy
pip install numpy
# conda-pack is assumed not to be present already; install it for the pack step below
pip install conda-pack
# create your conda gz file
conda pack -f -o mypython3.8.tar.gz

In another terminal, copy the file created inside the container to your local file system:
docker cp conda_builder:/mypython3.8.tar.gz .

Exit the container above
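
As a sketch, assuming the OCI CLI is installed and configured and using an illustrative object name, you could then upload the archive to Object Storage and reference it in spark.archives as oci://<bucket>@<namespace>/conda/mypython3.8.tar.gz#conda:

oci os object put --bucket-name <bucket> --file mypython3.8.tar.gz --name conda/mypython3.8.tar.gz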

Let's explore strategies to minimize Conda size, enhancing the efficiency of your data science workflows:

  • Clean up unused packages: regularly review and remove packages that are no longer needed. Use the `conda clean` command to remove cached packages and free up disk space.
  • Monitor and manage environment size: regularly check the size of your Conda environments, especially before deploying projects. Tools like `conda list --revisions` can help identify when changes were made to the environment and what impact they had on size.
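
For example (illustrative commands, run with the relevant environment active):

# remove cached package tarballs, index caches and unused packages to free disk space
conda clean --all
# review the revision history of the active environment
conda list --revisions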

What’s the performance impact?

Comparing a small Spark application using a minimalist Conda environment of about 90 MB with the pyspark Conda environment of approximately 2.2 GB, you can see anywhere from 120s to 480s of overhead when the larger Conda environment is used. If you are running small jobs this overhead can be substantial, so make sure you address it if it's a concern.

Minimizing Conda size is crucial for optimizing data science jobs. By adopting these strategies, you can create lean and efficient environments that enhance the reproducibility, portability, and speed of your data science workflows. Remember, a thoughtful and streamlined Conda environment not only benefits your current project but also makes collaboration and deployment smoother in the long run.

Best Practice 2 — Optimizing Object Storage Access

If you are building Spark applications (batch, streaming, or notebooks) on OCI Data Flow that read from or write to OCI Object Storage, ensure you are using the following properties:

  • spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version = 2
  • spark.executorEnv.OCI_JAVASDK_JERSEY_CLIENT_DEFAULT_CONNECTOR_ENABLED=true
  • spark.driverEnv.OCI_JAVASDK_JERSEY_CLIENT_DEFAULT_CONNECTOR_ENABLED=true

These will significantly increase the performance of your applications. When running the Terasort benchmark with these properties enabled, internal tests showed performance improvements in the range of 600%. You can then leverage the scaling features of OCI Data Flow to easily match your SLA; with Terasort, for example, you can increase cores to gain substantial further improvements in performance.

For example, these properties are set in the configuration property as shown below.

# Replace with your values:
# compartment id for creating the compute/application in OCI Data Flow
# bucket for writing OCI Data Flow logs
# namespace - your OCI Object Storage namespace
# driver_shape, e.g. VM.Standard.E4.Flex
# executor_shape, e.g. VM.Standard.E4.Flex
command = {
    "compartmentId": "<your_compartment_id>",
    "displayName": "Data Analysis in Python",
    "language": "PYTHON",
    "sparkVersion": "3.2.1",
    "driverShape": "<driver_shape>",
    "executorShape": "<executor_shape>",
    "driverShapeConfig": {"ocpus": 1, "memoryInGBs": 16},
    "executorShapeConfig": {"ocpus": 1, "memoryInGBs": 16},
    "numExecutors": 1,
    "type": "SESSION",
    # "poolId": "poolId",
    # "privateEndpointId": "privateEndpointId",
    "logsBucketUri": "oci://<bucket>@<namespace>/",
    "configuration": {
        "dataflow.auth": "resource_principal",
        "spark.oracle.datasource.enabled": "true",
        "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
        "spark.executorEnv.OCI_JAVASDK_JERSEY_CLIENT_DEFAULT_CONNECTOR_ENABLED": "true",
        "spark.driverEnv.OCI_JAVASDK_JERSEY_CLIENT_DEFAULT_CONNECTOR_ENABLED": "true"
    }
}
%create_session -l python -c $command

You would then see these properties under the OCI Data Flow application's configuration.

Make sure you check and apply this best practice; it will significantly increase the performance of applications that interact with OCI Object Storage.

Best Practice 3 — Use Resource Principal Authentication

This best practice is primarily useful when using OCI Data Flow from notebooks. Use the resource principal authentication property to optimize authentication; this saves around 60s on cluster initialization. If you do this, you will have to add policy statements for any resources you access as the resource principal (see the policies section in the documentation; an illustrative statement follows the property below). This is done using the following property:

  • dataflow.auth=resource_principal
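
For illustration only, a policy along these lines allows Data Flow runs to manage objects in a bucket; adapt the compartment and bucket names, and check the Data Flow policies documentation for the exact statements your resources require:

allow any-user to manage objects in compartment <your_compartment> where ALL {request.principal.type = 'dataflowrun', target.bucket.name = '<bucket>'}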

You can see how this is done in the configuration property below.

# Replace with your values:
# compartment id for creating the compute/application in OCI Data Flow
# bucket for writing OCI Data Flow logs
# namespace - your OCI Object Storage namespace
# driver_shape, e.g. VM.Standard.E4.Flex
# executor_shape, e.g. VM.Standard.E4.Flex
command = {
    "compartmentId": "<your_compartment_id>",
    "displayName": "Data Analysis in Python",
    "language": "PYTHON",
    "sparkVersion": "3.2.1",
    "driverShape": "<driver_shape>",
    "executorShape": "<executor_shape>",
    "driverShapeConfig": {"ocpus": 1, "memoryInGBs": 16},
    "executorShapeConfig": {"ocpus": 1, "memoryInGBs": 16},
    "numExecutors": 1,
    "type": "SESSION",
    "logsBucketUri": "oci://<bucket>@<namespace>/",
    "configuration": {
        "dataflow.auth": "resource_principal",
        "spark.oracle.datasource.enabled": "true",
        "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
        "spark.executorEnv.OCI_JAVASDK_JERSEY_CLIENT_DEFAULT_CONNECTOR_ENABLED": "true",
        "spark.driverEnv.OCI_JAVASDK_JERSEY_CLIENT_DEFAULT_CONNECTOR_ENABLED": "true"
    }
}
%create_session -l python -c $command

This is useful because the Data Science notebook session is already using a resource principal, and we can hint to the OCI Data Flow service to use it on initialization (today this has to be done explicitly).

Conclusion

These best practices really help with modern, large workloads, from building LLM models to big data processing. By adopting these strategies, you can create lean and efficient environments that enhance the reproducibility, portability, and speed of your data science workflows. Remember, a thoughtful and streamlined environment not only benefits your current project but also makes collaboration and deployment smoother in the long run.

References

https://docs.oracle.com/en-us/iaas/data-science/using/use-notebook-sessions.htm

https://blogs.oracle.com/dataintegration/post/notebooks-big-data-and-spark-jobs-in-oci

David Allan

Architect at @Oracle. The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.