Multi-Cloud: Copying Data from Azure Data Lake to Oracle’s OCI Object Storage using OCI Data Flow

David Allan
6 min read · Jun 13, 2023


In today’s digital landscape, organizations often find themselves working with multiple cloud providers to leverage the best features and capabilities offered by each platform. This multi-cloud approach enables businesses to distribute their workloads, optimize costs, and avoid vendor lock-in. In this blog post, we will explore a real-world scenario where we copy data from Azure Data Lake to Oracle’s OCI Object Storage using OCI Data Flow, showcasing the power of a multi-cloud architecture.

To begin, let’s familiarize ourselves with the key components of our architecture:

1. Azure Data Lake: Microsoft Azure’s scalable and secure cloud-based storage system designed for big data processing. It allows businesses to store and analyze massive amounts of data in various formats.

2. Oracle OCI Object Storage: Oracle Cloud Infrastructure’s Object Storage offering, which provides a highly scalable and durable cloud-based storage solution. It enables organizations to store, process, and analyze vast amounts of data using various data processing technologies.

3. OCI Data Flow: OCI Data Flow is a fully managed, serverless data processing service provided by Oracle Cloud Infrastructure. It allows you to run Apache Spark applications without managing the underlying infrastructure, making it an ideal choice for data transformation and processing. Apache Spark, a popular open-source big data processing framework, can be deployed in a serverless fashion.

Copying Data from Azure Data Lake to OCI Object Storage

Now, let’s dive into the step-by-step process of copying data from Azure Data Lake to OCI Object Storage using OCI Data Flow:

Step 1: Configure Azure Data Lake Access:

Ensure that you have appropriate access and permissions to read data from your Azure Data Lake. Set up the necessary credentials, such as an Azure Storage Account and Azure AD authentication, to establish connectivity. In my environment I created a service principal for the demo to access the data; you can see below that I have configured the specific role needed to access it.

An Azure service principal is an identity created for use with applications, hosted services, and automated tools to access Azure resources like Data Lake. The access is restricted by the roles assigned to the service principal, giving you control over which resources can be accessed and at which level.

To access Azure Data Lake we will use the Hadoop Filesystem driver that is compatible with Azure Data Lake Storage Gen2; it is known by its scheme identifier abfs (Azure Blob File System). Consistent with other Hadoop Filesystem drivers, the ABFS driver employs a URI format to address files and directories within a Data Lake Storage Gen2 capable account.

Below I am going to use the most verbose way of defining the URL, for completeness. In a Data Lake Storage Gen2 capable account, the shorthand URI syntax is:

abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file_name>
  1. Scheme identifier: The abfs protocol is used as the scheme identifier. If you add an s at the end (abfss) then the ABFS Hadoop client driver will always use Transport Layer Security (TLS) irrespective of the authentication method chosen.
  2. File system: The parent location that holds the files and folders. This is the same as containers in the Azure Storage Blob service.
  3. Account name: The name given to your storage account during creation.
  4. Paths: A forward slash delimited (/) representation of the directory structure.
  5. File name: The name of the individual file. This parameter is optional if you’re addressing a directory.
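For example, a hypothetical file system (container) named raw in a storage account named demodatalake, holding a file at landing/sales/sales.csv, would be addressed as:

abfss://raw@demodatalake.dfs.core.windows.net/landing/sales/sales.csv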

Check the Azure documentation for options to simplify the URL.

Step 2: Set Up OCI Object Storage:

Create a bucket in OCI Object Storage in Oracle Cloud Infrastructure and set up the appropriate policies so that your OCI Data Flow application can write data into this bucket. This will serve as the destination for our data.
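For example, assuming your Data Flow runs execute with the permissions of an IAM group named dataflow-users and the destination bucket is named target-bucket in a compartment named demo (all placeholder names), a policy statement along these lines grants write access; see the OCI Data Flow policy documentation for the full set of statements the service itself requires:

allow group dataflow-users to manage objects in compartment demo where target.bucket.name = 'target-bucket'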

Step 3: Prepare Spark Application:

Develop a Spark application using the programming language of your choice (e.g., Scala, Python) to read data from Azure Data Lake. Leverage the Azure Blob Storage connector or appropriate libraries to interact with Azure Data Lake and retrieve the required data; below we use ABFSS to connect with Azure (you can see the library versions that are used by OCI Data Flow here). The client secret is stored in OCI Vault, so store the secret there; the rest of the properties I have added to the code, and they could be parameterized (the Azure storage_account_name, container_name, client_id and so on, along with OCI properties such as bucket_name).
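As a minimal sketch of such an application (the account, container, bucket, namespace, paths and secret OCID below are placeholders, and CSV in / Parquet out are just example formats), the PySpark reads the client secret from OCI Vault with the OCI Python SDK, configures the ABFS driver for OAuth client-credentials authentication and copies the data across:

import base64

import oci
from pyspark.sql import SparkSession

# Placeholder values for illustration; in a real application these would be parameterized
storage_account_name = "demodatalake"        # Azure storage account
container_name = "raw"                       # Azure Data Lake container (file system)
client_id = "<azure-application-client-id>"  # Azure AD application (service principal) id
tenant_id = "<azure-ad-tenant-id>"           # Azure AD tenant id
secret_ocid = "<client-secret-ocid>"         # OCID of the client secret stored in OCI Vault
bucket_name = "target-bucket"                # OCI Object Storage destination bucket
namespace = "<object-storage-namespace>"     # OCI Object Storage namespace

def get_secret_from_vault(secret_id):
    # In OCI Data Flow the resource principal signer is available; fall back to the
    # local OCI config file when testing outside of Data Flow (e.g. OCI Code Editor).
    try:
        signer = oci.auth.signers.get_resource_principals_signer()
        secrets_client = oci.secrets.SecretsClient(config={}, signer=signer)
    except Exception:
        secrets_client = oci.secrets.SecretsClient(oci.config.from_file())
    bundle = secrets_client.get_secret_bundle(secret_id).data
    return base64.b64decode(bundle.secret_bundle_content.content).decode("utf-8")

spark = SparkSession.builder.appName("azure_to_oci").getOrCreate()

# Configure the ABFS driver to authenticate with the service principal (OAuth client credentials)
client_secret = get_secret_from_vault(secret_ocid)
suffix = f"{storage_account_name}.dfs.core.windows.net"
spark.conf.set(f"fs.azure.account.auth.type.{suffix}", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{suffix}",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{suffix}", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{suffix}", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{suffix}",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

# Read from Azure Data Lake Storage Gen2 (CSV used here purely as an example)
source_path = f"abfss://{container_name}@{suffix}/landing/sales/"
df = spark.read.option("header", "true").csv(source_path)

# Write to OCI Object Storage using the oci:// filesystem available in OCI Data Flow
target_path = f"oci://{bucket_name}@{namespace}/sales/"
df.write.mode("overwrite").parquet(target_path)

spark.stop()

When run in OCI Data Flow the oci:// paths work without extra configuration; the abfss:// filesystem comes from the hadoop-azure JARs added in the next step.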

Step 4: Configure OCI Data Flow:

Using the OCI Console or APIs, create an OCI Data Flow application — you can also test the above code using OCI Code Editor from within OCI Console or within OCI Data Science notebooks! Specify any necessary configuration parameters, including the dependent application jar file, input data source (Azure Data Lake), and output data sink (OCI Object Storage).

The PySpark above is dependent on the Microsoft Azure Data Lake and Hadoop JAR files, specifically these versions right now. Add these to your packages.txt file:

https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure/3.3.1/hadoop-azure-3.3.1.jar
https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure-datalake/3.3.1/hadoop-azure-datalake-3.3.1.jar
https://repo1.maven.org/maven2/org/codehaus/jackson/jackson-mapper-asl/1.9.9/jackson-mapper-asl-1.9.9.jar

The PySpark also depends on the OCI Python SDK for the retrieval of the secret from the OCI Vault. To use this you will also need a Python requirements.txt file that includes the OCI SDK:

oci

Follow the instructions here to build an archive.zip file and upload it to OCI Object Storage (see the instructions here for building a zip with the third-party dependencies). Make sure you use the same Python version as the one you will run with, otherwise you will get errors like ModuleNotFoundError: No module named 'oci'!

You are now ready to run your PySpark application. Below I am creating an OCI Data Flow application within the OCI Console; I have defined the name, the Spark version (3.2.1, which uses Python 3.8) and the compute shapes for this demo.

I then define where the PySpark file resides, the bucket and object name (azure_to_oci.py), and also the archive.zip location containing the Azure Hadoop and Data Lake JARs.
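The same application can also be created programmatically. Here is a minimal sketch using the OCI Python SDK; the compartment OCID, code bucket, namespace and shapes are placeholder values chosen for illustration, so adjust them to match your tenancy:

import oci

# Assumes a local OCI config file; inside OCI a resource principal signer could be used instead
config = oci.config.from_file()
df_client = oci.data_flow.DataFlowClient(config)

details = oci.data_flow.models.CreateApplicationDetails(
    compartment_id="<compartment-ocid>",                        # placeholder
    display_name="azure_to_oci",
    language="PYTHON",
    spark_version="3.2.1",
    driver_shape="VM.Standard2.1",                              # example shape
    executor_shape="VM.Standard2.1",
    num_executors=2,
    file_uri="oci://<code-bucket>@<namespace>/azure_to_oci.py", # the PySpark file
    archive_uri="oci://<code-bucket>@<namespace>/archive.zip",  # the dependency archive
)
application = df_client.create_application(details).data
print(application.id)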

Step 5: Launch and Monitor the OCI Data Flow Job:

You can then run the application and monitor it in the application Runs panel.

You’ll see the application run go through the phases from accepted to running to succeeded.

Submit the OCI Data Flow job, triggering the execution of the Spark application. OCI Data Flow will automatically provision and manage the necessary resources, including the compute instances, required for processing the data.

Monitor the progress and status of the OCI Data Flow job using the provided monitoring and logging tools. Once the job completes successfully, validate the copied data in the OCI Object Storage to ensure the integrity of the transfer.
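Runs can also be submitted and polled programmatically rather than from the console. Here is a small sketch with the OCI Python SDK, where the compartment and application OCIDs are placeholders:

import time

import oci

config = oci.config.from_file()
df_client = oci.data_flow.DataFlowClient(config)

run_details = oci.data_flow.models.CreateRunDetails(
    compartment_id="<compartment-ocid>",     # placeholder
    application_id="<application-ocid>",     # placeholder
    display_name="azure_to_oci_run",
)
run = df_client.create_run(run_details).data

# Poll the run until it reaches a terminal state
while True:
    run = df_client.get_run(run.id).data
    print(run.lifecycle_state)
    if run.lifecycle_state in ("SUCCEEDED", "FAILED", "CANCELED"):
        break
    time.sleep(30)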

This is easy to do and now you can read and write to/from Azure Data Lake.

Conclusion:

Building a multi-cloud architecture allows organizations to harness the unique strengths of different cloud platforms. In this blog post, we explored the process of copying data from Azure Data Lake to Oracle’s OCI Object Storage using OCI Data Flow. By leveraging the power of serverless computing and managed data processing services, businesses can seamlessly move data between clouds, enabling them to take advantage of the best-in-class features and capabilities of each cloud provider. This flexibility and agility play a crucial role in meeting the evolving demands of modern businesses.

Check out this OCI reference architecture on how to Implement a multicloud data lake integration architecture for more information.

Like what you see? Please check out the OCI Data Flow architecture and see what you can do with OCI Data Flow here:

https://www.oracle.com/big-data/data-flow/


David Allan

Architect at @Oracle. The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.