Spark Parallel Export from Databases to Object Storage using OCI Data Integration

In this blog we will look at using the OCI Data Flow application from this post here from within OCI Data Integration. The application supports parallel export from databases into partitioned data in Object Storage using Apache Spark. We will see how you can use the task in OCI Data Integration to orchestrate the export in a data pipeline or schedule it on a recurring basis.

After the OCI Data Flow application has been created, create a task in OCI Data Integration.

Create an OCI Data Flow task.

You can define a name and description for the task, then select the OCI Data Flow application you have created by selecting the Edit button.

Define the name, description and the application to run.

In the dialog for selecting the OCI Data Flow application, select the compartment where your application resides and then select the application.

Select the OCI Data Flow application

You can configure the parameters for the application. These include the shape for the driver and the executors and the number of executors. There are also parameters for enabling auto scaling in the Spark configuration.

In the arguments list you can specify all of the information for the application you have published previously.

In OCI Data Integration you have to specify these values in the Arguments property, separated by commas.

The comma is required. The information tooltip for the property has the details: "Arguments that will be used to invoke the main class. Define parameters using Spark shell syntax. To pass multiple arguments use comma for separating the arguments." It is also useful to add a parameter for this property, so that you can run the same task for different inputs and outputs.
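As a rough illustration of the comma rule, here is a minimal Python sketch that joins a list of argument values into the single string the Arguments property expects. The helper name `build_arguments` and the sample values are hypothetical. The real values and their order are whatever the application you published expects.

```python
def build_arguments(values):
    """Join argument values into one comma-delimited string.

    Rejects empty values and values that contain a comma, since a comma
    inside a value would be read as an argument separator.
    """
    cleaned = []
    for value in values:
        text = str(value).strip()
        if not text:
            raise ValueError("empty argument value")
        if "," in text:
            raise ValueError(f"argument contains a comma: {text!r}")
        cleaned.append(text)
    return ",".join(cleaned)


# Hypothetical values, for illustration only.
arguments = build_arguments([
    "SRC_SCHEMA.ORDERS",
    "oci://my-bucket@my-namespace/exports/orders",
    "8",
])
print(arguments)
```

Building the string in one place like this makes it easier to supply it as a task parameter and to catch a stray comma before the run starts.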

You can also specify Spark configuration properties such as spark.sql.files.maxRecordsPerFile (see documentation here). This is useful if you are producing data for some downstream system that cannot handle large files.
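To get a feel for what spark.sql.files.maxRecordsPerFile does to the output, the following Python sketch estimates how many files a write produces. It assumes each non-empty write task produces at least one file and starts a new file whenever the record limit is reached. The function name and the row counts are made up for illustration, and the estimate ignores any extra splitting caused by partition columns.

```python
import math


def estimate_file_count(rows_per_task, max_records_per_file):
    """Estimate the number of output files for a Spark write.

    rows_per_task: number of rows handled by each write task.
    max_records_per_file: value of spark.sql.files.maxRecordsPerFile;
        zero or a negative value means no limit.
    """
    non_empty = [rows for rows in rows_per_task if rows > 0]
    if max_records_per_file <= 0:
        # No limit: one file per non-empty task.
        return len(non_empty)
    return sum(math.ceil(rows / max_records_per_file) for rows in non_empty)


# Hypothetical export: three tasks, limit of one million records per file.
print(estimate_file_count([2_500_000, 2_500_000, 1_000_000], 1_000_000))
```

A lower limit means more, smaller files, which is the trade-off to weigh against what the downstream system can handle.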

In addition, if you want to enable auto scaling, define the properties with the values you require (the properties beginning with spark.dynamicAllocation and spark.dataflow).
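As a sketch of what such a set of properties could look like, the Python helper below assembles the standard Apache Spark dynamic allocation settings as name/value pairs. The helper name and the executor counts are hypothetical. The spark.dataflow properties are specific to OCI Data Flow and are deliberately left out here, so check the OCI Data Flow documentation for their names and allowed values.

```python
def autoscaling_properties(min_executors, max_executors):
    """Return Spark dynamic allocation properties as name/value strings."""
    if min_executors < 0:
        raise ValueError("min_executors must not be negative")
    if max_executors < min_executors:
        raise ValueError("max_executors must be >= min_executors")
    return {
        # Standard Apache Spark property names.
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


# Hypothetical range of 1 to 8 executors.
for name, value in autoscaling_properties(1, 8).items():
    print(f"{name}={value}")
```

Each printed pair corresponds to one property name and value to enter in the Spark configuration of the task.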

Of course, as the post mentioned, you can also specify partitioning information for the target in the Arguments value. You can then partition by a value such as a date, for example, rather than purely by size.
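To show what partitioning by a value looks like on the target, here is a small Python sketch that groups rows into the column=value directory layout Spark uses when writing partitioned output. It only mimics the layout and does not call Spark. The column name, the rows, and the base path are hypothetical, and whether the application partitions this way depends on the arguments you pass.

```python
from collections import defaultdict


def partition_directories(rows, column, base_path):
    """Count rows per column=value directory under base_path."""
    counts = defaultdict(int)
    for row in rows:
        counts[f"{base_path}/{column}={row[column]}"] += 1
    return dict(counts)


# Hypothetical rows partitioned by an order date column.
rows = [
    {"order_id": 1, "order_date": "2023-01-01"},
    {"order_id": 2, "order_date": "2023-01-01"},
    {"order_id": 3, "order_date": "2023-01-02"},
]
layout = partition_directories(rows, "order_date", "exports/orders")
for directory, count in sorted(layout.items()):
    print(directory, count)
```

With a layout like this, a downstream reader can pick up a single date by reading one directory instead of scanning the whole export.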

With that, our task is defined! We need to ensure that the policies for the workspace resource give it access to our OCI Data Flow application. See the policies in the OCI Data Integration documentation here. When the policies are defined you should be able to successfully execute, schedule and orchestrate your task.

Hopefully there is some useful information for you here. Let me know what you think, I would love to hear. These examples were related to Oracle, but all other JDBC-accessible sources can be used. Let me know if there are other ones you are interested in.

David Allan

Architect at @Oracle developing cloud services for data. Connect on Twitter @i_m_dave