Publish to OCI Data Flow using Data Integration SDKs

David Allan
3 min read · Feb 4, 2021

There’s a great blog here giving an overview of publishing OCI Data Integration tasks to the OCI Data Flow service. Here we will look at some tooling that makes it much easier!

OCI Data Integration provides an easy-to-use graphical interface to design data flows and tasks in a no-code approach, which enables business users, ETL developers, and data engineers with a deep understanding of the data to develop their own data integration tasks without requiring any technical knowledge. OCI Data Flow lets you run your own Spark scripts; when running an application on OCI Data Flow, you can select the Oracle Cloud Infrastructure compute shapes used for the driver and the executors, as well as the number of executors.

When a set of tasks is designed and ready to be published, users have a number of choices for publishing them to OCI Data Flow. You can deploy from the OCI Console …

Then enter the information about where this is to be published: which OCI Data Flow application to create, which shape to use as the default, which Object Storage bucket to deploy the application JAR to, and so on.

There’s a better way to do this: automated and at scale. For example, using the Python SDK it’s simple to read all the tasks from a project and deploy them.

The example below uses the create_external_publication function (see the documentation here):

import oci

config = oci.config.from_file()
dip = oci.data_integration.DataIntegrationClient(config)

wsid = "ENTER_DI_WORKSPACE_OCID_HERE"
project_key = "ENTER_DI_PROJECT_KEY_HERE"
appcompid = "ENTER_APPLICATION_COMPARTMENT_OCID_HERE"

# Default driver/executor shapes for the published OCI Data Flow application
rc = oci.data_integration.models.ResourceConfiguration(
    spark_version="2.4.4",
    driver_shape="VM.Standard2.2",
    executor_shape="VM.Standard2.2",
    total_executors=1)

# OCI Data Flow - location for application to be published in Object Storage
data_asset = oci.data_integration.models.DataAssetFromObjectStorageDetails(
    key="ENTER_DI_DATA_ASSET_KEY_HERE")
connection = oci.data_integration.models.ConnectionFromObjectStorage(
    key="ENTER_DI_CONNECTION_KEY_HERE")
schema = oci.data_integration.models.Schema(
    key="dataref:ENTER_CONNECTION_KEY/ENTER_BUCKET_NAME")
comp = "ENTER_BUCKET_COMPARTMENT_OCID_HERE"
cdt = oci.data_integration.models.ConfigurationDetails(
    data_asset=data_asset, connection=connection, schema=schema,
    compartment_id=comp)

# List all tasks in the project and publish each one to OCI Data Flow
tsklist = dip.list_tasks(workspace_id=wsid, folder_id=project_key,
                         fields="metadata").data.items
for tsk in tsklist:
    print("    Publishing task to OCI Data Flow: \t" + tsk.name)
    details = oci.data_integration.models.CreateExternalPublicationDetails(
        application_compartment_id=appcompid, display_name=tsk.name,
        resource_configuration=rc, configuration_details=cdt)
    extpub = dip.create_external_publication(
        wsid, tsk.key, create_external_publication_details=details)
    print(extpub.data)
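
Once a task has been submitted for publication, you may want to confirm it succeeded before moving on. The sketch below is an assumption-laden starting point: it assumes the external publication resource exposes a status field (values such as PUBLISHING, SUCCESSFUL, FAILED) and that get_external_publication takes the workspace ID, task key, and publication key; check the SDK documentation before relying on it.

# Minimal sketch (hypothetical polling helper; verify the status values
# and the get_external_publication signature against the SDK docs).
import time

def wait_for_publication(dip, workspace_id, task_key, publication_key,
                         timeout_seconds=600, poll_seconds=15):
    # Poll the external publication until it leaves the PUBLISHING state.
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        pub = dip.get_external_publication(
            workspace_id, task_key, publication_key).data
        if pub.status != "PUBLISHING":
            return pub
        time.sleep(poll_seconds)
    raise TimeoutError("Publication did not complete in time")

# Example usage inside the loop above:
# pub = wait_for_publication(dip, wsid, tsk.key, extpub.data.key)
# print(pub.status)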

This uses the CreateExternalPublicationDetails class, which captures all the information for the publication.

This simple piece of code captures the publication to OCI Data Flow in a script that can then be used to publish your tasks in an automated manner; change the shape you need for the job and the details about where it’s to be published, and you’re done!
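
If some tasks need more horsepower than others, you can also vary the ResourceConfiguration per task before building CreateExternalPublicationDetails. The snippet below is only an illustration on top of the script above; the name-based rule is hypothetical, so substitute whatever convention or lookup fits your project.

# Hypothetical per-task override: the "LARGE" naming convention is just an
# example; replace it with your own rule or a lookup table.
def resource_config_for(task_name):
    if "LARGE" in task_name.upper():
        return oci.data_integration.models.ResourceConfiguration(
            spark_version="2.4.4",
            driver_shape="VM.Standard2.4",
            executor_shape="VM.Standard2.4",
            total_executors=4)
    return rc  # fall back to the default configuration defined earlier

# In the publishing loop:
# details = oci.data_integration.models.CreateExternalPublicationDetails(
#     application_compartment_id=appcompid, display_name=tsk.name,
#     resource_configuration=resource_config_for(tsk.name),
#     configuration_details=cdt)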

Choose the right tool for the job: SDKs can make life easy and automation simple.

Conclusion

Here we have seen how easily an OCI Data Integration task can be published to OCI Data Flow. This capability helps you gather all your Spark jobs together and gives you finer control over your driver and executor shapes. You can also automate the process of publishing your task and then run it in OCI Data Flow right after it’s ready, as sketched below.
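
To kick off the published application programmatically, you can use the OCI Data Flow SDK. The sketch below assumes you already have the OCID of the Data Flow application created by the publish step (visible in the OCI Console); it is a minimal example under those assumptions, not the full publish-and-run pipeline.

# Minimal sketch: run a published application with the OCI Data Flow SDK.
# The compartment and application OCIDs are placeholders you would fill in.
import oci

config = oci.config.from_file()
df = oci.data_flow.DataFlowClient(config)

run_details = oci.data_flow.models.CreateRunDetails(
    compartment_id="ENTER_APPLICATION_COMPARTMENT_OCID_HERE",
    application_id="ENTER_DATA_FLOW_APPLICATION_OCID_HERE",
    display_name="run-published-di-task")

run = df.create_run(run_details).data
print(run.id, run.lifecycle_state)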

For more how-tos and interesting reads, check out Oracle Cloud Infrastructure Data Integration blogs.

Written by David Allan

Architect at @Oracle. The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.
