Best Practices in OCI Data Integration

David Allan
Aug 28, 2023


Note the following best practices when designing data integrations.

Workspace resource and Application lifecycle

Create separate workspaces for development, test, and production, and create each workspace in its own compartment; this lets you follow security best practices. Copy/synchronize tasks across the workspaces to help secure and control your solution. While many applications can be copied/synchronized within a workspace, scaling and security are performed at the workspace level. See the documentation here for more information and this blog on automating it.

Export and Import your Artifacts

Objects in OCI Data Integration can be exported and imported between workspaces; this applies to design objects in projects as well as published tasks in applications. The export and import are performed on a project, an application, or items within them (see the documentation here or the blog post here).

Create one Data Asset for a Database

Keep your data assets organized and name them consistently so they are easy to find and understand. If you have different uses for a data asset, use multiple connections on the single data asset rather than creating additional data assets.

Use default staging in Autonomous Data Assets

You must use default staging in autonomous data assets to effectively promote tasks through the development, test, and production lifecycle.

Connect privately to a database that has an NSG set up

You will need to use private networking in your OCI Data Integration workspace, and you will need to add the workspace subnet's CIDR to the ingress rules in your NSG.
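
A minimal sketch of adding such an ingress rule with the OCI Python SDK follows; the NSG OCID, workspace subnet CIDR, and database port are placeholders to replace with your own values.

    import oci
    from oci.core.models import (
        AddNetworkSecurityGroupSecurityRulesDetails,
        AddSecurityRuleDetails,
        TcpOptions,
        PortRange,
    )

    # Placeholders: your NSG OCID and the CIDR of the subnet attached to the workspace.
    NSG_ID = "ocid1.networksecuritygroup.oc1..example"
    WORKSPACE_SUBNET_CIDR = "10.0.2.0/24"

    network = oci.core.VirtualNetworkClient(oci.config.from_file())

    # Allow the Data Integration workspace subnet to reach the database listener port (1521 here).
    network.add_network_security_group_security_rules(
        NSG_ID,
        AddNetworkSecurityGroupSecurityRulesDetails(
            security_rules=[
                AddSecurityRuleDetails(
                    direction="INGRESS",
                    protocol="6",  # TCP
                    source=WORKSPACE_SUBNET_CIDR,
                    source_type="CIDR_BLOCK",
                    tcp_options=TcpOptions(
                        destination_port_range=PortRange(min=1521, max=1521)
                    ),
                )
            ]
        ),
    )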

Private networking and Gateways

If you are getting connectivity errors like ‘Bad gateway’, check that a Service Gateway with all region-wide services has been set up for the network you are bridging to from OCI Data Integration (the network you selected when you created the workspace). For public routes, ensure you have a NAT Gateway in your route table.

Set up policies for the Workspace or Application resource

While you may design tasks and see buckets, objects, and so on, when a task is executed, Data Integration uses resource principals, either the Workspace resource (all tasks) or the Application resource (REST tasks only), to define the authorization. So when you see errors like the one below

com.oracle.bmc.model.BmcException: (404, BucketNotFound, false) Either the bucket named ‘abc-bucket’ does not exist in the namespace ‘xyz’ or you are not authorized to access it

it is because your workspace resource, for example, does not have permission to access the bucket. This goes for all referenced OCI services: the example above is for OCI Object Storage, but if your REST task leverages Data Science, Data Flow, Container Instances, and so on, you will need appropriate policy statements for those services as well. You need to create either a dynamic group and a policy granting access to the artifact, or use the ‘allow any-user’ style syntax. Check this section of the documentation for details on policies.
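
For example, to let a workspace read and write a bucket, policy statements along the lines below are typically used; the compartment name and workspace OCID are placeholders, and the exact verbs depend on what your tasks do, so confirm them in the documentation.

    allow any-user to use buckets in compartment my-compartment
      where ALL {request.principal.type = 'disworkspace', request.principal.id = 'ocid1.disworkspace.oc1..example'}
    allow any-user to manage objects in compartment my-compartment
      where ALL {request.principal.type = 'disworkspace', request.principal.id = 'ocid1.disworkspace.oc1..example'}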

How to load multiple targets from a single source

Data flow designs use standard ETL operators, and you can have multiple distinct flows within one data flow. You can order the execution by target operator, process the output of one operator many times into different targets, and split a flow by condition into multiple targets.

Execute one task for many different sources or targets

Data Integration tasks can be parameterized: you can parameterize everything from the data asset to the connection, schema, and entity, and you can define parameters for conditions and values. See the blog here on ‘Reusing Tasks in OCI Data Integration’ for many reuse use cases.

How do you pass a name for the file as an input or output parameter?

You can pass scalar parameters, such as a filename, to be used in place of an entity name. Entities can be parameterized as an entity-type parameter, or you can define the name as a scalar string parameter and reference it as ${MY_PARAM}.
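
As an illustration, such a parameter can also be supplied when running a published task through the SDK. The sketch below uses the OCI Python SDK and assumes a parameter named FILE_NAME_PARAM; the model fields shown are the usual shape of the call, but verify them against the SDK reference.

    import oci

    # Placeholders: workspace OCID, application key and the published task key.
    WORKSPACE_ID = "ocid1.disworkspace.oc1..example"
    APPLICATION_KEY = "your-application-key"
    TASK_KEY = "your-published-task-key"

    client = oci.data_integration.DataIntegrationClient(oci.config.from_file())

    run_details = oci.data_integration.models.CreateTaskRunDetails(
        registry_metadata=oci.data_integration.models.RegistryMetadata(
            aggregator_key=TASK_KEY
        ),
        # Bind the scalar parameter used as the entity name, e.g. ${FILE_NAME_PARAM}.
        config_provider=oci.data_integration.models.CreateConfigProvider(
            bindings={
                "FILE_NAME_PARAM": oci.data_integration.models.ParameterValue(
                    simple_value="addresses_20230828.csv"
                )
            }
        ),
    )

    client.create_task_run(WORKSPACE_ID, APPLICATION_KEY, run_details)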

What’s the best practice for monitoring executions?

Task executions emit OCI Events when tasks complete successfully or fail. You also have the ability to set an ‘expected time’ for a task execution when scheduling a task and when using a task in a pipeline flow. Data Integration integrates with OCI public Logging and Monitoring for visibility. Check this blog here for ‘Managing by Exceptions’ using OCI Events (also this blog); with this approach you can get an email when a specific task type fails. You can also analyze metrics emitted by Data Integration and create alarms based on exception criteria over the metric values.

You can get notifications (email, PagerDuty, SMS, and so on) on OCI Data Integration task errors. Using rules in the Events service, you can filter by task status = ERROR on the task-completed event type.
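
A rule condition for this typically looks like the sketch below; select the Data Integration task-completed event type from the list in the Events console, since the exact event type string is not reproduced here.

    {
      "eventType": "<the Data Integration task-completed event type>",
      "data": {
        "status": "ERROR"
      }
    }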

How do you extract and process only changes?

Data Integration supports both managed and custom incremental extracts from source systems. For example, managed incremental extract lets you simply define a source such as a Fusion Applications object and schedule daily extracts from it; the service takes care of the watermark and extracts only the changes on each execution (see the documentation here). You can also build custom parameterized filters to extract the changes and manage the state yourself (check this older blog).

How do you load many files at once?

Tasks in Data Integration can process logical groups of objects in OCI Object Storage and other object stores using file patterns. Check this blog for details.

Synchronous Best Practices in REST tasks

When executing REST tasks, there is a one-minute limit within which the REST API is expected to respond with a status. Although OCI Functions can run for up to 5 minutes, when invoking a function from OCI Data Integration you should use the header fn-invoke-type with the value detached on the InvokeFunction endpoint. If you want to poll for a result, you will need some persistent state, such as an object in Object Storage, that you check in the polling API.
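
The main endpoint of such a REST task then looks roughly like the sketch below; the function OCID and invoke endpoint are placeholders.

    POST https://<function-invoke-endpoint>/20181201/functions/<function-ocid>/actions/invoke
    fn-invoke-type: detached
    Content-Type: application/json

    {"input": "your function payload"}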

How do you synchronize Applications?

You can copy/synchronize applications within and across workspaces; this is a useful feature for promoting applications through the lifecycle. This can be done in the OCI Console or via the SDK (see this blog here), which illustrates how to use the CreateApplication API to create the application copy and how to use CreatePatch to synchronize it.
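
A rough sketch of those two calls with the OCI Python SDK is below. The source-application and patch field names are assumptions to illustrate the shape of the calls, so verify them against the SDK reference and the blog.

    import oci

    client = oci.data_integration.DataIntegrationClient(oci.config.from_file())

    # Placeholders for the source (e.g. dev) and target (e.g. test) workspaces.
    SOURCE_WORKSPACE_ID = "ocid1.disworkspace.oc1..dev"
    TARGET_WORKSPACE_ID = "ocid1.disworkspace.oc1..test"
    SOURCE_APPLICATION_KEY = "source-application-key"

    # 1) Create a copy of the application in the target workspace (CreateApplication).
    app = client.create_application(
        TARGET_WORKSPACE_ID,
        oci.data_integration.models.CreateApplicationDetails(
            name="MyApplication_Test",
            identifier="MYAPPLICATION_TEST",
            model_type="INTEGRATION_APPLICATION",
            # source_application_info is assumed here; it points at the application to copy.
            source_application_info=oci.data_integration.models.CreateSourceApplicationInfo(
                workspace_id=SOURCE_WORKSPACE_ID,
                application_key=SOURCE_APPLICATION_KEY,
            ),
        ),
    ).data

    # 2) Later, synchronize the copy with its source application (CreatePatch).
    client.create_patch(
        TARGET_WORKSPACE_ID,
        app.key,
        oci.data_integration.models.CreatePatchDetails(
            name="Sync from dev",
            identifier="SYNC_FROM_DEV",
            patch_type="REFRESH",  # value assumed; confirm allowed patch types in the API reference
            object_keys=[],
        ),
    )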

How do you execute a Data Science notebook in a pipeline or schedule?

You can use Data Science jobs to execute notebooks; in this blog you can see how to create a REST task that executes a Data Science notebook from a pipeline or directly from Data Integration.

How do you transform similar attributes with the same transformation?

Within the data flow designer and the data loader task you can use macros to transform many attributes with one single expression. For example, if you want to perform a calculation or change a data type across many similar attributes, you can use this best practice. It lets you build expressions that work on patterns of attributes, so you don't have to do the mundane work of defining the same transformation over and over (date formatting, number formatting, common calculations, the list goes on). It also handles attributes added later in the data flow's life, so you do not have to keep babysitting changes. See here for more information.
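
As a hedged illustration, a pattern-based expression has roughly the shape below: the pattern selects the attributes and %MACRO_INPUT% stands in for each matched attribute (confirm the exact token and functions in the documentation).

    Pattern:    *_DATE
    Expression: TO_DATE(%MACRO_INPUT%, 'yyyy-MM-dd')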

How do you transform complex types?

Higher-order functions make transformation of complex data easy. OCI Data Integration supports processing files with nested hierarchies and data types; the blog here has information on data types and higher-order functions specifically related to nested hierarchy types and JSON, including transformation of composite types, arrays, and map types.
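
As a hedged sketch (the attribute names are made up and the exact functions supported are listed in the blog), a transformation over an array of structs might look like this:

    TRANSFORM(EXPRESSION_1.ORDER.ITEMS, item -> UPPER(item.PRODUCT_NAME))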

How do you invoke REST APIs with OAuth2 or Basic Auth?

REST tasks in Data Integration can currently be authenticated using OCI authentication or none. If you need to call an API that supports basic authentication or OAuth2, you can define the API in OCI API Gateway and call that API from OCI Data Integration. See https://blogs.oracle.com/dataintegration/post/invoke-rest-apis-from-rest-tasks-with-api-gateway for how to do this.

Use REST Task for Integrating any Custom Task

The REST task can be used to integrate any code; that can be absolutely anything. The task has a main endpoint, a polling endpoint, and a delete endpoint. This example uses the OCI Vision API to run an analysis and then poll for completion: https://blogs.oracle.com/dataintegration/post/discovering-entities-in-documents-and-images-with-oci-vision-and-data-integration

There is a post here on the variety of tasks; these could be anything from your own container, to OCI services, to containers from Microsoft or Google that invoke their CLIs!

Best practice for waiting on data and application state using Oracle NoSQL

This best practice shows how to save variable state for applications in OCI NoSQL, where it can easily be shared across your applications. Within data integration there is often a need to trigger processing by time, and sometimes by data: when data, or certain data values, arrive in a data store, some processing should be triggered. There is also the need to persist data in global variables, a practice used in application development since time began. Check this post here for how to use a durable state store in OCI NoSQL.

Best practice for creating target file names dynamically

The blog here illustrates how to create target file names dynamically, for example a target file with a timestamp appended, like addresses_<system_timestamp>.csv, in Oracle Cloud Infrastructure (OCI) Data Integration.
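
One hedged sketch of the idea is to build the name with an expression and bind it to the parameter that drives the target entity name; the exact functions available are covered in the blog and documentation.

    CONCAT('addresses_', DATE_FORMAT(CURRENT_TIMESTAMP, 'yyyyMMddHHmmss'), '.csv')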


Written by David Allan

Architect at @Oracle The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.
