Discovering Entities in Documents and Images with OCI Vision and Data Integration

David Allan
6 min read · Apr 28, 2022

As consumers, we take a lot of the things at our fingertips for granted every day, don’t we? We flip through our phones with ease and snap images all day long. Documents, images, sound, and movies all hold valuable insights over and above structured data. In this example we will see how to create a task in OCI Data Integration that we can schedule and execute on a recurring basis, or integrate into a data pipeline or code.

The OCI AI family of services provides insights into all kinds of information that can then be further analyzed and processed. Here we will see how to integrate these services in a no-code environment such as OCI Data Integration.

Let’s look at the example document provided in the AI Vision service below. It is a PDF, and there’s a lot of information that can be extracted from it, from general text on the page to tabular information:

Example table in a PDF document from OCI AI Vision service

The AI Vision service will extract this information, and the results can be stored in Object Storage. We can then analyze the extracted information across our data lake.

When we use the Document AI features within the Vision service, we can extract information that we otherwise would not have known. The Vision service can automatically extract word- and line-level text from documents. It can also automatically identify and extract table structure from documents, including the row and column location of each cell within tables (such as the example above).

Vision API in action within OCI Console

To incorporate this in OCI Data Integration we can create a REST task. Like other tasks, REST tasks can be parameterized, executed, scheduled, and incorporated into data pipelines. There are a few items that need to be defined:

  1. the POST API call to create the document job
  2. the retrieval of the document job OCID
  3. polling on the document job, using the OCID from point 2, until it has completed
  4. support for terminating the document job

Let’s look at these in detail.

Creating the Document Job

Below we are creating a new REST task (see OCI Data Integration documentation here) and using the CreateDocumentJob API, which is a REST POST API on the Vision service. The URL used below is:

https://vision.aiservice.${REGION}.oci.oraclecloud.com/20220125/documentJobs

Here I have parameterized the region using ${REGION} so that I can also use this task in different regions.
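To make the substitution concrete outside the Data Integration UI, here is a minimal Python sketch that fills the same ${REGION} placeholder using the standard library's string.Template, which happens to share the ${NAME} syntax. The helper name create_job_url is hypothetical and only for illustration; Data Integration performs the equivalent substitution itself when the task runs.

```python
from string import Template

# The endpoint pattern from the article; ${REGION} is the task parameter.
CREATE_JOB_URL = Template(
    "https://vision.aiservice.${REGION}.oci.oraclecloud.com/20220125/documentJobs"
)


def create_job_url(region: str) -> str:
    """Return the CreateDocumentJob endpoint for one region identifier."""
    # substitute() raises KeyError if a placeholder is left without a value.
    return CREATE_JOB_URL.substitute(REGION=region)


if __name__ == "__main__":
    for region in ("us-ashburn-1", "eu-frankfurt-1"):
        print(create_job_url(region))
```

The same pattern yields a different endpoint for each region value, which is the point of parameterizing the task.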

Create a REST Task in Data Integration

When you hit return after entering the URL above, a new URL parameter is added to the table; see that REGION has been added. We have to edit the parameter and define a value.

REGION parameter has been added.

When we edit the REGION parameter we can define the region value such as ‘us-ashburn-1’ below.

Define REGION parameter value

Define the headers. We will add Content-Type with the value application/json…

Content-Type header with value.

then add the Accept header with the value application/json…

Accept header with value.

Define the payload. In this example, as you can see below, I have defined the request body:

This is the actual content of the request body above that I used; there are 2 PDFs being analyzed:

{
  "features": [
    {"featureType": "TEXT_DETECTION", "generateSearchablePdf": true},
    {"featureType": "DOCUMENT_CLASSIFICATION", "maxResults": 5},
    {"featureType": "LANGUAGE_CLASSIFICATION", "maxResults": 5},
    {"featureType": "KEY_VALUE_DETECTION"},
    {"featureType": "TABLE_DETECTION"}
  ],
  "inputLocation": {
    "sourceType": "OBJECT_LIST_INLINE_INPUT_LOCATION",
    "objectLocations": [
      {
        "bucketName": "a_delta_archive",
        "namespaceName": "mynamespace",
        "objectName": "quarter_numbers_abc.pdf"
      },
      {
        "bucketName": "a_delta_archive",
        "namespaceName": "mynamespace",
        "objectName": "quarter_numbers_xyz.pdf"
      }
    ]
  },
  "outputLocation": {
    "bucketName": "a_delta_archive",
    "namespaceName": "mynamespace",
    "prefix": "visionout"
  },
  "compartmentId": "ocid1.compartment.oc1..mycompartment",
  "displayName": "visiondata",
  "isZipOutputEnabled": false
}
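If you need to analyze more than a couple of files, writing the objectLocations entries by hand gets tedious. As a rough sketch, the Python below assembles a body with the same shape from a list of object names. The function name and its arguments are hypothetical; the field names and feature list are copied from the request body above.

```python
import json


def build_document_job_payload(namespace, bucket, object_names, compartment_id,
                               output_prefix="visionout",
                               display_name="visiondata"):
    """Assemble a request body shaped like the one used in the article."""
    return {
        "features": [
            {"featureType": "TEXT_DETECTION", "generateSearchablePdf": True},
            {"featureType": "DOCUMENT_CLASSIFICATION", "maxResults": 5},
            {"featureType": "LANGUAGE_CLASSIFICATION", "maxResults": 5},
            {"featureType": "KEY_VALUE_DETECTION"},
            {"featureType": "TABLE_DETECTION"},
        ],
        "inputLocation": {
            "sourceType": "OBJECT_LIST_INLINE_INPUT_LOCATION",
            # One entry per document to analyze.
            "objectLocations": [
                {"bucketName": bucket, "namespaceName": namespace, "objectName": name}
                for name in object_names
            ],
        },
        "outputLocation": {
            "bucketName": bucket,
            "namespaceName": namespace,
            "prefix": output_prefix,
        },
        "compartmentId": compartment_id,
        "displayName": display_name,
        "isZipOutputEnabled": False,
    }


if __name__ == "__main__":
    payload = build_document_job_payload(
        namespace="mynamespace",
        bucket="a_delta_archive",
        object_names=["quarter_numbers_abc.pdf", "quarter_numbers_xyz.pdf"],
        compartment_id="ocid1.compartment.oc1..mycompartment",
    )
    # The printed JSON can be pasted into the REST task's request body.
    print(json.dumps(payload, indent=2))
```

This sketch writes the output to the same bucket as the input, as the article's example does; split the arguments if your input and output buckets differ.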

After the payload is defined, you can define the success condition. I am going to change it slightly for illustration:

Below you can see I added a check on the lifecycle state:

SYS.RESPONSE_STATUS >= 200 AND SYS.RESPONSE_STATUS <= 300 AND CAST(json_path(SYS.RESPONSE_PAYLOAD, '$.lifecycleState') AS String) == 'SUCCEEDED'
Success condition
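To make it easier to reason about what this expression accepts and rejects, here is a small Python rendering of the same logic. It is only an approximation for reading purposes: the function name is hypothetical, and the json_path lookup of '$.lifecycleState' is modelled as a plain top-level key lookup. The status bounds mirror the expression above exactly, including the upper bound of 300.

```python
import json


def is_success(status: int, response_payload: str) -> bool:
    """Approximate the success condition: status in range AND state SUCCEEDED."""
    if not (200 <= status <= 300):
        return False
    try:
        # Equivalent of json_path(SYS.RESPONSE_PAYLOAD, '$.lifecycleState').
        state = json.loads(response_payload).get("lifecycleState")
    except (TypeError, ValueError, AttributeError):
        # Missing body, non-JSON body, or a JSON value that is not an object.
        return False
    return state == "SUCCEEDED"


if __name__ == "__main__":
    print(is_success(200, '{"lifecycleState": "SUCCEEDED"}'))
    print(is_success(200, '{"lifecycleState": "FAILED"}'))
    print(is_success(500, '{"lifecycleState": "SUCCEEDED"}'))
```

Both parts must hold: a 2xx response whose job did not succeed is treated as a failure of the task.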

Retrieving the Document Job OCID

The documentJobs API is an asynchronous call, so if we want this task to wait until the Vision job is actually complete, we can use the polling feature within Data Integration. Select the check box to configure a polling and termination condition…

Define a polling API call

We can then define an expression to get the document job OCID from our REST task action. Define the name DOCUMENT_JOB_OCID as a VARCHAR with the expression below, which gets the id attribute from the response payload of our POST…

CAST(json_path(SYS.RESPONSE_PAYLOAD, '$.id') AS String)

Below you can see it defined within the expression…

Get the document job OCID
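For readers who want to see what that expression does to a response, the following Python sketch performs the same extraction of the id attribute. The sample response is made up for illustration (a real response carries more fields and a real OCID), and extract_job_ocid is a hypothetical helper name.

```python
import json


def extract_job_ocid(response_payload: str) -> str:
    """Return the top-level 'id' attribute, like json_path(..., '$.id')."""
    payload = json.loads(response_payload)
    job_id = payload.get("id") if isinstance(payload, dict) else None
    if not isinstance(job_id, str) or not job_id:
        raise ValueError("response payload has no usable 'id' attribute")
    return job_id


if __name__ == "__main__":
    # Hypothetical response body from the POST; the OCID value is invented.
    sample = '{"id": "ocid1.aivisiondocumentjob.oc1..example", "displayName": "visiondata"}'
    print(extract_job_ocid(sample))
```

Failing loudly when the id is missing is a design choice of this sketch; without an OCID there is nothing to poll or cancel.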

Polling on the Document Job

To poll the job, we will use the GetDocumentJob API from Vision. We can define the URL using the expression name:

https://vision.aiservice.${REGION}.oci.oraclecloud.com/20220125/documentJobs/#{DOCUMENT_JOB_OCID}

Enter this URL with the GET HTTP method for GetDocumentJob, as shown below.

Define the polling API

The polling condition is as below:

CAST(json_path(SYS.RESPONSE_PAYLOAD, '$.lifecycleState') AS String) != 'SUCCEEDED' AND CAST(json_path(SYS.RESPONSE_PAYLOAD, '$.lifecycleState') AS String) != 'FAILED' AND CAST(json_path(SYS.RESPONSE_PAYLOAD, '$.lifecycleState') AS String) != 'TERMINATED'

Here is where it is defined:

Define the polling condition
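The polling condition reads as "keep polling while the job is in none of these end states". A minimal Python sketch of that loop is below. The three end states are taken from the expression above; the function names, the interval and attempt limits, and the IN_PROGRESS value used in the demo are assumptions for illustration. The get_state callable stands in for the GET call, so the sketch runs without any network access.

```python
import time
from typing import Callable

# The end states named in the article's polling condition.
END_STATES = {"SUCCEEDED", "FAILED", "TERMINATED"}


def should_keep_polling(state: str) -> bool:
    """True while the job has not reached one of the end states."""
    return state not in END_STATES


def poll_until_done(get_state: Callable[[], str],
                    interval_seconds: float = 5.0,
                    max_attempts: int = 60,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Call get_state repeatedly until the polling condition becomes false."""
    for attempt in range(1, max_attempts + 1):
        state = get_state()
        if not should_keep_polling(state):
            return state
        if attempt < max_attempts:
            sleep(interval_seconds)
    raise TimeoutError(f"job still running after {max_attempts} polls")


if __name__ == "__main__":
    # Simulated sequence of lifecycle states instead of real GET responses.
    states = iter(["IN_PROGRESS", "IN_PROGRESS", "SUCCEEDED"])
    final = poll_until_done(lambda: next(states), sleep=lambda _seconds: None)
    print(final)
```

The attempt limit plays the role of a timeout: when it is reached, the caller can decide whether to cancel the job.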

Terminating a Document Job

We can define the termination call similarly, using the CancelDocumentJob API:

https://vision.aiservice.${REGION}.oci.oraclecloud.com/20220125/documentJobs/#{DOCUMENT_JOB_OCID}/actions/cancel

Below is what the panel should look like:

Define the terminate/cancel API
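As a sketch of what the cancel call looks like once both placeholders are resolved, the Python below builds (but does not send) the request using the standard library. The POST method is an assumption on my part, since the text above gives only the URL. The helper name and the sample OCID are hypothetical, and the request is unsigned, so it would still need OCI authentication before it could actually be sent.

```python
from urllib.parse import quote
from urllib.request import Request

# URL pattern from the article, with both placeholder styles left as written.
CANCEL_URL_PATTERN = (
    "https://vision.aiservice.${REGION}.oci.oraclecloud.com"
    "/20220125/documentJobs/#{DOCUMENT_JOB_OCID}/actions/cancel"
)


def build_cancel_request(region: str, job_ocid: str) -> Request:
    """Resolve the placeholders and return an unsent, unsigned request."""
    url = (
        CANCEL_URL_PATTERN
        .replace("${REGION}", region)
        # Percent-encode the OCID so it is safe as a single path segment.
        .replace("#{DOCUMENT_JOB_OCID}", quote(job_ocid, safe=""))
    )
    # Assumption: the cancel action is invoked with POST.
    return Request(url, method="POST", headers={"Accept": "application/json"})


if __name__ == "__main__":
    request = build_cancel_request("us-ashburn-1", "ocid1.aivisiondocumentjob.oc1..example")
    print(request.get_method(), request.full_url)
```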

Finally define the authentication mechanism as OCI resource principal…

Use resource principal

We should now have a valid task:

REST task is completed

How can we make this more generic? Parameters! We can define parameters in the payload: in the request body, you can define parameters either for the entire payload or for specific properties within it. In the example below, I have parameterized many aspects of the payload:
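As a rough stand-in for the idea outside the Data Integration UI (and not the actual task definition), the Python sketch below treats a trimmed-down request body as a template and fills in property-level parameters. The parameter names (BUCKET_NAME, NAMESPACE, OBJECT_NAME, OUTPUT_PREFIX, COMPARTMENT_OCID) are hypothetical, and I am assuming the same ${NAME} placeholder style that the URL uses for ${REGION}.

```python
import json
from string import Template

# Trimmed-down body with hypothetical property-level parameters.
BODY_TEMPLATE = Template("""{
  "features": [{"featureType": "TABLE_DETECTION"}],
  "inputLocation": {
    "sourceType": "OBJECT_LIST_INLINE_INPUT_LOCATION",
    "objectLocations": [
      {"bucketName": "${BUCKET_NAME}",
       "namespaceName": "${NAMESPACE}",
       "objectName": "${OBJECT_NAME}"}
    ]
  },
  "outputLocation": {
    "bucketName": "${BUCKET_NAME}",
    "namespaceName": "${NAMESPACE}",
    "prefix": "${OUTPUT_PREFIX}"
  },
  "compartmentId": "${COMPARTMENT_OCID}",
  "displayName": "visiondata",
  "isZipOutputEnabled": false
}""")


def render_body(**params: str) -> dict:
    """Fill every placeholder, then parse to confirm the result is valid JSON."""
    # substitute() raises KeyError when a parameter has no value.
    return json.loads(BODY_TEMPLATE.substitute(**params))


if __name__ == "__main__":
    body = render_body(
        BUCKET_NAME="a_delta_archive",
        NAMESPACE="mynamespace",
        OBJECT_NAME="quarter_numbers_abc.pdf",
        OUTPUT_PREFIX="visionout",
        COMPARTMENT_OCID="ocid1.compartment.oc1..mycompartment",
    )
    print(json.dumps(body, indent=2))
```

Plain text substitution like this does not escape quotes or backslashes in the values, so it is only suitable for simple values such as bucket and object names.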

There you go: we can publish the task and execute it. Once it is executing, if you view the bucket you will see the files being created, with the data and entities extracted from our documents and images. I hope you found this interesting and useful. Please share your thoughts, good or bad :)

David Allan

Architect at @Oracle The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.