Discovering Entities in Documents and Images with OCI Vision and Data Integration
As consumers we take a lot of things at our fingertips for granted every day, don't we? We flip through our phones with ease and snap away at images all day long. Documents, images, sound and video all hold valuable insights over and above structured data. In this example we will see how to create a task in OCI Data Integration that we can schedule and execute on a recurring basis, or integrate into a data pipeline or code.
The OCI AI family of services provides insights into all kinds of information that can then be further analyzed and processed. Here we will see how to integrate these services in a no-code environment such as OCI Data Integration.
Let's look at the example document provided to the AI Vision service below. It is a PDF, and there is a lot of information that can be extracted from it, from general text on the page to tabular information:
The AI Vision service will extract this information, which can then be stored in Object Storage. There is a lot of information we can extract here and analyze across our data lake.
When we use the Document AI features within the Vision service, we can extract information that we otherwise would not have known. The Vision service can automatically extract word- and line-level text from documents, and it can also automatically identify and extract table structure, including the row and column location of each cell within tables (such as the example above).
To incorporate this in OCI Data Integration we can create a REST task. These tasks, like other tasks, can be parameterized, executed, scheduled and incorporated into data pipelines. There are a few items that need to be defined:
- the POST API call to create the document job
- the retrieval of the document job OCID
- polling on the document job, using the OCID from the previous step, until it is completed
- support for terminating the document job
Let’s look at these in detail.
Creating the Document Job
Below we are creating a new REST task (see the OCI Data Integration documentation here) using the CreateDocumentJob API, which is a REST POST API on the Vision service. The URL used below is:
https://vision.aiservice.${REGION}.oci.oraclecloud.com/20220125/documentJobs
Here I have parameterized the region using ${REGION}, so I can use this task in different regions as well.
When you hit return after entering the URL above, a new URL parameter is added to the table; you can see that REGION has been added. We then have to edit the parameter and define a value.
When we edit the REGION parameter we can define the region value, such as 'us-ashburn-1' below.
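With REGION set to us-ashburn-1, for example, the task resolves the URL to:

https://vision.aiservice.us-ashburn-1.oci.oraclecloud.com/20220125/documentJobs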
Define the headers; we will add Content-Type as application/json…
then add the Accept header with the value application/json…
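Putting these pieces together, the call the REST task will make is equivalent to the HTTP request sketched below (the authentication headers that resource principal signing adds are omitted):

POST https://vision.aiservice.${REGION}.oci.oraclecloud.com/20220125/documentJobs
Content-Type: application/json
Accept: application/json

<request body, defined in the next step>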
Define the payload; in this example, as you can see below, I have defined the request body:
This is the actual content of the request body above that I used; there are two PDFs being analyzed:
{
  "features": [
    { "featureType": "TEXT_DETECTION", "generateSearchablePdf": true },
    { "featureType": "DOCUMENT_CLASSIFICATION", "maxResults": 5 },
    { "featureType": "LANGUAGE_CLASSIFICATION", "maxResults": 5 },
    { "featureType": "KEY_VALUE_DETECTION" },
    { "featureType": "TABLE_DETECTION" }
  ],
  "inputLocation": {
    "sourceType": "OBJECT_LIST_INLINE_INPUT_LOCATION",
    "objectLocations": [
      {
        "bucketName": "a_delta_archive",
        "namespaceName": "mynamespace",
        "objectName": "quarter_numbers_abc.pdf"
      },
      {
        "bucketName": "a_delta_archive",
        "namespaceName": "mynamespace",
        "objectName": "quarter_numbers_xyz.pdf"
      }
    ]
  },
  "outputLocation": {
    "bucketName": "a_delta_archive",
    "namespaceName": "mynamespace",
    "prefix": "visionout"
  },
  "compartmentId": "ocid1.compartment.oc1..mycompartment",
  "displayName": "visiondata",
  "isZipOutputEnabled": false
}
After the payload is defined, you can define the success condition; I am going to change it slightly for illustration. Below you can see I have added a check on the lifecycle state:
SYS.RESPONSE_STATUS >= 200 AND SYS.RESPONSE_STATUS <= 300 AND CAST(json_path(SYS.RESPONSE_PAYLOAD, '$.lifecycleState') AS String) == 'SUCCEEDED'
Retrieving the Document Job OCID
The documentJobs API is an asynchronous call, so if we want this task to wait until the Vision job is actually complete, we can use the poll feature within Data Integration. Select the check box to configure a polling and termination condition…
We can then define an expression to get the document job OCID from our REST task's response. Define the name DOCUMENT_JOB_OCID as a VARCHAR with the expression below; this gets the id attribute from the response payload of our POST…
CAST(json_path(SYS.RESPONSE_PAYLOAD, '$.id') AS String)
Below you can see it defined within the expression…
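For context, the document job APIs return JSON along the lines of the abbreviated, illustrative sketch below (placeholder values; the real response contains more fields). The '$.id' path above picks out the job OCID, and '$.lifecycleState' is the attribute that the success and polling conditions check.

{
  "id": "<document job OCID>",
  "compartmentId": "ocid1.compartment.oc1..mycompartment",
  "displayName": "visiondata",
  "lifecycleState": "SUCCEEDED",
  "outputLocation": {
    "bucketName": "a_delta_archive",
    "namespaceName": "mynamespace",
    "prefix": "visionout"
  }
}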
Polling on the Document Job
To poll the job, we will use the GetDocumentJob API from Vision. We can define the URL to use the expression name:
https://vision.aiservice.${REGION}.oci.oraclecloud.com/20220125/documentJobs/#{DOCUMENT_JOB_OCID}
Enter the URL with the GET HTTP method for GetDocumentJob…
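At run time, with REGION resolved to us-ashburn-1 and DOCUMENT_JOB_OCID bound from the create call, the polled URL takes a form like this (the OCID is just a placeholder):

https://vision.aiservice.us-ashburn-1.oci.oraclecloud.com/20220125/documentJobs/<document job OCID>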
The polling condition is shown below:
CAST(json_path(SYS.RESPONSE_PAYLOAD, '$.lifecycleState') AS String) != 'SUCCEEDED' AND CAST(json_path(SYS.RESPONSE_PAYLOAD, '$.lifecycleState') AS String) != 'FAILED' AND CAST(json_path(SYS.RESPONSE_PAYLOAD, '$.lifecycleState') AS String) != 'TERMINATED'
Here is where it is defined…
Terminating a Document Job
We can define the termination call similarly, using the CancelDocumentJob API:
https://vision.aiservice.${REGION}.oci.oraclecloud.com/20220125/documentJobs/#{DOCUMENT_JOB_OCID}/actions/cancel
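In effect, the termination call is a POST with an empty request body against the cancel action; here is a sketch (again, signing headers omitted):

POST https://vision.aiservice.${REGION}.oci.oraclecloud.com/20220125/documentJobs/#{DOCUMENT_JOB_OCID}/actions/cancel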
Below is what the panel should look like:
Finally, define the authentication mechanism as OCI resource principal…
We should now have a valid task:
How do we make this more generic? Parameters! We can define parameters for values in the payload; for example, in the request body you can parameterize either the entire payload or specific properties within it. In the example below I have parameterized many aspects of the payload:
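As an illustrative sketch, a parameterized request body could expose the input objects, output location and compartment, for example. The parameter names here are my own, and the ${...} placeholders simply mirror the URL parameter syntax used earlier; bind them in the REST task editor however suits your pipeline.

{
  "features": [
    { "featureType": "TEXT_DETECTION", "generateSearchablePdf": true },
    { "featureType": "KEY_VALUE_DETECTION" },
    { "featureType": "TABLE_DETECTION" }
  ],
  "inputLocation": {
    "sourceType": "OBJECT_LIST_INLINE_INPUT_LOCATION",
    "objectLocations": [
      {
        "bucketName": "${INPUT_BUCKET}",
        "namespaceName": "${NAMESPACE}",
        "objectName": "${INPUT_OBJECT}"
      }
    ]
  },
  "outputLocation": {
    "bucketName": "${OUTPUT_BUCKET}",
    "namespaceName": "${NAMESPACE}",
    "prefix": "${OUTPUT_PREFIX}"
  },
  "compartmentId": "${COMPARTMENT_OCID}",
  "displayName": "visiondata",
  "isZipOutputEnabled": false
}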
There you go, we can publish the task and execute it. Once it is running, if you view the output bucket you will see the files created and the entities extracted from our documents and images. Hope you found this interesting and useful. Please share your thoughts, good or bad :)