Container Instances — Run Dockerized ETL in Data Integration
Have you heard about the OCI Container Instances service? See the release blog here. It is a great way to run custom jobs in a serverless manner.
I often see requests like: how do I run some custom Python in the cloud, how can I run an Essbase MaxL script, how can I run some existing tool? The list goes on. What people are really asking is how to containerize their ETL scripts, and how to build a portable, isolated environment that runs both locally and in the cloud. We will see how to run our scripts from within a container, so we do not have to worry about whether the necessary libraries and packages are installed on the target system or environment. All required dependencies are self-contained in the container. The container can also securely access your private network (when the container instance is created you can specify a subnet to attach to), so the script can reach databases and other resources securely. Bliss.
Container Instances on Oracle Cloud Infrastructure can be leveraged as a serverless container runtime: you don't have to stand up a compute instance and monitor its lifecycle and billing; you run containers on demand, when and if you need them. With the OCI Container Instances service you define which specific Docker images should run on the instance. They can be small or large jobs, short or long running. Not having to configure, provision and manage instances is a big benefit, as is working with a containerized image of course! See the documentation page here and a great post from Lucas Jellema here which details the functionality.
The Container Instances service has a simple interface to create the container instance, run the containers and manage the instance. All of this can be wrapped in a REST task and executed from within OCI Data Integration. This serverless, cloud native solution makes it easy for you to integrate custom dockerized jobs in a cloud-scale and secure manner.
You can leverage an existing REST task for running your containers, then schedule and orchestrate it in OCI Data Integration. There is a "Run Container Instance" REST task within a Postman collection in the blog post here that you can use to run any container-based job on OCI Container Instances. The image can come from OCI Registry (OCI's Docker registry) or from other hosted Docker registries (such as Docker Hub). Below you'll see how a pre-existing image (bash from Docker Hub) is used in a hello world example, and then how an example Python script manipulating OCI Object Storage is used. The OCI example is a good starting point that can be adapted to any of the other OCI services such as the AI services, Data Safe, GoldenGate and so on. The authentication uses OCI Identity's resource principal based authentication, so it is secure and safe. If you need to access credentials for databases, you can store them as secrets in OCI Vault.
The Hello World Example
A basic hello world style example is simply to run a public container like bash or python (the image is already built). When you run this container instance you will see the container being created and started, then exiting, and finally the container instance being deleted.
Running the container instance for this basic example is equivalent to running the bash Docker image as:
docker run bash
It's not doing anything terribly exciting, but we can first run this from Container Instances to check it out (see the other blogs and documentation; it's pretty straightforward). Below you can see that when selecting the Docker image I enter the image name (bash), which comes from Docker Hub.
To test from OCI Data Integration, note that the REST task provided in the link above polls until the container instance is deleted before completing. If you do not want this behavior and just want to launch and return, you can remove the polling from the task.
Below you can see the REST task in OCI Data Integration which kicks off a Container Instance. The task can also be scheduled on a recurring basis or added into a data pipeline and orchestrated! Below you can see the POST API call and a snippet of the payload within the REST task definition.
The REST task is a starter example with a reasonable amount already defined, and you can add more. Below we can see the image URL, the environment variables and much more (you could also parameterize the command and entrypoint arguments, for example):
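The polling pattern described here can be sketched in Python roughly as follows. This only illustrates the idea and is not the REST task's actual implementation: `wait_until_deleted` and `get_state` are hypothetical names, and the state strings `DELETED` and `FAILED` are assumed to match the lifecycle states reported for a container instance.

```python
import time
from typing import Callable


def wait_until_deleted(get_state: Callable[[], str],
                       interval_seconds: float = 10.0,
                       timeout_seconds: float = 3600.0) -> str:
    """Poll get_state() until it returns a terminal state or the timeout expires.

    get_state is a hypothetical callable supplied by the caller; in practice it
    would query the Container Instances API and return the lifecycle state.
    """
    terminal_states = {"DELETED", "FAILED"}  # assumed terminal lifecycle states
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_state()
        if state in terminal_states:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still in state {state} after {timeout_seconds}s")
        time.sleep(interval_seconds)
```

A "launch and return" variant would simply skip this loop and hand back the identifier of the newly created container instance.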
{
  "containers": [
    {
      "imageUrl": "${CONTAINER_URL_PATH}",
      "displayName": "container-20230103-1629-1",
      "environmentVariables": { ${ENVIRONMENT_VARS} },
      "definedTags": {},
      "freeformTags": {}
    }
  ],
  "compartmentId": "${COMPARTMENT_ID}",
  "availabilityDomain": "${AVAILABILITY_DOMAIN}",
  "shape": "${SHAPE}",
  "shapeConfig": {
    "ocpus": ${OCPUS},
    "memoryInGBs": ${MEMORY_IN_GB}
  },
  "vnics": [
    {
      "subnetId": "${SUBNET_ID}",
      "displayName": "${VCN_NAME}",
      "isPublicIpAssigned": true,
      "skipSourceDestCheck": true
    }
  ],
  "displayName": "${CI_DISPLAY_NAME}",
  "gracefulShutdownTimeoutInSeconds": ${GRACEFUL_SHUTDOWN_TIME},
  "containerRestartPolicy": "${CONTAINER_RESTART_POLICY}"
}
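To make the `${...}` placeholders concrete, here is a minimal Python sketch of how such a template expands into a valid JSON body. Data Integration performs the substitution itself; the snippet merely mimics it with `string.Template` on a shortened copy of the payload. The helper names `env_fragment` and `render_payload`, the bucket name and the shape name are all assumptions made for illustration. Note that `ENVIRONMENT_VARS` is substituted inside a pair of braces, so its value has to be a comma-separated fragment of `"key": "value"` pairs rather than a complete JSON object.

```python
import json
from string import Template

# Shortened copy of the payload template, for illustration only.
PAYLOAD_TEMPLATE = Template("""{
  "containers": [
    {
      "imageUrl": "${CONTAINER_URL_PATH}",
      "environmentVariables": { ${ENVIRONMENT_VARS} }
    }
  ],
  "shape": "${SHAPE}",
  "shapeConfig": { "ocpus": ${OCPUS}, "memoryInGBs": ${MEMORY_IN_GB} }
}""")


def env_fragment(variables: dict) -> str:
    """Render a dict as the inside of a JSON object (no outer braces)."""
    return json.dumps(variables)[1:-1]


def render_payload(values: dict) -> dict:
    """Substitute the placeholders, then parse to confirm the result is valid JSON."""
    return json.loads(PAYLOAD_TEMPLATE.substitute(values))


payload = render_payload({
    "CONTAINER_URL_PATH": "bash",  # public image from the hello world example
    "ENVIRONMENT_VARS": env_fragment({"BUCKET_NAME": "demo-bucket"}),  # made-up value
    "SHAPE": "CI.Standard.E4.Flex",  # assumed shape name
    "OCPUS": "1",
    "MEMORY_IN_GB": "4",
})
print(json.dumps(payload, indent=2))
```

Parsing the rendered text with `json.loads` is a cheap way to catch a malformed environment variable fragment before the request is sent.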
When we run the task we can enter these values, or configure them in a schedule or a pipeline. You can also change the defaults in the task if that's good enough.
Custom Image — run some Python
Moving on to a custom container, we can see below that our Dockerfile is based on the Python 3 image; we install some Python requirements (the OCI SDK dependencies) and run a Python script. The script is an example showing how to create a bucket and upload a file into OCI Object Storage. The point here is to create a unit of work with much work to do.
FROM python:3
WORKDIR /usr/src/app
RUN pip3 --version
COPY requirements.txt ./
RUN pip3 install --no-cache-dir -r requirements.txt
COPY data_integration_script.py .
CMD [ "python3", "./data_integration_script.py" ]
The requirements.txt file is shown below. Our example is quite simple and only depends on the OCI package, which lets us use all kinds of OCI services:
oci
The data_integration_script.py used in the example is your code. Below is an example that creates a bucket and uploads a file. It uses resource principal authentication since it is launched from the Container Instances service, and the parameters are passed as environment variables when the container is run:
import os
import oci


def create_data_in_os(rps, namespace, compartment, bucket_name, object_name):
    """Create the bucket if needed, then upload a small CSV object."""
    data = b'col1,col2\n1,"a"'
    # An empty config plus a signer authenticates as the resource principal
    object_storage = oci.object_storage.ObjectStorageClient({}, signer=rps)
    try:
        bd = oci.object_storage.models.CreateBucketDetails(compartment_id=compartment, name=bucket_name)
        object_storage.create_bucket(namespace_name=namespace, create_bucket_details=bd)
    except oci.exceptions.ServiceError as e:
        # Most commonly the bucket already exists; report the reason and carry on
        print("INFO: bucket not created (it may already exist): " + str(e.message))
    try:
        object_storage.put_object(namespace_name=namespace, bucket_name=bucket_name, object_name=object_name, put_object_body=data)
        print("Data file " + object_name + " uploaded to bucket " + bucket_name)
    except oci.exceptions.ServiceError as e:
        print("INFO: object not uploaded: " + str(e.message))


# The resource principal signer is available when running in Container Instances
rps = oci.auth.signers.get_resource_principals_signer()
# Parameters arrive as environment variables set on the container
namespace = os.environ.get('NAMESPACE')
compartment = os.environ.get('COMPARTMENT_ID')
bucket = os.environ.get('BUCKET_NAME')
entity = os.environ.get('OBJECT_NAME')
create_data_in_os(rps, namespace, compartment, bucket, entity)
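Because `os.environ.get` returns `None` for an unset variable, a parameter that was forgotten in the REST task only surfaces later as an SDK error. One possible refinement, sketched here with a hypothetical helper named `read_required_env` (it is not part of the original script or of the OCI SDK), is to validate the four parameters up front:

```python
import os

# The four parameters the example script reads from its environment.
REQUIRED_VARS = ("NAMESPACE", "COMPARTMENT_ID", "BUCKET_NAME", "OBJECT_NAME")


def read_required_env(names=REQUIRED_VARS, environ=None) -> dict:
    """Return the named variables, or raise one error listing all that are missing or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}
```

In the script, a single `params = read_required_env()` call could then stand in for the four separate `os.environ.get` lookups, so a misconfigured run fails immediately with a readable message in the container log.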
This is an example of a custom, parameterized script that you can dockerize and extend with your own logic. You can then invoke it via the serverless Container Instances service from the rest of your data integration pipeline.
We build the Docker image and then publish it, for example to OCI Registry. We can then execute it via the REST task in OCI Data Integration. Below you can see the OCI Registry Docker image URL for my image, and the parameters passed as the environment variables that your script will use.
To invoke this from OCI Data Integration you will need resource principal policies set up for using the Container Instances service (see here).
You can execute the REST task via SDKs, from within a pipeline or schedule the task from within Data Integration.
Summary
Hope you found this useful. Docker is a fantastic way of building portable environments that you can easily run, and there are so many possibilities and frameworks for building smart solutions; more on that to come. Check out the documentation for OCI Data Integration below. Send me comments, questions and ideas, I would love to hear them.
Related resources
- OCI Object Storage (documentation)
- OCI Data Integration (documentation)
- OCI Container Instances (documentation)