Leveraging dbt with OCI Dataflow: Supercharging Your Data Analytics Workflow on OCI

David Allan
7 min read · Jan 31, 2024

In the realm of modern data analytics, the tools and technologies available continue to evolve rapidly, offering more efficient and scalable solutions for managing and processing data. Two such tools that have gained significant traction in recent years are dbt (data build tool) and OCI Dataflow, Oracle’s serverless Spark offering. Combining the power of dbt with the scalability and ease of use of OCI Dataflow can supercharge your data analytics workflow, enabling you to derive valuable insights from your data more efficiently than ever before. In this post you will see how this is done; note that there is one open issue on dbt for the Spark adapter, and a fork with example code is provided.

We take a sample dbt project that was running on Databricks and execute it on OCI Dataflow. Let’s see how this is done.

What is dbt?

Dbt, or data build tool, is an open-source software package that enables analysts and engineers to transform data in their warehouses or data lakes more effectively. It allows users to write SQL queries to transform, model, and document data, providing a robust framework for managing the entire data transformation process.

Dbt is an open-source data modeling tool used by analysts and engineers for writing SQL queries. Data pipelines are defined in plain-text files organized in folders and directories, so deployment, testing, and version control are straightforward once you know what you are doing.

(Image: high-level view of what dbt does)

Not only does dbt simplify complex queries, but it also reduces the need for repetitive code. It turns queries into modular blocks of code that execute data transformation tasks inside your warehouse: dbt compiles them into SQL and runs them against the target system (data lake or database).
To understand dbt, we must understand ELT. ELT stands for Extract, Load, and Transform; this type of architecture focuses on transforming data after it has been loaded. Dbt is the T in ELT: it transforms existing data in your warehouse, but it does not handle extraction or loading, so you will need other tools for those steps.
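To make this concrete, here is a minimal sketch of a dbt model: just a SQL select statement saved in a file, which dbt materializes as a view or table in the target. The model and column names are illustrative, chosen to match the Jaffle Shop sample used later in this post.

-- models/staging/stg_orders.sql (illustrative model name)
-- dbt materializes this select as a view or table named stg_orders
select
    id as order_id,
    user_id as customer_id,
    order_date,
    status
from jaffle_shop.orders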

Introducing OCI Dataflow

OCI Dataflow is Oracle’s fully managed, serverless Apache Spark service, designed to process large-scale data workloads with ease. With OCI Dataflow, users can run Apache Spark applications without having to manage infrastructure, allowing them to focus on building and optimizing their data pipelines instead of worrying about server provisioning, scaling, or maintenance. OCI Dataflow offers seamless integration with other Oracle Cloud services, making it an ideal choice for organizations looking to harness the power of Spark for their data analytics needs.

OCI Dataflow provides regular Spark application support using Spark Scala and PySpark, along with Spark SQL. It also provides a SQL endpoint capability, allowing BI tools to use Spark to query Delta tables, databases, and so on.

OCI Dataflow SQL Endpoint provides a serverless, scalable environment for running SQL queries directly against data stored in object storage, data lakes or databases like Oracle Autonomous Database. This eliminates the need for managing infrastructure and allows for seamless integration with existing data sources.

Combining dbt with OCI Dataflow

By integrating dbt with OCI Dataflow, organizations can leverage the strengths of both tools to create a streamlined and efficient data analytics workflow. Using the dbt-spark adapter, dbt generates Spark SQL and executes it on OCI Dataflow; the connection goes through the OCI Dataflow SQL endpoint via its ODBC driver.

This gives you:

  1. Data Transformation with dbt: Analysts and engineers can use dbt to define their data transformation logic using SQL-based models. Dbt’s modular approach allows for easy organization and management of complex data pipelines, making it simple to maintain and iterate upon (see the ref() sketch after this list).
  2. Seamless Integration with OCI Dataflow: Once the data transformation logic is defined using dbt, it can be seamlessly integrated with OCI Dataflow for execution. OCI Dataflow can be configured to run Spark jobs based on the dbt models, allowing for scalable and efficient processing of large volumes of data.
  3. Automated Deployment and Orchestration: With the help of automation tools and CI/CD pipelines, organizations can automate the deployment and orchestration of dbt models and OCI Dataflow jobs. This ensures consistency and reliability in the data analytics workflow, reducing the risk of errors and accelerating time-to-insight.
  4. Scalability and Performance: OCI Dataflow’s serverless architecture ensures that resources are automatically provisioned and scaled based on workload demands, enabling organizations to handle even the most demanding data processing tasks with ease. This scalability, combined with dbt’s efficient data transformation capabilities, allows for high-performance analytics at any scale.
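To illustrate the modular approach from point 1, dbt models reference each other with the ref() macro; dbt uses these references to build a dependency graph and run models in the correct order. A minimal sketch, with illustrative model names building on the stg_orders example above:

-- models/marts/fct_orders.sql (illustrative)
-- ref() resolves to the stg_orders model and records the dependency,
-- so dbt always builds stg_orders before this model
select
    order_id,
    customer_id,
    order_date,
    status
from {{ ref('stg_orders') }}
where status != 'returned'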

Benefits of Using dbt with OCI Dataflow

  • Increased Efficiency: By combining dbt with OCI Dataflow, organizations can streamline their data analytics workflow, reducing manual effort and accelerating time-to-insight. For example, both incremental extraction and MERGE are supported on the Delta format, as well as on many target database systems (see the incremental model sketch after this list).
  • Scalability: OCI Dataflow’s serverless architecture ensures that resources are automatically scaled based on workload demands, allowing organizations to handle large-scale data processing tasks with ease.
  • Simplified Management: Dbt’s modular approach to data transformation and OCI Dataflow’s managed infrastructure simplify the management and maintenance of data pipelines, reducing operational overhead.
  • Improved Collaboration: Dbt’s focus on code reusability and version control promotes collaboration among team members, enabling them to work together more effectively on data analytics projects.
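As a sketch of the incremental and MERGE support mentioned above, here is roughly what an incremental model with a merge strategy on Delta looks like with the dbt-spark adapter (model and column names are illustrative):

{{
  config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='order_id',
    file_format='delta'
  )
}}

-- On the first run dbt creates the table; on later runs it MERGEs only
-- new or changed rows into the existing Delta table
select * from {{ ref('stg_orders') }}
{% if is_incremental() %}
  where order_date > (select max(order_date) from {{ this }})
{% endif %}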

Data Mart on Delta data in OCI Object Storage

This example leverages the dbt demo from Databricks (clone here) but runs on OCI. No changes were made to the dbt project code; only the connection to OCI needs to be configured and the source data staged beforehand. The demo builds a staging layer and a data mart with a sample dimension and fact. All of the cool dbt features like snapshots for slowly changing dimensions could also be used!

Let’s see how to get this up and running.

Follow the regular install instructions for dbt, but for dbt-spark install this fork; there is an open feature request here.

git clone https://github.com/davidallan/dbt-spark.git
pip install ./dbt-spark

You will need an OCI profile added as the DEFAULT profile in ~/.oci/config to connect to OCI and run the code in your OCI Dataflow SQL endpoint. Create the SQL endpoint first; it also needs a metastore, which you create in OCI Data Catalog.
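The profile uses the standard OCI SDK/CLI configuration format; a minimal sketch is below (the OCIDs, fingerprint, and key path are placeholders for your own values):

[DEFAULT]
user=ocid1.user.oc1..<your_user_ocid>
fingerprint=<your_api_key_fingerprint>
tenancy=ocid1.tenancy.oc1..<your_tenancy_ocid>
region=us-ashburn-1
key_file=~/.oci/oci_api_key.pem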

Download and install the ODBC driver for your SQL endpoint; you can find all of that information on the property page for your SQL endpoint, in the section titled ‘Connect’.

I was on a Mac and installed the driver in the location shown below. Create an entry in your ~/.dbt/profiles.yml file and change the host name to your SQL endpoint host, shown in the endpoint’s ‘Show details’ panel. The other values should be fine (you must use sparkservertype=DFI; this is mandated by the Simba ODBC driver).

oci_dataflow:
  target: local
  outputs:
    local:
      type: spark
      method: odbc
      driver: /Library/simba/ocispark/lib/universal/libsparkodbc_sbu.dylib
      host: yourhost.interactive.dataflowclusters.us-ashburn-1.oci.oraclecloud.com
      port: 443
      #user: dbt
      schema: default
      sparkservertype: DFI
      connect_retries: 5
      connect_timeout: 60
      retry_all: true

With that done you can now reference the profile in your project’s dbt_project.yml:

profile: 'oci_dataflow'
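For context, the surrounding dbt_project.yml might look like this minimal sketch (the project name is illustrative; keep whatever the demo project already defines and just set the profile):

# dbt_project.yml (minimal, illustrative)
name: 'jaffle_shop'
version: '1.0.0'
profile: 'oci_dataflow'   # must match the entry in ~/.dbt/profiles.yml
model-paths: ["models"]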

Try testing the connection:

dbt debug

You should then see a successful connection test.

You can then get going with the interesting stuff!

I used DBeaver to create external tables for the Jaffle Shop data. You will need to replace the bucket name with the one for your metastore, plus your OCI Object Storage namespace and the folders where your orders, customers, and payments data reside. Place the data files (copy from here; link TBD) in your external folder; I placed mine in the same bucket as the metastore below:

CREATE SCHEMA jaffle_shop;
CREATE SCHEMA stripe;

CREATE TABLE jaffle_shop.orders (id INT, user_id INT, order_date STRING, status STRING)
USING CSV
OPTIONS (path "oci://yourmetastorebucket@yournamespace/external/orders/",
         delimiter ",",
         header "true");

CREATE TABLE jaffle_shop.customers (id INT, first_name STRING, last_name STRING)
USING CSV
OPTIONS (path "oci://yourmetastorebucket@yournamespace/external/customers/",
         delimiter ",",
         header "true");

CREATE TABLE stripe.payments (id INT, orderid INT, paymentmethod STRING, amount INT, created STRING)
USING CSV
OPTIONS (path "oci://yourmetastorebucket@yournamespace/external/payments/",
         delimiter ",",
         header "true");
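As an optional sanity check, you can query the external tables from DBeaver before running dbt, to confirm they resolve to the CSV files staged in object storage:

-- quick verification that the external tables read the staged files
SELECT count(*) FROM jaffle_shop.orders;
SELECT * FROM stripe.payments LIMIT 5;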

Other than that, you can run the dbt-databricks-demo on OCI using dbt run.
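For example, from the project directory:

dbt run    # compile the models to Spark SQL and execute them on the SQL endpoint
dbt test   # optionally run the schema and data tests defined in the project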

All of the other cool dbt features are available too, like snapshots, incremental extraction, data quality checks, and much more! For example, a snapshot for a slowly changing dimension might look like the sketch below.
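A minimal snapshot sketch tracking status changes on the orders table as a type-2 slowly changing dimension (the snapshot name and target schema are illustrative), saved under the project’s snapshots/ folder and run with dbt snapshot:

{% snapshot orders_snapshot %}

{{
  config(
    target_schema='snapshots',
    unique_key='id',
    strategy='check',
    check_cols=['status']
  )
}}

-- dbt adds validity columns (dbt_valid_from/dbt_valid_to) and keeps a
-- new row each time a checked column changes
select * from jaffle_shop.orders

{% endsnapshot %}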

Orchestrating this in OCI? You can containerize the dbt client and run it on OCI’s serverless Container Instances service, or just install it on your own compute!
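A minimal Dockerfile sketch for such a container is below. This is illustrative only: you would also need to install the Linux build of your SQL endpoint’s ODBC driver in the image, and you should mount your OCI keys and profiles.yml securely at runtime rather than baking them in.

FROM python:3.10-slim
RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
# install the dbt-spark fork referenced earlier in this post
RUN git clone https://github.com/davidallan/dbt-spark.git && pip install ./dbt-spark
# copy the dbt project into the image
COPY . /app
WORKDIR /app
ENTRYPOINT ["dbt"]
CMD ["run"]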

Conclusion

By leveraging the combined power of dbt and OCI Dataflow, organizations can create a robust and scalable data analytics workflow that enables them to derive valuable insights from their data more efficiently than ever before. Whether you’re processing terabytes of data or building complex data pipelines, dbt with OCI Dataflow provides the tools and infrastructure you need to succeed in today’s data-driven world.

Further information:

OCI Dataflow SQL Endpoints: https://docs.oracle.com/en-us/iaas/data-flow/using/sql-endpoints.htm

OCI Data Catalog Metastore: https://docs.oracle.com/en-us/iaas/data-flow/using/sql-endpoints-metastore-create.htm

dbt: https://www.getdbt.com/


David Allan

Architect at @Oracle. The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.