Handling Schema Drift in Apache Spark

David Allan
2 min read · Jan 7, 2019


This is a series of posts that illustrates how you can handle changes in the data you process in a cost-effective manner. Schema changes can be absorbed by your ETL without manual intervention, lowering your overall TCO and letting data flow around your business with ease.

Golden Gate Bridge from the Marin Headlands

Apache Spark, together with functional programming languages such as Scala (and Java 8+), lets you build implementations that survive longer than their initial version. In the beauty of Spark posts below you can see this in action.

Part 1: transforming sets of columns by datatype, in a low-maintenance and flexible manner.
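To give a flavor of the Part 1 technique, here is a minimal sketch of transforming by datatype: the helper name `trimAllStrings` is hypothetical, but the pattern of driving a transformation from the DataFrame's own schema is the point. Because the string columns are discovered at runtime, a drifted schema with new string columns needs no code change.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, trim}
import org.apache.spark.sql.types.StringType

// Hypothetical helper: trim every StringType column, however many
// such columns the incoming schema happens to carry.
def trimAllStrings(df: DataFrame): DataFrame =
  df.schema.fields
    .filter(_.dataType == StringType)
    .foldLeft(df)((acc, field) => acc.withColumn(field.name, trim(col(field.name))))
```

Any datatype-keyed rule (casting decimals, null-padding timestamps, and so on) follows the same filter-then-fold shape.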

Part 2: transforming sets of columns matching a name pattern.
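The name-pattern idea from Part 2 can be sketched the same way; `transformMatching` is an illustrative name, not the series' actual API. A regex selects the columns, and a caller-supplied column expression is folded over the matches:

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, upper}

// Hypothetical helper: apply a column expression to every column whose
// name matches a regular expression, e.g. uppercase all *_code columns.
def transformMatching(df: DataFrame, pattern: String, f: Column => Column): DataFrame =
  df.columns
    .filter(_.matches(pattern))
    .foldLeft(df)((acc, name) => acc.withColumn(name, f(col(name))))
```

A call such as `transformMatching(df, ".*_code", upper)` then tracks naming conventions rather than a hard-coded column list, so new `*_code` columns are handled automatically.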

Part 3: aggregating data in a flexible manner.
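For the flexible aggregation of Part 3, one possible sketch (again with a hypothetical helper name) builds the aggregation map from the schema instead of hard-coding it, so numeric columns that arrive later through schema drift are aggregated without a code change:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.NumericType

// Hypothetical helper: aggregate every numeric column with one function
// name (e.g. "sum"), grouped by the caller-supplied key columns.
def aggregateNumerics(df: DataFrame, keys: Seq[String], fn: String): DataFrame = {
  val numericAggs = df.schema.fields
    .filter(f => f.dataType.isInstanceOf[NumericType] && !keys.contains(f.name))
    .map(f => f.name -> fn)
    .toMap
  df.groupBy(keys.map(df.col): _*).agg(numericAggs)
}
```

Spark names the resulting columns `fn(column)`, e.g. `sum(amount)`; a production version might also rename them or take a per-type function map.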

As you can see, this approach keeps ETL development simple and minimizes code maintenance, since common schema changes are handled without human intervention or constant monitoring for drift.



Written by David Allan

Architect at @Oracle. The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.
