Handling Schema Drift in Apache Spark

David Allan
2 min read · Jan 7, 2019


This is a series of posts that illustrates how you can handle changes in the data you process in a cost-effective manner. Schema changes can be absorbed by your ETL without manual intervention, lowering your overall TCO and letting data flow around your business with ease.

Golden Gate Bridge from the Marin Headlands

Apache Spark, together with functional programming languages such as Scala (and Java 8+), lets you build implementations that survive longer than their initial version. In the beauty of Spark posts below you can see this in action.

Part 1: transforming sets of columns by datatype, in a low-maintenance and flexible manner.
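To give a flavor of the Part 1 technique, here is a minimal sketch of transforming by datatype: the helper name `trimAllStrings` is hypothetical, but the pattern of driving a transformation from the DataFrame's own schema is the point. Because the string columns are discovered at runtime, a drifted schema with new string columns needs no code change.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, trim}
import org.apache.spark.sql.types.StringType

// Hypothetical helper: trim every StringType column, however many
// such columns the incoming schema happens to carry.
def trimAllStrings(df: DataFrame): DataFrame =
  df.schema.fields
    .filter(_.dataType == StringType)
    .foldLeft(df)((acc, field) => acc.withColumn(field.name, trim(col(field.name))))
```

Any datatype-keyed rule (casting decimals, null-padding timestamps, and so on) follows the same filter-then-fold shape.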

Part 2: transforming sets of columns matching a name pattern.
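The name-pattern idea from Part 2 can be sketched the same way; `transformMatching` is an illustrative name, not the series' actual API. A regex selects the columns, and a caller-supplied column expression is folded over the matches:

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, upper}

// Hypothetical helper: apply a column expression to every column whose
// name matches a regular expression, e.g. uppercase all *_code columns.
def transformMatching(df: DataFrame, pattern: String, f: Column => Column): DataFrame =
  df.columns
    .filter(_.matches(pattern))
    .foldLeft(df)((acc, name) => acc.withColumn(name, f(col(name))))
```

A call such as `transformMatching(df, ".*_code", upper)` then tracks naming conventions rather than a hard-coded column list, so new `*_code` columns are handled automatically.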

Part 3: aggregating data in a flexible manner.
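For the flexible aggregation of Part 3, one possible sketch (again with a hypothetical helper name) builds the aggregation map from the schema instead of hard-coding it, so numeric columns that arrive later through schema drift are aggregated without a code change:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.NumericType

// Hypothetical helper: aggregate every numeric column with one function
// name (e.g. "sum"), grouped by the caller-supplied key columns.
def aggregateNumerics(df: DataFrame, keys: Seq[String], fn: String): DataFrame = {
  val numericAggs = df.schema.fields
    .filter(f => f.dataType.isInstanceOf[NumericType] && !keys.contains(f.name))
    .map(f => f.name -> fn)
    .toMap
  df.groupBy(keys.map(df.col): _*).agg(numericAggs)
}
```

Spark names the resulting columns `fn(column)`, e.g. `sum(amount)`; a production version might also rename them or take a per-type function map.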

As you can see, this approach keeps ETL development simple and minimizes code maintenance, since common schema changes are handled without human intervention or constant monitoring for drift.



Written by David Allan

Architect at @Oracle. The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.
