Data Warehouse Learnings

Setting up a data warehouse solution is relatively straightforward, especially since there are many cloud solutions available. But as with many things that appear to be obvious, the devil is in the details.

The data we imported was mobile marketing data, everything related to cost, performance and conversion rate of in-app mobile advertising. I suspect there is nothing inherently special about this type of data, all types of data would face similar issues.

I will begin with a quick overview of the setup and then explain what we did and what we learned by doing it.


The basis for a data warehouse setup is ETL pipelines: Extract, Transform and Load. ETL breaks the process of filling a data warehouse into three distinct steps, each depending on the previous one.

Extract is retrieving data from various sources on a regular basis. The sources come in various formats, JSON and CSV being the most popular, and can be anything from an S3 bucket, some sort of API, an SQL database or even an FTP server.

Transforming takes the original data and harmonises it with the final data schema. Since all data sources will differ, this step is unique for each data source. Data is also normalised: dates are assigned the same time zone, mobile app ids are mapped to common representations, currencies may be converted, etc.

Load pushes data into the data warehouse and is the only component that interacts with it directly. It also ensures that data duplication doesn't occur: the load step is responsible for identifying the columns that, when combined, uniquely identify the datapoints being loaded.

Microsoft Azure provides a hosted Spark setup for data handling and Jupyter notebooks for programmatically controlling the steps in the ETL pipeline. This product is called Databricks and includes Delta Lake for data storage.

Azure Data Factory acted as the scheduler, triggering Jupyter notebooks that extracted, transformed and loaded the data into Databricks. Our programming language became Python.

Jupyter notebooks aren’t the ideal coding environment. It’s non-trivial to get them into a source versioning system and using the browser as a coding environment is sub-optimal.

To improve this, we created Python libraries that we embedded into the Jupyter environment and referenced from the Jupyter notebooks. Python libraries improved code reuse by providing an encapsulated codebase instead of repeating code in various Jupyter notebooks. As part of the deployment, the library was updated in the Jupyter runtime environment.

The libraries were tested separately using test data, so each ETL step could be tested independently. For this, a Docker image was created to imitate the production environment as closely as possible, based on the PySpark Docker image.
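
Testing a transform against fixture data might look like the sketch below. The function name, the record layout and the harmonisation rules are all hypothetical, not the actual library's API; the point is that the logic lives in a plain Python library that a test runner can exercise outside the notebook.

```python
# Hypothetical transform for one provider: harmonise raw report rows
# into the common schema (lower-cased app ids, spend as a USD float).
def transform_report(rows):
    out = []
    for row in rows:
        out.append({
            "app_id": row["AppId"].strip().lower(),
            "spend_usd": float(row["Spend"].replace("$", "")),
        })
    return out

def test_transform_report():
    # Fixture data standing in for one archived raw report.
    fixture = [{"AppId": " Com.Example.Game ", "Spend": "$12.50"}]
    result = transform_report(fixture)
    assert result == [{"app_id": "com.example.game", "spend_usd": 12.5}]
```

A test suite like this runs in the PySpark-based Docker image on every build, so a broken transform fails before deployment rather than in the nightly pipeline.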

Storage is cheap, backups are good, and losing data is bad. So keeping the original data around is a good idea. We kept it in its original, untouched form so that the transformers and loaders could be applied directly to the archived data.

Another advantage of storing the original data is cost: pulling old data from the original sources again can be more expensive than keeping it. Cloud providers charge for network traffic, so retrieving six months of old data from a source might well cost more than storing six months of data.

To cope with format changes, for example when extra columns are added or field formats change, the transformers needed to stay backward compatible. Importing all available data dimensions from the original data source helps to mitigate such changes.

When retrieving reports from external data sources, select all columns available for a report, even if not all of them will initially be imported. By obtaining all data from external sources from the beginning, that data is already historically available, locally, the moment it's needed.
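
A transform can be written to tolerate this kind of schema drift. The sketch below uses illustrative column names: it fails loudly if a required column disappears, ignores columns the source adds, and fills optional columns with a default.

```python
# Columns the common schema requires, and optional ones with defaults.
# These names are illustrative, not the actual warehouse schema.
EXPECTED = {"date", "app_id", "spend"}
OPTIONAL_DEFAULTS = {"country": "unknown"}

def normalise_row(raw):
    # Fail loudly if the source drops a required column...
    missing = EXPECTED - raw.keys()
    if missing:
        raise ValueError(f"source dropped required columns: {missing}")
    row = {k: raw[k] for k in EXPECTED}
    # ...tolerate newly added columns (ignored here, but still present
    # in the untouched archive), and default any missing optional ones.
    for key, default in OPTIONAL_DEFAULTS.items():
        row[key] = raw.get(key, default)
    return row
```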

In combination with archiving and having all data dimensions, it becomes that much simpler to extend existing data in the warehouse. There is always data missing and having the data within reach can be invaluable.

We clearly separated each ETL step in the codebase. Effectively, we had a class for each data provider, and each class had three basic methods: extract, transform and load.

The extract method retrieved data and stored it in Delta Lake, our storage layer on Azure. The transform method took the files stored in Delta Lake and harmornalised (i.e., harmonised and normalised) the data. The load method took the harmornalised data and loaded it into the data warehouse based on a set of unique columns.

This established clear responsibility boundaries within the codebase and allowed for better code reuse.
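
A minimal sketch of that per-provider class layout, with the storage and warehouse as injected stand-ins rather than the real Databricks or Delta Lake APIs:

```python
class ProviderPipeline:
    """One class per data provider, one method per ETL step.
    All collaborators here are placeholders, not real cloud APIs."""

    def __init__(self, fetch, harmornalise, archive, warehouse):
        self.fetch = fetch                # provider-specific extract call
        self.harmornalise = harmornalise  # provider-specific row mapping
        self.archive = archive            # dict-like raw-data store
        self.warehouse = warehouse        # anything with an upsert()

    def extract(self, day):
        raw = self.fetch(day)
        self.archive[f"raw/{day}.json"] = raw  # keep the original data
        return raw

    def transform(self, raw):
        return [self.harmornalise(row) for row in raw]

    def load(self, rows, unique_cols):
        # Idempotent by design: merging on unique columns, not appending.
        self.warehouse.upsert(rows, unique_cols)
```

Keeping the three methods on one class per provider means the boundaries between steps are visible in the code itself, and each step can be swapped or re-run in isolation.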

Load methods were explicitly designed to be repeatable without causing data duplication, so we could execute them on the same data multiple times without duplicating datapoints in the data warehouse.

This flexibility allowed re-running load scripts if something went wrong and data was only partially loaded. It's far simpler to reimport all missing data than to clean the data warehouse before reimporting, especially if multiple data warehouses are being fed with data.

Uniqueness for ensuring non-duplication was often based on a combination of columns. Special care needed to be taken when handling NULL values: Databricks and Snowflake have different strategies for handling NULLs in columns used for merging data into the warehouse.
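
The idea behind an idempotent load can be shown with an in-memory merge. This is a toy stand-in for a warehouse MERGE, not the Databricks or Snowflake implementation: rows are keyed on the combination of unique columns, so loading the same batch twice overwrites rather than duplicates, and `None` is part of the key so two NULLs compare as equal (the point on which real engines differ).

```python
def merge_rows(existing, incoming, unique_cols):
    """Merge incoming rows into existing ones on a composite key.
    Illustrative only; a real load would issue a warehouse MERGE."""

    def key(row):
        # None goes into the tuple as-is, so (None, "a") == (None, "a"):
        # the equivalent of a null-safe comparison in the merge condition.
        return tuple(row.get(col) for col in unique_cols)

    table = {key(row): row for row in existing}
    for row in incoming:
        table[key(row)] = row  # insert or overwrite, never duplicate
    return list(table.values())
```

Because the merge is keyed rather than appended, re-running a failed load on the same batch is safe by construction.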

Initially, Databricks was used as the data warehouse solution but was later replaced by Snowflake; Databricks was simply too slow as an analytics tool. So there was a certain overlap where our loaders were filling two different data warehouses.

This clear separation of responsibility made it easy to replace the data warehouse solution, since the only component that needed reworking was the load methods. In the interim, we were able to fill both data warehouses in parallel.

Our structure was complete, however it wasn't semantically stable. By semantic, I mean that the data was in constant flux: data went missing, data fields changed, assumptions about uniqueness changed, and many more little things required constant watchfulness and attention to detail.

Most semantic checks happen in the transform step. An example would be checking assumptions about currency: if you assume that all monetary values are in USD, then verify that in the transform step. This might be as simple as checking for a dollar symbol or the string 'USD' in a monetary value, or checking the currency column if one is explicitly present.
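
Such a check can be a few lines in the transform. The helper below is a sketch under the USD assumption described above; the accepted formats are illustrative.

```python
def parse_usd(value, currency=None):
    """Parse a monetary value, failing fast if the USD assumption breaks.
    Accepted formats here ('$1.50', '2.00 USD') are illustrative."""
    if currency is not None and currency != "USD":
        raise ValueError(f"expected USD, got {currency!r}")
    text = str(value).strip()
    if text.startswith("$"):
        text = text[1:]
    elif text.endswith("USD"):
        text = text[:-3].strip()
    return float(text)
```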

If the data semantics are unexpected or can't be handled, the transform step fails, preventing the load step from loading inconsistent data into the warehouse. Whether partially loading data is better than rejecting the entire dataset is debatable; it depends on the context what exactly should be done.

Data sources usually have APIs to request various reports and datasets. These APIs have a number of parameters for defining the exact datapoints to be retrieved, and most of these parameters have default values. Avoid assuming that default values won't change. Be explicit and set each parameter, even if using the default value.

An example of this would be explicitly setting the format of a report. The default format might well be the one required, but setting the parameter that controls it ensures that this is future proof. Start and end dates are another case: some providers apply a default date range (e.g., the last 14 days), so always set the date range explicitly.
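
In practice this means building the request parameters with every value spelled out. The parameter names below are for a hypothetical reporting API, not any particular provider's:

```python
from datetime import date

def report_params(day: date) -> dict:
    """Request parameters for one day of a hypothetical report API.
    Every parameter is pinned, even where it matches today's default."""
    return {
        "format": "csv",              # pin it, even if csv is the default
        "start_date": day.isoformat(),
        "end_date": day.isoformat(),  # never rely on a 'last 14 days' default
        "timezone": "UTC",
        "columns": "date,app_id,spend,installs",  # all available columns
    }
```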

CSV reports are also a potential hazard. Both the escaping character and the field separator vary quite a bit. If possible, explicitly specify both of these.
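
With Python's standard `csv` module, both can be pinned when parsing; the semicolon delimiter below is just an example of one provider's dialect.

```python
import csv
import io

def read_report(text: str) -> list:
    """Parse a CSV report with the dialect spelled out explicitly,
    rather than guessed. The dialect shown is illustrative."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=";",      # this hypothetical provider uses semicolons
        quotechar='"',
        escapechar="\\",
    )
    return list(reader)
```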

Source data often doesn't have clearly unique fields. Perhaps this is something particular to the marketing world, but we had a few occasions where assumptions about uniqueness were incorrect. In some cases, the uniqueness was even documented incorrectly.

Uniqueness assumptions can be checked by comparing the row count of the transformed data with the count of distinct combinations of the supposedly unique columns. The two counts should be equal.
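
As a sketch, the check reduces to comparing row count against distinct key count before loading (shown here on plain dicts rather than a DataFrame):

```python
def assert_unique(rows, unique_cols):
    """Fail if the supposedly unique column combination has duplicates."""
    keys = {tuple(row.get(col) for col in unique_cols) for row in rows}
    if len(keys) != len(rows):
        raise ValueError(
            f"{len(rows) - len(keys)} duplicate rows for key {unique_cols}"
        )
```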

Check regularly for missing data; gaps can occur either because data wasn't imported or because it was never generated by the original data provider. If data is imported on an hourly basis, then check for missing data on an hourly basis.

Simple counts grouped by date can be sufficient. More specific counts depend on the data context: in mobile marketing, for example, grouping by application and date is also very insightful, i.e., is data missing for a specific application on that date?
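
An hourly version of such a check can be as small as the sketch below; the row shape is hypothetical, and a real monitor would run the equivalent grouped count as a warehouse query.

```python
from collections import Counter

def missing_hours(rows, expected_hours=range(24)):
    """Return the hours of the day that have no imported rows at all.
    Each row is assumed to carry an integer 'hour' field."""
    seen = Counter(row["hour"] for row in rows)
    return [hour for hour in expected_hours if seen[hour] == 0]
```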

We ended up creating monitoring tasks that ran regularly to check these and provider-specific assumptions. Having an early warning system for breaking assumptions makes for better sleep at night.

Our data warehouse pipeline was complete and worked well. There were many other learnings we made along the way.

Probably the most important learning was having monitoring and sanity checks to ensure that the data was correct. Data warehouse technology really is no longer rocket science; most problems come down to data consistency and failure handling.

One thing to keep in mind: data warehouse solutions are often cloud based, fed with data from external services. This can lead to third-party sources causing issues: downtime of one service can mean that data is meaningless for a few hours. It is difficult to fix problems that are outside of one's sphere of influence.

Thanks for reading!

Software Developer & Architect.