If You Want Data Trust, Data Testing Is Not An Option
You have most likely heard the saying: trust is very hard to gain, but very easy to lose. If you work in data, you know this could not be more true. Becoming data-driven, in other words taking action on the back of data, has been every company's holy grail for a few years now. Yet for people to take action on the back of data, they first need to TRUST the data.
When you work in the analytics team of a fast-growing company, you quickly come to realise that gaining and maintaining trust is not such an easy thing. So many things can derail your efforts to maintain accuracy: frequent product iterations that change the data's logical model often and quickly, an ever-increasing backlog of tasks, a growing number of contributors to a project, and so on. And you know this: every error that gets released to production (some might say "in the wild") hurts the faith your stakeholders have in the data, and therefore their confidence in making decisions on the back of it. But you can stay on top. How? Thanks to data testing.
Why Adopt Data Testing?
Data testing is simply a set of automated processes that check assumptions on the data. Here is a list of common data tests:
- Uniqueness. For example, a user model (i.e. table) should not contain duplicate user ids, as these should be unique.
- Allowed values. For example, an order status could have a pre-defined set of acceptable values.
- Non-emptiness. When you expect a field to never be NULL.
- Data recency. When you expect a model's data to be fresh up to a specific point in time, for example yesterday.
- Row count. Tracking the number of rows in a model over time, to catch undesired deletions.
- And much more, depending on your own needs!
Data testing is not an option. It is actually your best friend when it comes to ensuring a high level of data integrity. Most of the time an error arises, you will realise that there was, unfortunately, no data test on the model.
Data testing has many benefits. Not only will it ensure that your team can maintain a high level of confidence in the data outputs, it will also significantly cut down the time spent on “fixing” issues. And the good news is that data testing is already available out of the box in multiple data modelling packages such as dbt and sayn.
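To illustrate the out-of-the-box support, here is what a few of the common tests might look like in a dbt `schema.yml` file, a minimal sketch in which the `users` and `orders` model names are hypothetical:

```yaml
# models/schema.yml -- minimal sketch; model and column names are illustrative
version: 2

models:
  - name: users
    columns:
      - name: user_id
        tests:
          - unique      # uniqueness check
          - not_null    # non-emptiness check
  - name: orders
    columns:
      - name: status
        tests:
          - accepted_values:   # allowed-values check
              values: ['pending', 'shipped', 'cancelled']
```

Running `dbt test` then evaluates each of these assumptions against the warehouse and fails loudly on any violation.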
Embedding Into Processes
In order to ensure that data testing becomes the default and not the exception, analytics teams need to encourage analysts to really think about the data they are modelling and what it represents, and to define, on the back of that, what data quality “means” for each model. Analysts can then write the tests before even writing the model itself, ensuring they are added early on.
Of course, you will sometimes miss an assumption, but you will most likely have covered 90% of the issues that could happen. And when something does eventually slip through, you simply add a data test that catches the newly uncovered mistake (for example, missing values in a mapping table that lead to NULL entries in a report). Should the issue ever happen again, you will be notified immediately and can fix the mapping model early on, as opposed to waiting for a business stakeholder to spot it.
At 173tech, we take a tiered approach to data testing so we ensure that the output of data models is of the highest quality:
- First, all data models are tested locally during development, so the analyst can ensure that no issue is introduced by the changes under way.
- Second, CI runs automated testing on every pull request, so that if, for any reason, an analyst forgot to run the tests in step one, errors are caught at this level and the changes cannot be merged. In effect, the ability to merge is (partly) conditional on a successful test run.
- Third, all data tests run at the same frequency as the ELT / ETL (depending on which side you prefer the T). That way, every time new data is added and modelled by the automated processes, the new output is tested to ensure it does not violate our data quality assumptions.
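For the second tier, a pull-request check could look roughly like the following GitHub Actions sketch. This assumes a dbt project; the workflow structure and package choices are illustrative, not prescriptive, and would need adapting to your own stack:

```yaml
# .github/workflows/data-tests.yml -- minimal sketch, details are hypothetical
name: data-tests
on: pull_request

jobs:
  dbt-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install dbt-core dbt-postgres
      # dbt build runs models and their tests together; a failure here
      # marks the check red and (with branch protection) blocks the merge.
      - run: dbt build
```

Marking this workflow as a required status check is what makes the merge conditional on the tests passing.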
Together, these tiers let us test all our assumptions on the data across the multiple projects we work on, ranging from simple tests on key uniqueness and allowed values to more complex tests for anomaly detection, and they enable us to maintain multiple ELTs concurrently and efficiently. If anything, data testing is probably one of the most powerful “quick wins” any analytics team can implement to maintain the trust of their stakeholders. Just do it!
How 173tech Can Help
If you are looking to establish a robust data foundation and are not sure where to start, why not reach out to the friendly team today?