Building World-Class Analytics For Startups (Part 2).
March 2, 2020 | By .
In the first part of “Building World-Class Analytics For Startups”, we described how to best get started with advanced analytics through the implementation of five core layers: Extraction, Data Modelling, Reporting, Analysis and Data Science. This is the second part of this series and gives more detail about each layer. It recaps what the layer is about, what matters for each layer and provides some suggested tools. Those suggestions are based on today’s analytics ecosystem — analytics is an extremely fast-paced sector so it is always worth looking out for potential new solutions.
Exploring The Layers
Before we dive in, it might be worth doing a small refresh and having a look at the architecture of the five layers again:
All set? We will now look into each layer individually. Please note that, although we suggest potential tools, you should always assess all possible options before committing. The right tool will always be the tool that fits your own requirements.
The Extraction Layer
What is it?
The “Extraction Layer” consists of automated processes to extract data from your various sources such as app & website events, replica databases, digital marketing platforms, etc. and load them into a data warehouse such as Snowflake, Redshift, BigQuery etc.
- For this layer, the key is to assess the trade-offs between flexibility and automation. When getting started with analytics, you want your team to spend the least time on extracting and maintaining data processes.
- There are several tools available to automate data extraction. Those simply help you to move data from various services into your data warehouse without writing a line of code. It is worth investigating those to save data engineering resources within your team.
- However, make sure that whichever tool you choose (if any) for this stage extracts and loads the data in an efficient way based on how you need to extract data. Extractors can sometimes be built in a one-size-fits-all way which might force you to make inefficient transformation with SQL after the data has been loaded. In such cases, a custom Python extract would be best - you can use SAYN (our open source data processing framework tuned for simplicity and flexibility) for this.
Fivetran, Alooma (if you are on the Google Cloud Platform ecosystem) and Stitch for off-the-shelves providers. SAYN for custom Python extractors.
The Modelling Layer
What is it?
The “Modelling Layer” consists of automated transformation processes that organise the data in your data warehouse so it is easily consumed by reporting tools or analysts.
- First and foremost, make sure you have a modelling layer. Not having one will create extremely inefficient and messy processes down the line. This will hinder data accuracy and increase the time spent by analysts and data scientists on data processing.
- Modelling creates a single “source of truth”. Because all your analysts and reporting tools will end up using the same pre-processed “models” (i.e. data based on standardised and documented definitions).
- Modelling is, in essence, a set of tasks (e.g. SQL queries) running in a specific order to create tables or views (i.e. “models”) that can be directly consumed by reporting tools or analysts and data scientists. Because there is a specific order, you want to be able to define relationships between tasks (i.e. which jobs need to be run before job X is processed).
- For example, if your team ends up writing SQL queries to calculate retention often, you could create a model that stores the retention status of each user. In that way, analysts can directly read the retention model to calculate retention instead of re-writing the query themselves.
We recommend using a solution that lets you orchestrate tasks with both SQL and Python. Keep in mind that flexibility is key to long-term scalability. Even if you use an automated extraction tool in your extraction layer, it is quite likely you will need some non-SQL processes (e.g. Python) in order to extract data for which no extractor is provided or for data science. Our open source data processing and modelling framework called SAYN is perfect for that!
The Reporting Layer
What is it?
The reporting layer consists of a reporting tool that sits on top of your data models and enables non technical users to visualise and digest the data. This covers all the dashboards you do for business users in order to automate insights.
- Reporting is the front-end of your analytics team. This is what all business users see and use every day. Make sure reporting is accurate and easily digestible, otherwise you will seriously hinder the trust in data and analytics within your organisation.
- Use a tool which has a simple user interface and makes data easy to consume for business users. Having a drag-and-drop option is a powerful feature as this means any user can build their own dashboards. This also means you do not need to write SQL for every visualisation (and re-write SQL when you need to make modifications!).
- One of the key purposes of reporting is to automate repeated data requests and release analytics resources so your team can really spend time finding insights and generating business value.
- Ensure your reporting tool has a level of logical model (i.e. it enables you to define some metrics based on your underlying data). However, make sure that all your heavy data processing is done in the modelling layer — you want to avoid having business logic living in different places.
- Ensure the tool has an efficient workflow. You want making changes to be easy and not error prone. Again, reliability is key for reporting — you need to minimise downtime.
Looker, Metabase (free and a good option to get started), Tableau.
The Analysis Layer
What is it?
The “Analysis Layer” is a workflow for analysts to do in-depth analyses of the data. This layer is different than the reporting layer in the sense that it is used to investigate specific topics in much greater detail than you would do with dashboards. If you end up repeating an analysis frequently, it should then be moved to the reporting layer with an automated dashboard.
- For this layer, you need an easy workflow for your analysts to explore data and present insights. Jupyter notebooks are a good option, they are widely used and have kernels for many programming languages.
- Using a programming language and notebooks over Excel also has many other advantages. It brings transparency to data processing, enables automated re-runs (as you only need to change the underlying dataset) on top of allowing you to handle larger datasets as well as leveraging all the programming languages’ modules.
- Have a process to make the reports easily shareable across the organisation. You may consider using JupyterHub or hosting on a web portal the output of Jupyter notebooks (which supports export of various formats).
Jupyter Notebooks, JupyterHub.
The Data Science Layer
What is it?
The “Data Science Layer” is a collection of models and algorithms predicting user behaviours and running on production. Those can be refreshed at any desired frequency with results embedded into other systems for optimisation purposes. The data science layer should be entered once you have a robust structure on all other ones.
- Implementing data science models on production is a really broad topic and depends on your overall architecture, so I will not get into details in this post.
- For the exploration phase (i.e. building the model), this can be done through the “Analysis Layer” using Jupyter notebooks where your data scientists explore various solutions and optimise model performance.
- Creating API endpoints for your data science models to power other parts of your business (e.g. customised user experience, optimised marketing spend, tailored CRM flow, etc.) is often a good solution to begin with.
Jupyter notebooks for exploration and Flask applications in order to make model predictions available to other services.
Bringing The Layers To Life
The layers discussed will be helpful in designing your own analytics infrastructure. In addition, they will help you ensure scalability of your infrastructure and processes in the long term. We know things can sometimes get tricky when it comes to the implementation stage, especially if you are getting started with analytics. We are always happy to chat and see how we could help, so don’t hesitate to get in touch!