Best Practices: Data Stack

A data stack is a collection of tools and technologies that work together to collect, store, process, and analyse data across an organisation. It forms the foundation of modern data infrastructure, enabling teams to turn raw data into actionable insights. As an impartial agency, we recommend tools based on scalability, integration ease, performance, and community support. Our recommendations aim to suit a range of needs, from startups to enterprise environments, without bias toward any vendor.

Modern Data Stack Components

Data Sources

These are the various systems that generate and collect data, ranging from your website, marketing channels and attribution tools to CRM systems, backend and operational databases and more. A scaling company in its first few years will typically have around 10 data sources.

Data Extraction

Your sources of data usually sit in silos, making it difficult to draw insights and conclusions across the customer journey. Ideally, all information should be connected within a single source of truth. To do so, you first need to bring all of your data together. Data extraction is the process of retrieving data from multiple sources into a single destination.

Data Storage

This is where all your data sources are centralised. The main types are data warehouses, data lakes or lakehouses. In general, data warehouses are designed to store structured data, data lakes for structured, semi-structured and unstructured data, and lakehouses support a hybrid approach.

This is likely to be the highest-cost component within your data stack, so you need to consider your business use cases (today and in the future), scalability and cost upfront. The main options here are BigQuery, Snowflake, Redshift, Microsoft Azure SQL and Databricks.

Data Modelling

Once source data lands in your chosen data storage, it goes through a process of transformation based on your unique business logic. The resulting set of data models, organised by business concept, is the source of truth for all downstream analytics needs of the data team and business users.

The most widely used tool for data modelling is dbt. It is SQL-based, has a large open-source community, and is designed for both data engineers and analysts to contribute to the pipeline.

Data Analysis

While most business users rely on dashboards to monitor trends and the health of the business, the analysis layer is where you deep-dive into specific topics in much greater detail. It is typically used by data analysts looking for the “why” behind the “what”.

To perform deep-dive analysis, you need a workflow to query, explore and present data. Jupyter Notebook is a good option, with Python and SQL as the main programming languages. The main benefits over Excel spreadsheets are the ability to leverage a wide range of Python libraries for data processing and to incorporate machine learning models; the ability to combine code, charts and text explanations in one document for both exploration and presentation; and the ease of sharing and rerunning notebooks. However, if you find yourself repeating the same analysis, consider moving it to the reporting layer as an automated dashboard.
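
As an illustration, here is a minimal notebook-style sketch of that query-explore-present loop, assuming a BigQuery warehouse and a hypothetical modelled table called analytics.fct_orders:

```python
# Minimal notebook-style sketch: query the warehouse, explore in pandas, present a chart.
# Assumes a BigQuery warehouse and a hypothetical modelled table analytics.fct_orders.
import matplotlib.pyplot as plt
from google.cloud import bigquery

client = bigquery.Client()  # uses your default Google Cloud credentials

sql = """
SELECT DATE_TRUNC(order_date, MONTH) AS month,
       SUM(total_amount) AS revenue
FROM analytics.fct_orders
GROUP BY month
ORDER BY month
"""
df = client.query(sql).to_dataframe()

# Explore further in pandas, then present the chart alongside written commentary.
df["revenue_mom_growth"] = df["revenue"].pct_change()
df.plot(x="month", y="revenue", title="Monthly revenue")
plt.show()
```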

Reporting

This is the layer with the highest impact in democratising data among business users. Here modelled data is turned into charts and dashboards which make information easy to digest and investigate. One of the key purposes of reporting is to automate repeated data requests so that your data team can focus on finding deeper insights and generating more business value.

Dashboards are a great tool for everyday monitoring of key stats that will guide your business. The main tools that we recommend are Metabase, Looker (not Looker Studio), Tableau, and Power BI.

Data Science & AI

These are more advanced analytics with a wide range of use cases, e.g. predictions, dynamic user segmentation, matching algorithms, text analytics. Ideally, this layer should sit on top of already modelled data to leverage the same pipeline for cleaned and enriched data.

The key thing to note here is to capture the results of your data science and AI models within your data warehouse as much as possible so they can be combined with other data points and reused by other parts of the business.
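
As a hedged sketch of what that looks like in practice, the snippet below appends model scores to a dedicated warehouse table so dashboards and other models can join on them. It assumes a BigQuery warehouse; the table name, columns and dummy rows are purely illustrative.

```python
# Sketch: land model outputs back in the warehouse so the rest of the business can reuse them.
# Assumes BigQuery; table name, columns and the two dummy rows are illustrative only.
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()

predictions = pd.DataFrame({
    "customer_id": ["c_001", "c_002"],      # dummy ids for illustration
    "predicted_ltv": [120.5, 87.0],         # dummy scores for illustration
    "scored_at": pd.Timestamp.now(tz="UTC"),
})

# Append the scores; downstream models and dashboards can join on customer_id.
job = client.load_table_from_dataframe(predictions, "analytics.customer_ltv_predictions")
job.result()  # wait for the load job to finish
```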

Data Activation

So far, we have discussed the process of extracting, loading and transforming (ELT) data from source systems and a number of use cases within the data pipeline. To come full circle, the last step is sending modelled insights back into the source systems. This process is also referred to as reverse ETL. It enables your go-to-market (GTM) teams to leverage intelligence at scale for a wide range of purposes, e.g. targeting high-LTV audiences, personalised CRM strategies and automated sales workflows. There are three tools we currently recommend for this part of the data stack: Census, DinMo and Hightouch.
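
Those tools productise the pattern below. Purely to illustrate the idea, this sketch reads a modelled audience from a BigQuery warehouse and pushes it to a CRM over HTTP; the endpoint and payload are hypothetical placeholders, not a real API.

```python
# Reverse ETL idea in miniature: read a modelled segment, push it back to a source system.
# The CRM endpoint and payload below are hypothetical placeholders.
import requests
from google.cloud import bigquery

client = bigquery.Client()
audience = client.query("""
    SELECT customer_id, email
    FROM analytics.customer_ltv_predictions
    WHERE predicted_ltv > 100
""").to_dataframe()

for record in audience.to_dict(orient="records"):
    requests.post(
        "https://crm.example.com/api/contacts",  # placeholder endpoint
        json={"email": record["email"], "segment": "high_ltv"},
        timeout=10,
    )
```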

Orchestration

An orchestrator is a tool that is not present in all data stacks. Many setups use multiple systems that are not directly connected to each other. For example, you can have an EL tool like Fivetran extracting data and loading it into your warehouse, and a modelling tool like dbt scheduled through dbt Cloud, each managing its own execution schedule.

An orchestrator becomes key when you require custom extraction, as it can provide a single pane of glass to trigger, monitor and debug all tasks in your pipeline, while serving as the baseline for custom code to do extraction or data science.
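
As a rough sketch of what that single pane of glass looks like, here is a minimal DAG using Apache Airflow as one example orchestrator (named here for illustration, not as a specific recommendation); the task bodies are placeholders.

```python
# Minimal orchestration sketch using Apache Airflow as an example orchestrator.
# Task bodies are placeholders for custom extraction and modelling steps.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_custom_source():
    ...  # custom extraction code for sources not covered by your EL tool

def run_models():
    ...  # e.g. trigger a dbt run once extraction has finished

with DAG(
    dag_id="daily_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_custom_source)
    model = PythonOperator(task_id="model", python_callable=run_models)
    extract >> model  # modelling only runs after extraction succeeds
```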

Each tool has its own pros and cons that you will only be aware of if you’re using them every day like we are.

Adrian Macias, VP Engineering

Data Stack: Key Considerations

There is no such thing as a ‘standard data stack’ so choosing which tools are best suited to your needs is critical. What should your key considerations be when designing a data stack?

Business Use Cases

Here at 173tech, we firmly believe that the only way you can create value from data is through a clear focus on addressing business problems, questions and blindspots. That applies from the get-go when designing a data stack. Think about your current use cases for data in terms of reporting, analysis, automated workflows and so on, and also what your needs might look like in the future, including embedding AI solutions.

Everyone is excited about the possibilities of AI today; however, most use cases rely on an integrated data pipeline. Start by solving immediate business needs while building iteratively towards an AI future.

Data Volume

Understanding how much data you are likely to process can be challenging when you start. First, consider how many data sources you intend to bring into your stack. Whilst the answer for many businesses may ultimately be ‘all of them’, we always advise starting with your core sources of data and then expanding.

To come up with an estimate, think about a good proxy metric: a value (or values) that gives you an idea of the total number of records your analytics system will have to deal with. What makes a good proxy metric will depend on your type of business: the number of orders is a good predictor for an eCommerce business; the number of registrations and active users works for a social network or freemium model. Think about what your customers are doing on your platform and how many actions (events) those activities will produce to arrive at a good estimate.
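
For example, a back-of-the-envelope estimate for an eCommerce business might look like the sketch below; every input is an assumption to replace with your own numbers.

```python
# Back-of-the-envelope volume estimate; all inputs are hypothetical assumptions.
monthly_orders = 20_000        # proxy metric for an eCommerce business
events_per_order = 15          # page views, add-to-carts, payments, emails, ...
months_of_history = 24         # how far back you plan to load

estimated_event_rows = monthly_orders * events_per_order * months_of_history
print(f"~{estimated_event_rows:,} event records")  # ~7,200,000 records
```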

You should also consider the differences between storage systems when doing your calculations. For example, a backend database will not store the same data as efficiently as a data warehouse, which benefits from better compression as well as other differences in metadata such as indexes. You can sometimes find a good conversion multiplier from one system to another in the tool’s documentation, but comparing against real examples will be more accurate.

Data volumes are only going to grow as you acquire more customers, generate more orders and activities, and integrate more sources of data. Consider the pricing model of the tools you are choosing for your analytics stack: some charge based on the number of records processed, others on the data volume in GB/TB, and you should check how your historical data will count towards the cost.

Cost

Your data volume and usage will continue to grow as your business and data capabilities expand. Cost and scalability are among the top concerns we hear from founders and CTOs. When estimating tool costs, you need the following information ready: current number of records (total, new and updates), projected growth with existing data sources and potential new ones, and expected reporting usage today and in the future.

Each tool will have its own pricing structure, e.g. subscription tiers containing different features, or separate compute and storage costs. It is useful to create a comparison table for each component of your data stack with the desired features, estimated cost today and how that cost is expected to increase as you scale.
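
One simple way to keep that comparison honest is to template it in code and fill in the numbers from each vendor’s own pricing page; the rows below are illustrative examples and the costs are deliberately left blank.

```python
# Template for a tool comparison table; fill costs in from each vendor's pricing page.
# Tools and features shown are illustrative examples, not a fixed recommendation.
import pandas as pd

comparison = pd.DataFrame([
    {"component": "warehouse",  "tool": "BigQuery", "required_features": "pay-per-query, EU region",
     "est_cost_today": None, "est_cost_at_2x_volume": None},
    {"component": "extraction", "tool": "Fivetran", "required_features": "Salesforce + GA4 connectors",
     "est_cost_today": None, "est_cost_at_2x_volume": None},
    {"component": "reporting",  "tool": "Metabase", "required_features": "row-level permissions",
     "est_cost_today": None, "est_cost_at_2x_volume": None},
])
print(comparison.to_string(index=False))
```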

Another aspect of cost is engineering resources when it comes to development and maintenance. For example, a self-hosted option will save you on tooling cost but may require more time from more senior team members.

Having a clear data strategy and business use cases upfront will also help prioritise what data to extract and reduce your overall cost.

Existing Tech Ecosystem

Your data infrastructure is an extension of your existing tech setup. It extracts data from various sources, including your operational database, third-party tools (e.g. Salesforce, NetSuite, GA4) and marketing channels. From this point onwards, it uses a set of data tools to perform modelling, reporting, analysis, data science and activation tasks. Even though your data pipeline is relatively self-contained, it lives within your wider tech ecosystem, so you need to consider the broader setup.

A few key factors here are your existing cloud platform, the skillset within your team and the potential benefits of staying with the same service provider, for example better compatibility and simplified billing if you stay within the Microsoft or Amazon suite of products.

Security & Data Privacy

You might also have specific requirements for compliance and data privacy. Most tools will provide details on how they meet regulatory standards such as GDPR, HIPAA or CCPA. Some tools offer an open-source, self-hosted option, giving you the flexibility to apply your own security standards.

If you are designing a data solution to be used by multiple clients (e.g. a reporting feature as part of a SaaS product), you may need to consider multi-tenancy and related permission requirements.


Avoiding BI Debt

At 173tech, we are passionate about unlocking the true value of data. However, one of the biggest challenges in achieving this is that BI infrastructure often has an indirect impact on business users, making it difficult to justify investment in improvements.

To maximise efficiency, businesses must minimise BI (business intelligence) debt by first building a solid data foundation. Poor early choices can quickly build up this debt, especially when growing rapidly, amplifying the negative consequences. To help you navigate this challenge, we’ve compiled 7 essential tips for scaling analytics efficiently while keeping BI debt to a minimum.

1. Define Clear Goals & Establish a Focused Data Strategy

It may seem obvious, but defining clear objectives for analytics is often overlooked. The primary goal of analytics is to enhance business understanding and identify optimisation opportunities. However, not all insights yield significant value. For example, uncovering a 10x ROI opportunity in a segment that constitutes only 0.5% of your user base may not be worth the effort.

Before initiating analytics projects, pinpoint the core business challenges—whether they relate to acquisition, retention, or another aspect. Align your data collection strategy with these key challenges, ensuring you gather only the necessary data instead of trying to collect everything from the outset.

2. Implement A Data Warehouse

For a lot of companies, the analytics journey starts with the reporting available inside tools like Shopify or Google Analytics. Gaps soon emerge in their knowledge, so they aim to plug them with a range of tools. Over time they end up with one tool for product analytics, one for customer analytics and one for sales data, all of which overlap in some way and give a distorted view of the customer journey. Most of these tools are designed to hook you in when your usage is small, but the cost quickly grows. Take Lifetimely, a popular tool used by eCommerce companies to understand the lifetime value of their customers: at just 3,000 monthly orders it costs $149, but that roughly doubles to $299 at 7,000 orders, which may cost more than an entire data stack.

To create one source of truth, you need one centralised place to store and model your data, and this should be a data warehouse (e.g. Redshift, Snowflake, BigQuery, Databricks). These databases are optimised for handling large-scale calculations efficiently. While operational databases (e.g. PostgreSQL, MySQL) may suffice initially, they will struggle to scale, leading to longer query times and wasted analyst resources. Whilst having an additional space just for analytics may seem like an added cost, as your company grows it will actually save you money.

3. Automate As Much As Possible

Ad-hoc data pulls can be a major drain on time and resources. Every time an analyst manually extracts, cleans, and processes data to answer one-off business questions, it diverts focus from more strategic work. Instead of constantly responding to unique requests, businesses should aim to automate as many recurring questions as possible. Building reusable data models and dashboards ensures that frequently asked questions, such as sales performance, customer retention, and marketing ROI, can be answered instantly without requiring fresh data pulls. By reducing the need for manual intervention, analytics teams can focus on higher-value insights, and business users gain faster access to the information they need. Tools like Fivetran and Stitch can automate data extraction and loading without requiring extensive coding. dbt is our only recommendation when it comes to data modelling.

4. Data Modelling To Gold Standard

Data modelling is a direct cost-saving measure, not just a technical best practice. By structuring data efficiently within your warehouse and automating transformations, it eliminates redundant processing and reduces query run times, both of which can drive up cloud computing costs. Without it, analysts often rewrite similar queries with minor inconsistencies, leading to duplication, inefficiencies, and an increased risk of errors that can result in costly decision-making mistakes. Poorly modelled data also inflates storage expenses, as duplicate or unstructured datasets pile up unnecessarily. By ensuring a single source of truth, improving SQL efficiency, and maintaining data integrity, data modelling directly cuts down on wasted compute power, storage overhead, and manual troubleshooting, leading to tangible cost reductions across the business.

5. Be Wary Of Overfitting Models

Overfitting occurs when a model learns the specific patterns of the training data too well, leading to poor generalisation on new, unseen data. For example, a simple model for predicting weight based on height would use a linear regression (e.g., Weight = β₀ + β₁ * Height), while an overfitted model might include multiple variables and complex interactions, fitting the training data too closely and risking poor generalisation to new data. This results in the need for continuous retraining and fine-tuning, which can be resource-intensive and costly. By focusing on simpler models that are less prone to overfitting and generalise better to different datasets, teams can reduce the time and computational power needed for training. Simpler models often require fewer resources, leading to lower costs and faster iterations. Additionally, these models are more likely to maintain stable performance across a variety of data, reducing the need for frequent adjustments and ensuring a more cost-effective and efficient development process.
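
To make the height-and-weight example concrete, the sketch below fits a simple linear model and a deliberately over-complex polynomial to the same synthetic sample, then compares their predictions outside the training range; the data is generated purely for illustration.

```python
# Simple vs. overfitted model on the same synthetic height/weight sample.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
height = rng.uniform(150, 200, size=30).reshape(-1, 1)          # cm
weight = 0.9 * height.ravel() - 90 + rng.normal(0, 5, size=30)  # kg, synthetic

simple = LinearRegression().fit(height, weight)  # Weight = b0 + b1 * Height
overfit = make_pipeline(PolynomialFeatures(degree=12), LinearRegression()).fit(height, weight)

new_height = np.array([[210.0]])  # outside the training range
print("linear prediction:   ", simple.predict(new_height))   # stays plausible
print("degree-12 prediction:", overfit.predict(new_height))  # can swing wildly
```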

6. Peer Review Your Work

Peer reviews play a crucial role in cost reduction by improving the quality of output and reducing the likelihood of errors, which can be costly to fix later. When a second pair of eyes reviews a report or analysis, it helps identify potential issues early on, saving both time and money that might otherwise be spent correcting mistakes. Alongside this, you should adopt version control, which allows teams to track changes and prevent costly mistakes and rework by providing the ability to quickly revert to previous stable versions if errors occur.

Additionally, by ensuring the accuracy and credibility of the work, peer reviews enhance the trustworthiness of the analytics team within the organisation. It is important that, alongside peer reviews of the code, all analytics outputs (models, charts etc.) are verified against their source of truth and sense-checked with stakeholders to ensure the numbers look right. This can lead to more efficient collaboration and decision-making, reducing delays and the costs associated with poor-quality work or rework. Ultimately, peer reviews help streamline workflows, prevent costly errors, and contribute to a more cost-effective operation.

7. Get The Right Help

Using an agency like 173tech can help reduce costs by providing specialised expertise and resources without the need to hire full-time staff. Agencies can quickly scale up or down based on project needs, allowing companies to avoid the overhead of maintaining a large in-house team. They bring efficiency by leveraging proven tools and methodologies, reducing the learning curve and time spent on experimentation. Additionally, agencies often work on multiple projects across various industries, enabling them to offer cost-effective solutions and insights gained from diverse experiences, ultimately helping businesses avoid costly mistakes and optimise their analytics efforts.

Oliver Gwynne, Data Strategist

Data is an upfront cost with an unlimited return on investment. Once data is set up and modelled, it will work forever!

Cost Effective Data Stack

One of the most common questions we are asked at 173tech is: what is the cheapest stack to run? Traditionally, one of the biggest blockers was the cost and setup of infrastructure. Luckily, thanks to cloud and open-source providers, this cost has been significantly reduced, to the point where we have been able to set up a complete data stack for only a few hundred dollars a year.

Below we will give you a real-life example of a cost-effective data stack we have built, the tools we used, and the general costs involved. (Obviously, prices are always prone to change, so be sure to double-check before purchasing.) 

Key Components

Data Warehouse: Google BigQuery

ELT: Fivetran

Modelling: dbt

Deployment: Google Cloud Build / Run / Scheduler

CI/CD: GitHub Actions

Visualisation: Metabase

The first time we built this stack, we were amazed that such great tooling could come at such a low monthly cost. Since then, we have implemented it for several clients. The cost-effectiveness of this stack applies mostly to low-data-level ecosystems (typically in the tens of gigabytes). It works extremely well for e-commerce and B2B industries, which are by nature less data-hungry digital businesses, or early-stage businesses that generate less data.

Here are a few benefits of this stack:

Minimal vendor lock-in, as the stack mostly uses open-source solutions.

Full control over business logic and reporting on metrics.

Serverless approach, making deployment more secure and resource-efficient.

Extremely cost-efficient.

Controlling Costs By Design

While it is tempting to collect and store as much data as possible, this can quickly lead to unnecessary costs. To keep expenses low, it is crucial to pick and choose which data is truly necessary for analysis and decision-making. Every query in BigQuery incurs a cost based on the amount of data scanned, meaning that inefficient queries or redundant data storage can significantly inflate your monthly bill. By carefully defining key performance indicators (KPIs) and only storing and querying the data needed for business insights, companies can reduce costs without sacrificing the value of their analytics. Using partitioned and clustered tables, as well as materialised views, can further optimise query performance and reduce spending. Controlling data costs requires solid strategy and understanding of which metrics come from which sources.
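
As a hedged example of that “pay for what you scan” discipline, the sketch below creates a date-partitioned, clustered events table in BigQuery and queries only the last 28 days; the dataset, table and column names are hypothetical.

```python
# Partitioned + clustered table so queries only scan the partitions they actually need.
# Dataset, table and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

client.query("""
CREATE TABLE IF NOT EXISTS analytics.events (
    event_id   STRING,
    user_id    STRING,
    event_name STRING,
    event_date DATE
)
PARTITION BY event_date
CLUSTER BY user_id, event_name
""").result()

# Always filter on the partition column to limit the bytes scanned (and therefore the bill).
rows = client.query("""
SELECT event_name, COUNT(*) AS events
FROM analytics.events
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 28 DAY)
GROUP BY event_name
""").result()
for row in rows:
    print(row.event_name, row.events)
```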

The Data Warehouse

The data warehouse is where we centralise all our data. For this, we recommend using Google BigQuery, as it is priced based on the number of terabytes being stored, written (when loading data), and read (when running queries) on a monthly basis. At the time of writing, the cost of reading a terabyte of data is $5, with the first terabyte being free. The storage costs are $20 per terabyte, with the first 10 gigabytes being free. That means if your data estate is small in volume, you can take advantage of this pricing structure, making your warehousing solution extremely cost-effective! We recommend that where BigQuery is the warehouse of choice, companies use Google Cloud Platform (GCP) as the overall ecosystem for analytics deployment.
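
Using the on-demand figures quoted above, a rough monthly estimate is easy to sanity-check in a few lines; always confirm against current GCP pricing before relying on it.

```python
# Rough BigQuery on-demand cost estimate using the figures quoted above:
# $5 per TB read after the first free TB, $20 per TB stored after the first free 10 GB.
def bigquery_monthly_cost(tb_read: float, tb_stored: float) -> float:
    read_cost = max(tb_read - 1.0, 0) * 5
    storage_cost = max(tb_stored - 0.01, 0) * 20
    return read_cost + storage_cost

# A small estate: 800 GB read and 50 GB stored per month stays under a dollar.
print(bigquery_monthly_cost(tb_read=0.8, tb_stored=0.05))  # 0.8 (storage only; reads are free)
```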

Warehouses like Snowflake and Redshift often start off very cheap (or even free) but as soon as they have hooked you on that lower tier, costs can quickly balloon.

Extract, Load, Transform (ELT)

Although Python scripts are free to write and run, they require ongoing maintenance. Over time, as data pipelines grow more complex, ensuring these scripts run correctly and efficiently becomes an overhead. Bugs need fixing, dependencies require updates, and small changes to the business model might necessitate script rewrites. This is why we recommend a managed extraction tool such as Fivetran. Fivetran’s team handles the upkeep and maintenance and ensures that your connectors are of high quality. The cost depends on how many records are pulled and updated.

Modelling

For data modelling, we recommend dbt which is a free tool. The way in which data is modelled is key to ensuring efficient and cost-effective analytics. Well-structured, performant models reduce query costs and improve overall system responsiveness. dbt enables batch processing and incremental transformations, meaning only new or changed data is processed rather than reprocessing entire datasets. This approach minimises compute costs, optimises performance, and ensures that transformations scale efficiently as data volume grows. Additionally, dbt encourages modular, reusable models, improving maintainability and collaboration across data teams.
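
dbt handles this pattern for you with incremental models. Purely to illustrate the idea, the sketch below expresses the same logic as a plain BigQuery statement that appends only the rows newer than what has already been loaded; the table names are hypothetical.

```python
# The incremental idea dbt automates, sketched as a plain BigQuery statement:
# only rows newer than the latest already-loaded order_date are appended,
# instead of rebuilding the whole table on every run. Table names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

client.query("""
INSERT INTO analytics.fct_orders (order_id, customer_id, order_date, total_amount)
SELECT o.order_id, o.customer_id, o.order_date, o.total_amount
FROM raw.orders AS o
WHERE o.order_date > (
    SELECT IFNULL(MAX(order_date), DATE '1970-01-01') FROM analytics.fct_orders
)
""").result()
```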

Another way to minimise costs is to implement data transformations directly within the data warehouse using SQL-based approaches, such as materialised views or pre-aggregated tables. This reduces the need for constant script monitoring and debugging while also improving performance by leveraging the efficiency of the database engine.

Deployment

For deployment, we use a serverless approach, which means that resources to run the code are only active while the code is executing. Instead of paying for a full-time machine, you only pay for a few minutes of execution time daily. This also enhances security by reducing potential attack surfaces. For this part of the stack, we use Google Cloud Build, Google Cloud Run, and Google Cloud Scheduler—together, they run for just a few pennies a month since the overall data processing time is short. With these tools, we effectively build a Docker image and run containers daily. You can automate the Cloud Build process for Continuous Delivery (CD) through the Google Cloud Run interface with a few clicks. 
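
For a sense of how little glue code this needs, here is a minimal sketch of a Cloud Run entrypoint that Cloud Scheduler can call over HTTP to kick off the daily run; it assumes dbt is installed in the container image, and the route name is arbitrary.

```python
# Minimal Cloud Run entrypoint that Cloud Scheduler can hit on a daily schedule.
# Assumes dbt is installed in the container image; the route name is arbitrary.
import subprocess

from flask import Flask

app = Flask(__name__)

@app.route("/run-pipeline", methods=["POST"])
def run_pipeline():
    # Run the modelling step; an extraction script could be triggered the same way.
    result = subprocess.run(["dbt", "run"], capture_output=True, text=True)
    status = 200 if result.returncode == 0 else 500
    return result.stdout[-1000:], status

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)  # Cloud Run sends traffic to port 8080 by default
```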

Data Testing and Continuous Ingestion

GitHub Actions provides a generous amount of free action time for CI/CD processes, meaning that your CI/CD can cost you nothing monthly. GitHub Actions can replace Cloud Build for CD to centralise all processes, though this is unnecessary if you want to keep your infrastructure simple.

Visualisation

Whilst there are many free visualisation tools such as Looker Studio, we find that for a little more money you can get a lot more value. Visualisation is often the most visible part of the data pipeline for business users, and data trust can be severely impacted by dashboards that freeze, lag or load slowly. If you want to keep a low monthly bill, you can use Metabase, an attractive open-source reporting solution. If you deploy it yourself, you will only pay for the resources your Metabase instance consumes (likely around $25–$50 per month initially). The hosted version is only $85 a month at the time of writing, and we believe this provides great value for money.

Data Engineering and Setup

The most expensive part of a data stack? The people! In 2024 the average salary for even a junior data analyst in the UK was £30k, and setting up and optimising your data stack typically requires experienced analysts and engineers. That is why working with 173tech provides such a cost-effective alternative. We help growing businesses leverage data quickly and cost-effectively, focusing on long-term, scalable systems.
