5 Key Considerations When Designing A Data Stack
Introduction
A data stack is the collection of tools you use to organise, transform, visualise and analyse data. There is no such thing as a ‘standard data stack’, so choosing the tools best suited to your needs is critical. What should your key considerations be when designing one?
Business Use Cases
Here at 173tech, we firmly believe that the only way to create value from data is a clear focus on addressing business problems, questions and blind spots. That applies from the get-go when designing a data stack. Think about your current use cases for data in terms of reporting, analysis, automated workflows and so on, as well as what your needs might look like in the future, including embedding AI solutions.
Everyone is excited about the possibilities of AI today. However, in most use cases, an AI solution will rely on an integrated data pipeline underneath it. Start by solving immediate business needs, while building iteratively towards an AI future.
Data Volume
Understanding how much data you are likely to process can be challenging when you start. First, consider how many data sources you intend to bring into your stack. Whilst the answer for many businesses may ultimately be ‘all of them’, we always advise starting with your core sources of data and expanding from there.
To come up with an estimate, think about a good proxy metric: a value (or values) that gives you an idea of the total number of records your analytics system will have to deal with. The right proxy metric depends on your type of business: the number of orders is a good predictor for an eCommerce business; the number of registrations and active users for a social network or freemium model. Think about what your customers are doing on your platform and how many actions (events) those activities produce to arrive at a good estimate.
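As a rough illustration, the back-of-envelope calculation below turns a proxy metric into a monthly record estimate. Every figure is a hypothetical assumption, not a benchmark.

```python
# Back-of-envelope estimate of monthly analytics records from a proxy
# metric. All numbers here are hypothetical assumptions for illustration.

monthly_orders = 50_000        # proxy metric for an eCommerce business
events_per_order = 25          # page views, cart updates, payment steps...
other_records_per_order = 10   # emails, refunds, support tickets...

monthly_records = monthly_orders * (events_per_order + other_records_per_order)
print(f"~{monthly_records:,} records/month")  # ~1,750,000 records/month
```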
You should also consider the differences between storage systems when doing your calculations. For example, a backend database will not store the same data as efficiently as a data warehouse, which typically benefits from more efficient compression as well as other differences in metadata such as indexes. You can sometimes find a good conversion multiplier from one system to another in a tool’s documentation, but measuring against multiple real examples will be more accurate.
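Applied in practice, such a conversion might look like the following sketch; the multiplier is an assumed figure, so check your tools’ documentation or measure on a sample of your own data.

```python
# Hypothetical conversion from backend database size to warehouse size.
# The 0.2 multiplier (~5x compression) is an assumed figure, not a benchmark.

backend_db_gb = 500
warehouse_multiplier = 0.2

warehouse_gb = backend_db_gb * warehouse_multiplier
print(f"~{warehouse_gb:.0f} GB once loaded into the warehouse")  # ~100 GB
```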
Data volumes are only going to grow as you acquire more customers, generate more orders and activity, and integrate more sources of data. Consider the pricing model of each tool you are choosing for your analytics stack: some charge based on the number of records processed, others by data volume in GB/TB, and they differ in how your historical data counts towards the cost.
Cost
Your data volume and usage will continue to grow as your business and data capabilities expand, and cost and scalability are among the top concerns we hear from founders and CTOs. When estimating tool costs, have the following information ready: the current number of records (total, new and updated), projected growth from existing data sources and potential new ones, and expected reporting usage both today and in the future.
Each tool will have its own pricing structure, e.g. subscription tiers with different features, or separate compute and storage costs. It is useful to create a comparison table for each component of your data stack listing the features you need, the estimated cost today and how that cost is expected to increase as you scale.
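As a starting point for such a comparison, the sketch below projects annual cost under two hypothetical pricing models, one charging per record processed and one per GB stored. The rates and growth figures are invented purely for illustration.

```python
# Compare two hypothetical pricing models as data volume grows.
# All rates and growth assumptions are made up for illustration.

monthly_records = 1_750_000
monthly_gb = 100
annual_growth = 1.5                 # assume 50% growth per year

price_per_million_records = 15.0    # tool A: charges per record processed
price_per_gb = 2.5                  # tool B: charges per GB stored

for year in range(1, 4):
    cost_a = (monthly_records / 1e6) * price_per_million_records * 12
    cost_b = monthly_gb * price_per_gb * 12
    print(f"Year {year}: tool A ~${cost_a:,.0f}/yr vs tool B ~${cost_b:,.0f}/yr")
    monthly_records *= annual_growth
    monthly_gb *= annual_growth
```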
Another aspect of cost is the engineering resource required for development and maintenance. For example, a self-hosted option will save on tooling costs but may require more time from more senior team members.
Having a clear data strategy and business use cases upfront will also help prioritise what data to extract and reduce your overall cost.
Existing Tech Ecosystem
Your data infrastructure is an extension of your existing tech setup. It extracts data from various sources, including your operational database, third-party tools (e.g. Salesforce, NetSuite, GA4) and marketing channels, and from there a set of data tools performs modelling, reporting, analysis, data science and activation tasks. Even though your data pipeline is relatively self-contained, it lives within your wider tech ecosystem, and so you need to consider the broader setup.
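To make those layers concrete, here is a minimal outline of a typical stack; the categories are illustrative rather than tool recommendations.

```python
# Illustrative outline of a data stack within a wider tech ecosystem.
# Categories and names are examples only, not tool recommendations.

data_stack = {
    "sources": ["operational database", "Salesforce", "NetSuite", "GA4",
                "marketing channels"],
    "extract_and_load": "ELT / ingestion tool",
    "storage": "cloud data warehouse",
    "transform": "SQL modelling layer",
    "outputs": ["reporting", "analysis", "data science", "activation"],
}

for layer, components in data_stack.items():
    print(f"{layer}: {components}")
```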
A few key factors here are your existing cloud platform, the skillset within your team, and the potential benefits of staying with the same service provider, for example better compatibility and simplified billing if you stay within the Microsoft or Amazon suite of products.
Security & Data Privacy
You might also have specific requirements for compliance and data privacy. Most tools will provide details on how they meet regulatory standards such as GDPR, HIPAA or CCPA. Some tools offer an open source and self-hosted option so you have the flexibility of applying your own security standards.
If you are designing a data solution to be used by multiple clients (e.g. a reporting feature as part of a SaaS product), you may need to consider multi-tenancy and related permission requirements.
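As a simple illustration of multi-tenant data isolation at the query level, the sketch below scopes every read to a tenant_id column. The table and column names are hypothetical, and a production setup would more likely lean on your warehouse’s built-in row-level security features.

```python
# Minimal sketch of row-level multi-tenancy: every query issued on behalf
# of a client is filtered by that client's tenant_id. Table and column
# names are hypothetical.
import sqlite3

def fetch_orders_for_tenant(conn: sqlite3.Connection, tenant_id: str) -> list:
    # The parameterised filter ensures one tenant never sees another's rows.
    cursor = conn.execute(
        "SELECT order_id, amount, created_at FROM orders WHERE tenant_id = ?",
        (tenant_id,),
    )
    return cursor.fetchall()
```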
Conclusion
Above we have highlighted a number of key considerations as you start putting together your modern data stack. As you dig deeper into each factor and tool component, there is additional complexity, which we will address in future posts. Subscribe and stay tuned!
If you are looking to build or migrate to a modern data infrastructure and are not sure where to start, why not get in touch with the 173tech team for some impartial advice?