Best Practices:
Data Extraction
You have data. But it is stuck inside siloed sources such as GA4, third-party tools and databases. So how do you get it out?
Data extraction is the process of retrieving data from multiple sources and consolidating it into a single destination. This method allows organisations to aggregate and analyse data from different origins, providing a comprehensive view and aiding in better decision-making.
Getting The Most Out Of Data Sources
The starting point for most organisations is to understand the different data sources you use today, where different metrics live and which ones are the most important…
What Is A Data Source?
A data source is anything that produces digital information. This could be a file, a program, a website and so on. Every organisation uses multiple data sources every day, and often has to combine metrics from several of them to get the answers it needs.

Customer Journey Mapping
A great way to identify all of your data sources is to first map out your customer journey and touchpoints. At the acquisition stage, prospective customers might see your brand on social media, through online advertising, search engine results or blog posts. During conversion, they may visit your website, check review sites or sign up for a free trial. For eCommerce, customers may add products to watch lists and carts. Once they have subscribed or purchased, their activities and profile details will be stored within a CRM or ERP system or your backend database. They may contact support for various reasons. And lastly, if they love your brand, they will spread the word via social media and review sites or refer your products to friends and family.
By mapping out your customer journey, you will establish:
A list of key data sources
Priority of data sources based on usage and the impact across the customer journey
Which data sources are relevant to different business functions
Key business questions along the customer journey
Any current data gaps
Where Are Your Existing Reports?
In addition to the different customer touchpoints, consider how reporting is done today. You may have various spreadsheets, reports downloaded from different tools and so on. These give you a good idea of the key reports and metrics used by different teams, their importance, and any duplications, inconsistencies or discrepancies.
How Do You Determine Which Metrics?
Creating a list of key metrics and identifying their sources requires a strategic approach. All of your data efforts will need to align tightly with your business goals, so a clear understanding of your short- and medium-term goals is essential. No doubt each area of your business will have multiple KPIs and metrics that need to be tracked, but as a starting point it may be better to take a step back, consider the 10 most important things you need to know to run your business and start with those.
It can often be helpful to consider your current reports in the context of business decisions. What actions can your team take? What data can help inform those actions? This thought exercise should help you focus on the metrics which really matter and less on the “nice to have” information.
For each metric, you should define its business meaning, how it is calculated, and where the source data comes from. For example, when do you consider someone to have converted into a customer? Is it when they created an account, or once they purchased an item? We recommend creating a data dictionary for this exercise, which will be used as a blueprint for your data modelling and implementation later.
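As an illustration, a data dictionary entry can be as simple as a structured record per metric. The sketch below uses Python purely for readability; the metric name, calculation rule and source table are made up.

    # Illustrative data dictionary entry; metric, rule and source names are hypothetical.
    data_dictionary = {
        "new_customers": {
            "definition": "Users who completed their first purchase",
            "calculation": "COUNT(DISTINCT user_id) on their first order date",
            "source": "backend database, orders table",
            "owner": "Growth team",
            "granularity": "daily",
        }
    }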
How To Prioritise Your Data Sources?
Having mapped out your customer journey, existing reporting and your KPI list, you should have a solid understanding of which data sources are the most important.
Other factors you should consider when prioritising and integrating data sources:
Accessibility. Some data sources may be easier than others to integrate. Check if it allows data transfer directly into your data warehouse (e.g., GA4 to BigQuery) or if it is available via a data extraction tool such as Fivetran or Stitch.
Timeliness. You may need data visibility more frequently than what your current reporting provides.
Manual intervention. Your current processes may require many manual steps which are prone to error and affect data accuracy, hence a need to prioritise their automation.
Gut-feeling vs. data-driven. You may want to change how certain data is defined so that it follows a consistent, data-driven approach. For example, “hot” vs. “cold” leads may currently be labelled based on the gut feeling of each salesperson, when you would rather base the classification on past sales activity (see the sketch below). These types of insights are much easier to calculate once you integrate the data source.
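To make the lead example concrete, here is a minimal sketch in Python of replacing gut feeling with a rule derived from past sales activity. The column names and thresholds are hypothetical; real ones would come from your own CRM data.

    import pandas as pd

    # Hypothetical lead data; in practice this would come from your integrated CRM source.
    leads = pd.DataFrame({
        "lead_id": [101, 102, 103],
        "demo_requested": [True, False, True],
        "emails_opened_30d": [8, 1, 3],
    })

    def classify(row):
        # Simple illustrative rule; real thresholds would be derived from past sales outcomes.
        score = row["emails_opened_30d"] + (5 if row["demo_requested"] else 0)
        return "hot" if score >= 6 else "cold"

    leads["lead_status"] = leads.apply(classify, axis=1)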

Oscar Borden, Data Engineer
Companies are often guilty of extracting too much information that is ‘nice to have’ and not focusing on the ‘need to have.’
Data Extraction: Key Considerations
As covered above, data extraction is the process of retrieving data from multiple sources and consolidating it into a single destination. Below are the key considerations when putting it into practice.
Avoiding Error-Prone and Non-Scalable Solutions
One of the most common yet inefficient methods of data extraction is manual entry. This often involves downloading data and copying it into spreadsheets or other tools. Manual entry is, of course, prone to human error: a study in the Journal of Accountancy found that manual data entry can result in error rates of between 1% and 5%. Furthermore, it introduces latency in data availability, as human processing times cannot match automated systems.
The other common method of extracting and combining data is Excel macros. Automation scripts or macros in Excel can fail or produce incorrect results due to unexpected data formats or changes in data structure, and they are not designed to handle very large datasets efficiently. Excel can also create a data silo, where data is stored in isolated files that are not easily accessible or integrable with other systems.
A more effective approach is to centralise your data into a data warehouse/lakehouse, ensuring that all data is accessible from a single point and is easier to manage and analyse.
How to Automate Data Extraction
To automate data extraction, there are two common approaches.
The first is to utilise EL (Extract and Load) tools such as Fivetran, Airbyte and Stitch, which offer pre-built connectors with intuitive point-and-click interfaces, eliminating the need for complex coding. These tools simplify the process of integrating various data sources, making data management more efficient and less error-prone.
Alternatively, custom Python scripts can be employed for data extraction. This approach is beneficial when dealing with uncommon data sources not covered by existing tools, or when the cost of EL tools is prohibitive. Custom scripts, however, require more resources for maintenance and development over the long term.
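As a rough sketch of the custom-script route, the snippet below pulls records from a hypothetical REST endpoint and lands them as a CSV file ready to load into a warehouse. The URL, authentication and field names are placeholders, not a real API.

    import csv
    import requests

    API_URL = "https://api.example.com/v1/orders"  # placeholder endpoint
    API_KEY = "..."                                # keep real secrets outside the script

    response = requests.get(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    response.raise_for_status()
    records = response.json()  # assumes the endpoint returns a JSON list of flat records

    # Land the raw data as a CSV, ready to be loaded into the warehouse/lakehouse.
    with open("orders_extract.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)

In production you would also need pagination, retries, logging and scheduling, which is exactly the overhead that managed EL tools absorb for you.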
Frequency and Methods of Data Extraction
Data extraction tools allow users to select the frequency of data retrieval, offering three primary methods:
Full Extraction: This method involves pulling all available data from the original source on every execution of the extraction process. You are more likely to do a full extraction when populating your data warehouse/lakehouse for the first time. There are cost implications associated with extracting data (as a general rule of thumb, the more data you pull the higher the cost) and so it is better not to rely on a full extraction every time.
Incremental Batch Extraction: With this method only the data that has changed since the previous execution is pulled. It improves efficiency and reduces strain on resources, allowing for faster processing of smaller batches, saving time and cost. Every data source has different capabilities for identifying changes and being selective in the data pull, but you would normally accomplish an incremental extraction by tracking the latest timestamp of the data you already have in the warehouse (see the sketch after this list).
Incremental Stream Extraction: This method focuses on continuously capturing and processing data changes as they occur, rather than processing data in predefined chunks or batches. It is particularly useful for applications that require real-time or near-real-time data updates. The downsides are that it can be considerably more costly, requires more oversight and, because more data is continuously processed, also carries a higher environmental cost.
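To make the incremental batch pattern concrete, here is a hedged sketch of the high-watermark approach. The warehouse lookup, endpoint and parameter names are assumptions for illustration only.

    import requests

    def get_last_loaded_timestamp():
        # Placeholder for a real warehouse query, e.g. SELECT MAX(updated_at) FROM raw.orders
        return "2024-01-01T00:00:00Z"

    watermark = get_last_loaded_timestamp()

    response = requests.get(
        "https://api.example.com/v1/orders",   # hypothetical endpoint
        params={"updated_since": watermark},   # only rows changed since the last run
        headers={"Authorization": "Bearer ..."},
        timeout=30,
    )
    response.raise_for_status()
    changed_rows = response.json()

    # changed_rows would then be appended or merged (upserted) into the warehouse table.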
Cost Considerations
It is important to understand that higher data volumes and extraction frequency come with increased costs, including downstream storage expenses. Therefore, it is wise not to extract everything but only the data points which relate to your KPIs. 173tech always advises creating a data dictionary to help define what these metrics are, what is considered the source of truth and how they are calculated. In this way you can narrow down exactly what data you need to extract and why.
Expert help is only a call away. We are always happy to give advice, offer an impartial opinion and put you on the right track. Book a call with a member of our friendly team today.
How To Choose A Data Extraction Tool
Here are the key factors in choosing a tool and getting the most out of it.
Strategy
The first step when thinking about extracting data is to carefully consider which metrics live in which data sources. For more information on this, be sure to check out our previous article here.
While it may be technically feasible to extract and load all available data from your source, that is rarely advisable. The volume of data you extract directly impacts costs, as many EL tools operate on a consumption-based pricing model: the more rows you sync, the higher the price. And, of course, the more data you sync, the higher the cost to then store and process it.
Therefore, it is essential to discern what data is truly crucial to your business objectives and operational needs. Rather than adopting a blanket approach of copying all available data on the basis that it might become useful later, instead think about how that data is used across your customer journey, which teams might need to access it and the decisions that will be made. In this way you can extract what is most meaningful to your business and streamline the data management process.
Connectors
When it comes to assessing how you will extract your data, you should always first look at the existing connectors that are available in tools such as Fivetran and Stitch. Leveraging pre-built connectors significantly reduces the time and effort required for integration. These connectors are specifically designed to extract data from a wide range of sources, including popular databases, cloud services, and applications.
The second thing to assess is reliability. Existing connectors are often developed and maintained by experienced teams with in-depth knowledge of the underlying data sources and integration challenges. They undergo rigorous testing, quality assurance, enhancements and bug fixes, all of which you would have to take on internally if you instead created a custom extractor.
It can sometimes be difficult, however, to ascertain the reliability of existing connectors. As a general rule of thumb, the more widely your data source is adopted, the more likely it is that development time will have been spent on the connector. So if you need to connect Meta, the solution is quite likely to be more robust than, say, GreenVoice.
You can also evaluate reliability by looking at factors such as transfer speed, throughput and latency. Be sure to look at user reviews for any complaints around downtime or interruptions. Strong documentation is also typically a sign that a connector is frequently updated.
In some cases, tool providers will use a grading system to guide you as to which connectors are higher quality. For example, Airbyte connectors are divided into three grades: Generally Available (GA), Beta and Alpha. GA connectors are well-tested and robust, while Beta and Alpha connectors are at earlier stages of development, and Alpha connectors should not be relied upon in production.
Tools vs Custom Development
Creating a custom data connector is ideal when you have specific, unique data integration requirements that are not met by standard EL tools. This approach allows for tailored data transformation, cleaning, and formatting to suit your unique use case, especially when dealing with proprietary or less common data sources. In industries with stringent compliance and security requirements, custom connectors allow you to implement and enforce specific standards and protocols.
The difficulty with custom connectors is that the development process can be lengthy, especially if the connector needs to interface with complex or poorly documented data sources, which adds cost and complexity. Custom connectors also need regular maintenance to keep functioning correctly: any change in the source system, such as API updates, new data fields or schema changes, requires prompt updates to the connector, which can be time-consuming. This is why 173tech advises that it is always wise to check first whether there are existing connectors you can utilise.
Price
Pricing models vary between tool providers, but typically they are based on your data consumption, for example the number of active rows you sync each month. This is why it is so important to consider which data is valuable to your organisation.
Whilst using the existing connectors provided by tools carries an ongoing cost, it is important to remember that this cost covers all of the updates and maintenance those connectors require. If you create your own connector, that work will need to be handled by your internal resources.

Adrian Macias, VP Engineering
Whilst it may be “free” to create your own extractors, they will require a lot of upkeep and maintenance, meaning solutions like Fivetran may actually be more cost-effective.
Handling PII Data: Best Practices
Handling personally identifiable information (PII) is a big responsibility for any organisation. Ensuring the security and privacy of this data is essential, not only for maintaining trust but also to comply with legal regulations.
What is PII Data?
Personally identifiable information (PII) refers to any data that can be used to uniquely identify an individual. This includes direct identifiers, such as names, phone numbers and email addresses, as well as indirect identifiers that, when combined with other information, can reveal an individual’s identity. For example, a person’s date of birth, zip code and gender might not be identifiable on their own, but together they could pinpoint an individual. PII encompasses a wide range of data types, such as home addresses, IP addresses, and even biometric data like fingerprints. Proper handling and protection of PII are essential to prevent identity theft and ensure privacy.
What Are The Consequences?
While it is hard to quantify the negative impact on brand perception, loss of trust and loss of long-term customers, from a GDPR fining perspective the amount can be up to 20 million euros or, in the case of an undertaking, up to 4% of the company’s total global turnover for the preceding financial year, whichever is higher. There have been a number of high-profile cases in recent history, including a settlement of up to $700 million agreed by Equifax in 2019 over its 2017 breach, a £20 million fine issued to British Airways in 2020 over its 2018 breach, and the Cambridge Analytica scandal, which led to Facebook facing a $5 billion fine from the FTC in 2019.
While these are, of course, big, headline-hitting examples, well over 2,000 fines have been administered for GDPR violations across Europe and the UK, so do not assume it couldn’t happen to your company! So how can you ensure that you do not fall foul of the rules when it comes to PII and your data…
Avoid Extracting PII
The most effective way to mitigate risks associated with PII is to avoid extracting it whenever possible. The primary objective of analytics is to derive insights from data at scale, not at the individual level. By excluding PII from the extraction process, organisations can significantly reduce the risk of unauthorised access or data breaches. Use anonymised or aggregated data to perform analytics tasks, ensuring individual identities are protected. Regularly evaluate whether PII is genuinely required for analytics or if alternative, non-identifiable data can be used instead.
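One practical pattern, sketched below with made-up column names, is to drop direct identifiers and aggregate before the data ever reaches your analytics environment.

    import pandas as pd

    # Hypothetical raw export containing a direct identifier (email).
    raw = pd.DataFrame({
        "email": ["a@example.com", "b@example.com", "c@example.com"],
        "country": ["UK", "DE", "UK"],
        "order_value": [120.0, 80.0, 45.0],
    })

    # Drop the identifier before the data leaves the source system...
    non_pii = raw.drop(columns=["email"])

    # ...and aggregate so that individual customers can no longer be singled out.
    revenue_by_country = non_pii.groupby("country", as_index=False)["order_value"].sum()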
Hashing PII for Analytics
When PII is essential for certain types of analytics, data activation, or AI models, applying hashing algorithms to PII fields is recommended. Hashing is a process that transforms data into a fixed-size string of characters, which is nearly impossible to reverse-engineer, thus protecting the original information.
For example, if your CRM system uses email addresses to identify users, you can hash these emails before storing or processing them. This ensures that even in the event of a data breach, the hashed values are not useful to an attacker. Use robust hashing algorithms such as SHA-256, and add a unique value (salt) to each PII field before hashing to prevent attackers from using precomputed hash tables (rainbow tables) to crack the hashes.
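A minimal sketch of salted hashing with Python’s standard hashlib library is shown below. The salt handling and normalisation are illustrative: in practice the salt would live in a secrets store and be applied consistently so that hashed values can still be matched across systems.

    import hashlib
    import os

    # Illustrative: the salt should come from a secrets manager, not be hard-coded.
    SALT = os.environ.get("PII_SALT", "change-me")

    def hash_pii(value: str) -> str:
        normalised = value.strip().lower()  # e.g. normalise emails before hashing
        return hashlib.sha256((SALT + normalised).encode("utf-8")).hexdigest()

    hashed_email = hash_pii("jane.doe@example.com")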
Isolate and Encrypt PII Data
If storing PII data is unavoidable, isolating and encrypting the data are critical steps to safeguard it. Encrypting sensitive data points ensures that even if unauthorised individuals gain access to the data, they cannot read it without the decryption keys. Use strong encryption methods, such as AES-256, to secure PII data both in transit and at rest. Limit access to decryption keys to a select group of trusted individuals within the organisation. Implement role-based access control (RBAC) to enforce these restrictions. Conduct regular audits to ensure compliance with data protection policies and to identify any potential vulnerabilities.
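As a rough illustration of encryption at rest, the sketch below uses AES-256 in GCM mode via the widely used cryptography Python package. Real deployments would fetch the key from a key management service and restrict who can call the decrypt path.

    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    # In practice the key comes from a key management service with tight access controls (RBAC).
    key = AESGCM.generate_key(bit_length=256)
    aesgcm = AESGCM(key)

    nonce = os.urandom(12)                 # must be unique for every encryption
    plaintext = b"jane.doe@example.com"
    ciphertext = aesgcm.encrypt(nonce, plaintext, None)

    # Store the nonce alongside the ciphertext; only trusted roles should be able to decrypt.
    decrypted = aesgcm.decrypt(nonce, ciphertext, None)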
Training and Awareness
Providing regular training and awareness programmes for employees is essential to ensure they understand the importance of protecting PII, and any considerations that have gone into your current analytics setup.