Reducing Carbon Footprint (& Cost) From Your Data Usage
“I have spent the last two decades helping companies leverage data to drive growth and success. Since the early days of the millennium, the data landscape has changed dramatically, from descriptive analytics to prediction algorithms and now generative AI. The one constant has been our insatiable need to crunch ever more data.” Candice Ren
This growing consumption comes with a hidden cost, not in dollars to individual companies but in its impact on our planet. According to the International Energy Agency, data centres account for 1-1.5% of global electricity demand and were responsible for 1% of energy-related greenhouse gas (GHG) emissions in 2020. Other research estimates that data centres will account for 3.2% of global carbon emissions by 2025.
In this practical guide, we share our top recommendations on how you can reduce your CO2 emissions.
The silver lining: not only do these tips help you make a positive impact on the environment, they will also reduce your overall data infrastructure cost. Win-win!
Move To Cloud
It is like taking public transport. You use resources more efficiently by sharing them with others. Cloud solutions are quick to set up, easy to scale up and down based on your usage needs, and very cost-effective. Three top cloud providers cover around two-thirds of the entire market: Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). We will come back to each of their carbon practices later.
There are benefits to on-premises data centres, e.g. complete control over your infrastructure and customised security measures. However, we would argue that only a small handful of organisations are large enough, or have sufficiently specific needs, to require on-prem solutions.
If you have not yet moved to the cloud, this should be the first area to investigate.
Lean Analytics
Conceptually, analytics has two technical components: storage and compute. Lean analytics aims to achieve both with less: less data and less processing.
What data do you actually need?
One common mistake we see in data strategies is looking first at what data you have and what you could do with it. This approach typically results in an oversized, fragmented and incoherent data estate with limited business value. We see data teams buried under maintenance tasks, vast numbers of data processes and ad-hoc reporting requests, while business users struggle to access meaningful insights on their own.
We recommend always starting with your business goals and questions. What is your North Star, and which key factors contribute to it? The answers will guide your data development, ensure you only bring in data that is valuable for decision-making, and organise KPIs in a way that business users can easily digest.
Do you need it in real time?
The other potential wastage we see is in (near) real-time data processing. There is no doubt certain information is crucial to have in real time, e.g. stock market data. When it comes to analytics, however, the principal question is how frequently you need the insights to make and execute decisions. In most business use cases, a daily ETL process is more than sufficient to surface the latest trends.
ETL is the process of Extracting data from various sources (e.g. sales data from Shopify or your backend systems, CRM data from Iterable or Braze, website and app usage data from Google Analytics 4), Transforming it according to your unique business logic, and Loading it into a centralised repository that serves as your single source of truth (e.g. your data warehouse). You may also come across the term ELT, which loads the raw data into storage first and then transforms it.
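To make the three steps concrete, here is a minimal, illustrative sketch of a daily ETL job. The source rows are hardcoded as a stand-in for an API extract, the business logic is a hypothetical daily-revenue-per-country aggregation, and an in-memory SQLite database stands in for the data warehouse; it is a sketch of the pattern, not a production pipeline.

```python
import sqlite3

def extract():
    # Stand-in for pulling sales data from a source system (e.g. a
    # Shopify API call); hardcoded rows keep the example self-contained.
    return [
        {"order_id": 1, "amount_usd": 120.0, "country": "GB"},
        {"order_id": 2, "amount_usd": 80.0, "country": "US"},
        {"order_id": 3, "amount_usd": 40.0, "country": "GB"},
    ]

def transform(rows):
    # Hypothetical business logic: aggregate daily revenue per country.
    revenue = {}
    for row in rows:
        revenue[row["country"]] = revenue.get(row["country"], 0.0) + row["amount_usd"]
    return [{"country": c, "revenue_usd": r} for c, r in sorted(revenue.items())]

def load(rows, conn):
    # Load into the centralised repository (SQLite standing in for
    # your data warehouse / single source of truth).
    conn.execute("CREATE TABLE IF NOT EXISTS daily_revenue (country TEXT, revenue_usd REAL)")
    conn.executemany("INSERT INTO daily_revenue VALUES (:country, :revenue_usd)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
print(conn.execute("SELECT country, revenue_usd FROM daily_revenue").fetchall())
# → [('GB', 160.0), ('US', 80.0)]
```

Run once a day on a schedule, a job like this replaces an always-on streaming pipeline for most reporting needs.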
Go Serverless
A serverless infrastructure provisions resources only as required, i.e. you do not need to manage a server and keep it running 24/7. It is an on-demand solution: you only consume energy when the service is in use, reducing system idle time and hence wastage. You can also scale up and down as needed, removing unnecessary surplus resources.
GCP and Azure both offer serverless architectures. AWS offers a service called AWS Lambda that lets you run code in response to events and automatically manages the resources required.
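The event-driven model is easy to see in miniature. Below is a minimal Lambda-style handler sketch: the function only runs when an event arrives, so nothing sits idle in between. The event shape is hypothetical; real event payloads depend on the trigger (an S3 upload, an API Gateway request, a schedule, etc.).

```python
def handler(event, context=None):
    # Respond to a hypothetical "refresh requested" event by reporting
    # which table the nightly ETL should rebuild. In AWS, the Lambda
    # runtime invokes this function for you; no server runs between calls.
    table = event.get("table", "unknown")
    return {"status": "ok", "refresh": table}

# Invoked locally for illustration:
print(handler({"table": "daily_revenue"}))
# → {'status': 'ok', 'refresh': 'daily_revenue'}
```

Because you are billed (and energy is consumed) per invocation rather than per hour of uptime, infrequent workloads like a daily ETL map naturally onto this model.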
Lakehouse Data Architecture
A lakehouse is a hybrid between a data lake and data warehouse. It has the scalability and cost-effectiveness of data lakes, plus the benefit of a fully structured model of data warehouses. It does so while still having support for semi-structured and unstructured data. It also has the added benefit of having one system, allowing you to run business intelligence (BI), machine learning (ML) and artificial intelligence (AI) workloads all within the same platform.
From an energy perspective, lakehouses (e.g. Snowflake, BigQuery, Databricks) are serverless and provision resources as needed. If you only use your lakehouse for 1 hour, you consume electricity for that 1 hour.
For example, Snowflake allows you to automatically suspend your warehouse after a period of inactivity, e.g. 5 minutes (tune this number to your usage pattern), and resume provisioning at your next request. A suspended warehouse consumes no credits.
Furthermore, in a lakehouse structure, storage is decoupled from compute. It allows for more flexibility and opportunities to streamline processes and reduce energy consumption.
Choose A Greener Data Centre
There are two main factors that contribute to the carbon footprint of your data centre. The first is the type of energy sources used by the facility. Data centres powered by clean energy, such as wind and solar, are better for the environment. Many cloud providers have some form of climate pledge and green-energy claims. However, in some cases these claims are achieved through offsets: if a facility uses fossil fuels and generates GHG, it can purchase carbon credits to offset them. This may sound artificial to you, and it is. A better approach is to power the data centres with renewable energy sources directly.
Want to learn more about carbon offset? Here is a hilarious yet informative video from John Oliver on the topic.
The second factor is called Power Usage Effectiveness (PUE). PUE measures the energy efficiency of a data centre. It is calculated by dividing the total amount of power entering a facility by the power used to run the IT equipment. A perfect PUE of 1.0 would mean all of the facility's power is delivered to IT equipment; in practice, energy is also needed for cooling systems, lighting and other equipment, so real-world values sit above 1.0.
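The calculation itself is a one-liner; the figures below are illustrative, not measurements from any particular facility.

```python
def pue(total_facility_kw, it_equipment_kw):
    # PUE = total power entering the facility / power used by IT equipment.
    return total_facility_kw / it_equipment_kw

# A facility drawing 1,100 kW in total to run 1,000 kW of IT load:
print(round(pue(1100, 1000), 2))
# → 1.1
```

In other words, this hypothetical facility spends an extra 10% of its IT power on cooling, lighting and other overheads.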
Out of the three key cloud providers mentioned above, GCP has the best reporting standard when it comes to their practices on clean energy, followed by Microsoft Azure. Google reports a 1.10 PUE across its data centres and aims to operate on carbon-free energy 24/7 by 2030.
When selecting the region for your Google Cloud services, GCP will highlight the ones with the lowest carbon impact with a “Low CO2” icon.
Keep Track Of Your Usage
Finally, a detailed understanding of your data usage and its carbon footprint can help you monitor trends and identify efficiency gains. Check out the tools offered by cloud providers: AWS, Microsoft Azure, and GCP.
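As a starting point for your own tracking, a simple estimate multiplies your compute energy by the facility's PUE and the carbon intensity of the grid powering it. The sketch below uses assumed figures for illustration; real numbers should come from your provider's reporting tools and your region's grid data.

```python
def co2e_kg(compute_kwh, pue, grid_intensity_kg_per_kwh):
    # Total facility energy = IT energy * PUE; emissions then scale
    # with the carbon intensity of the energy mix powering the facility.
    return compute_kwh * pue * grid_intensity_kg_per_kwh

# Assumed figures for illustration: 500 kWh of monthly compute, a PUE
# of 1.10, and a grid intensity of 0.4 kg CO2e/kWh (varies widely by
# region and energy mix).
print(round(co2e_kg(500, 1.10, 0.4), 1), "kg CO2e")
# → 220.0 kg CO2e
```

Tracking this figure month over month makes efficiency gains (moving a workload to a low-CO2 region, trimming an oversized pipeline) visible in a single number.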
If you have other ideas or are interested in learning more about reducing your data carbon footprint and costs, please reach out or comment below. I would love to chat!