📚 The Future of the Modern Data Stack in 2023
Featuring 4 new emerging trends and 6 big trends from last year
This article was co-written with Christine Garcia (Director of Content at Atlan).
In my annual tradition, I spent some downtime at the end of 2022 reflecting on what happened in the data world. As you can tell by the length of this article, my co-author Christine and I had a lot of thoughts.
For the last few years, data has been in hypergrowth mode. The community has been stirring up controversy with hot takes, debating the latest tech, raising important conversations, and duking it out on Twitter with Friday fights. The data world seemed infinite, and everyone was just trying to keep up as it exploded.
Now data is entering a different world. 98% of CEOs expect a recession within the next 12–18 months. Companies are preparing for war by amping up the pressure, laying people off, cutting budgets, and shifting from growth mode to efficiency mode.
So what does this mean for the data world? And, more importantly, for data leaders and practitioners? This article breaks down the 10 big trends for 2023 that anyone in the data space should know about—4 emerging trends that will be a big deal, and 6 existing trends that are poised to grow even further.
With the recent economic downswing, the tech world is looking into 2023 with a new focus on efficiency and cost-cutting. This will lead to four new trends related to how modern data stack companies and data teams operate.
Storage has always been one of the biggest costs for data teams. For example, Netflix spent $9.6 million per month on AWS data storage. As companies tighten their budgets, they’ll need to take a hard look at these bills.
Snowflake and Databricks have already been investing in product optimization. They’ll likely introduce even more improvements to help customers cut costs this year.
For example, at its June conference, Snowflake highlighted product improvements to speed up queries, reduce compute time, and cut costs. It announced 10% faster compute on AWS on average, 10–40% faster performance for write-heavy DML workloads, and 7–10% lower storage costs from better compression.
At its June conference, Databricks also devoted part of its keynote to cost-saving product improvements, such as the launches of Enzyme (an automatic optimizer for ETL pipelines) and Photon (a query engine with up to 12x better price/performance).
Later in the year, both Snowflake and Databricks doubled down by investing further in cost optimization features, and more are sure to come next year. Snowflake even highlighted cost-cutting as one of its top data trends for 2023 and affirmed its commitment to minimizing cost while increasing performance.
In 2023, we’ll also see the growth of tooling from independent companies and storage partners to further reduce data costs.
Modern data stack partners will also likely introduce compatible optimization features, like dbt’s incremental models and packages. dbt Labs and Snowflake even wrote an entire white paper together on optimizing your data with dbt and Snowflake.
Metadata also has a big role to play here. With a modern metadata platform, data teams can use popularity metrics to find unused data assets, column-level lineage to see when assets aren’t connected to pipelines, redundancy features to delete duplicate data, and more. Much of this can even be automated with active metadata, like automatically optimizing data processing or purging stale data assets.
For example, a data team we work with reduced their monthly storage costs by $50,000 just by finding and removing an unused BigQuery table. Another team deprecated 30,000 unused assets (or two-thirds of their data estate) by finding tables, views, and schemas that weren’t used upstream.
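To make the idea concrete, here's a minimal Python sketch of the kind of check an active metadata platform might automate: flagging assets with no recent reads and no downstream lineage. The metadata fields (`last_accessed`, `query_count_90d`, `downstream_assets`) are hypothetical stand-ins for whatever your catalog actually exposes.

```python
from datetime import datetime, timedelta

def find_stale_assets(assets, days_unused=90, now=None):
    """Flag assets with no reads in the window and no downstream consumers.

    `assets` is a list of dicts with hypothetical metadata fields:
    name, last_accessed (datetime), query_count_90d, downstream_assets.
    """
    now = now or datetime.now()
    cutoff = now - timedelta(days=days_unused)
    stale = []
    for asset in assets:
        unused = asset["last_accessed"] < cutoff and asset["query_count_90d"] == 0
        orphaned = len(asset["downstream_assets"]) == 0
        if unused and orphaned:
            stale.append(asset["name"])
    return stale
```

A real platform would feed this from query history and column-level lineage, and could go further by routing the flagged list into an automated deprecation or archival workflow.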
Jake Thomas on Twitter: "Data engineers should not write ETL. Instead, they should spend ~80% of their day keeping airflow running, ~10% cramming dbt manifest.json into dags, and ~10% figuring out how to not get yelled at by finance for the Snowflake bill."
In the past few years, data teams have been able to run free with less regulation and oversight.
Companies have so much belief in the power and value of data that data teams haven't always been required to prove that value. Instead, they've chugged along, balancing daily data work with forward-looking tech, process, and culture experiments. Optimizing how data people work has always been part of the data discussion, but it has often taken a back seat to seemingly more pressing concerns like building a super cool tech stack.
Next year, this will no longer cut it. As budgets tighten, data teams and their stacks will get more attention and scrutiny. How much do they cost, and how much value are they providing? Data teams will need to focus on performance and efficiency.
In 2023, companies will get more serious about measuring data ROI, and data team metrics will start becoming mainstream.
It’s not easy to measure ROI for a function as fundamental as data, but it’s more important than ever that we figure it out.
This year, data teams will start developing proxy metrics to measure their value. These may include usage metrics like data usage (e.g. DAU, WAU, MAU, and QAU), page views or time spent on data assets, and data product adoption; satisfaction metrics like a d-NPS score for data consumers; and trust metrics like data downtime and data quality scores.
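As a sketch of what one such proxy metric could look like in practice, here's a hypothetical weekly-active-users calculation over a warehouse query log. The `(user, asset, date)` log shape is an assumption for illustration, not any particular warehouse's audit format.

```python
from collections import defaultdict
from datetime import date, timedelta

def weekly_active_users(query_log, week_start):
    """Count distinct users who queried each asset in the week starting `week_start`.

    `query_log` is a list of (user, asset, date) tuples -- a stand-in for
    whatever audit log your warehouse exposes.
    """
    week_end = week_start + timedelta(days=7)
    users_by_asset = defaultdict(set)
    for user, asset, day in query_log:
        if week_start <= day < week_end:
            users_by_asset[asset].add(user)
    return {asset: len(users) for asset, users in users_by_asset.items()}
```

Run weekly, a metric like this gives a data team a simple trend line for which assets are actually being used — the starting point for any ROI conversation.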
Sarah Catanzaro on Twitter: "Tell me how you're USING data. I'm sick of hearing about how you're producing data or building data stacks. As one great data scientist once said, the only stacks that matter are those of benjamins."
For years, the modern data stack has been growing. And growing. And growing some more.
As VCs pumped in millions of dollars in funding, new tools and categories popped up every day. But now, with the economic downturn, this growth phase is over. VC money has already been drying up — just look at the decrease in funding announcements over the last six months.
We’ll see fewer data companies and tools launching next year and slower expansion for existing companies. Ultimately, this is probably good for buyers and the modern data stack as a whole.
Yes, hypergrowth mode is fun and exciting, but it’s also chaotic. We used to joke that it would suck to be a data buyer right now, with everyone claiming to do everything. The result is some truly wild stack diagrams.
This lack of capital will force today’s data companies to focus on what matters and ignore the rest. That means fewer “nice to have” features. Fewer splashy pivots. Fewer acquisitions that make us wonder “Why did they do that?”
With limited funds, companies will have to focus on what they do best and partner with other companies for everything else, rather than trying to tackle every data problem in one platform. This will lead to the creation of the “best-in-class modern data stack” in 2023.
As the chaos calms down and data companies focus on their core USP, the winners of each category will start to become clear.
These tools will also focus on working even better with each other. They’ll act as launch partners, aligning behind common standards and pushing the modern data stack forward. (A couple of examples from last year are Fivetran’s Metadata API and dbt’s Semantic Layer, where close partners like us built integrations in advance and celebrated the launch as much as Fivetran and dbt Labs.)
These partnerships and consolidation will make it easier for buyers to choose tools and get started quickly, a welcome change from how things have been.
Seth Rosen on Twitter: "Previously: the simple "modern data stack" when first discovered by data teams was *magic*. It made it absurdly easy to quickly centralize, transform, and present data for analytics. Now: the MDS is a landscape of hundreds of logos across all of "data". RIP MDS (magic data stack)"
Tech companies are facing new pressure to cut costs and increase revenue in 2023. One way to do this is by focusing on their core functions, as mentioned above. Another way is seeking out new customers.
Guess what the largest untapped source of data customers is today? Enterprise companies with legacy, on-premise data systems. To serve these new customers, modern data stack companies will have to start supporting legacy tools.
In 2023, the modern data stack will start to integrate with Oracle and SAP, the two enterprise data behemoths.
This may sound controversial, but it’s already begun. The modern data stack started reaching into the on-prem, enterprise data world over a year ago.
In October 2021, Fivetran acquired HVR, an enterprise data replication tool. Fivetran said that this would allow it to “address the massive market for modernizing analytics for operational data associated with ERP systems, Oracle databases, and more”. This was the first major move from a modern data stack company into the enterprise market.
Matthew Mullins on Twitter: "Every modern data stack diagram has one dwh at the center, but every enterprise I've ever talked to has at least three. The one that's current, the one they're moving to, and the one they got in an acquisition."
These are six of the big ideas that blew up in the data world last year and only promise to get bigger in 2023.
This was one of the big trends from last year’s article, so it’s not surprising that it’s still a hot topic in the data world. What was surprising, though, was how fast the ideas of active metadata and third-generation data catalogs continued to grow.
In a major shift from 2021, when these ideas were new and few people were talking about them, many companies are now competing to claim the category.
Metadata is seen as one of the big gaps in the data world, so even as VC funding started to dry up, there were some big raises in the cataloging space last year. These included Alation’s $123M Series E, Data.world’s $50M Series C, our $50M Series B, and Castor’s $23.5M Series A.
The other cause for this growth was analysts, who embraced and amplified the ideas of active metadata and modern data catalogs in 2022. For example, Gartner went all in on active metadata at its annual conference and G2 released a new “Active Metadata” category.
Our prediction is that active metadata platforms will replace the “data catalog” category in 2023.
This actually started last year when Forrester renamed its Wave report on “Machine Learning Data Catalogs” and reversed its rankings. It moved the 2021 Leaders (Alation, IBM, and Collibra) to the bottom and middle tiers of its 2022 Wave report, and replaced them with a new set of companies (us, Data.world, and Informatica).
The “data catalog” is just a single use case of metadata: helping users understand their data assets. But that barely scratches the surface of what metadata can do.
Activating metadata holds the key to dozens of use cases like observability, cost management, remediation, quality, security, programmatic governance, optimized pipelines, and more — all of which are already being actively debated in the data world. Here are a few real examples:
- Eventbridge event-based actions: Allows data teams to create production-grade, event-driven metadata automations, like alerts when ownership changes or auto-tagging classifications.
- Trident AI: Uses the power of GPT-3 to automatically create descriptions and READMEs for new data assets, based on metadata from earlier assets.
- GitHub integration: Automatically creates a list of affected data assets during each GitHub pull request.
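In the same spirit, an event-driven metadata automation can be sketched as a small router from change events to actions. The event shape and handler names below are purely illustrative, not any platform's real API.

```python
def handle_metadata_event(event, notify, tag_asset):
    """Route a metadata change event to an automation.

    `event` is a hypothetical dict with a `type` field; `notify` and
    `tag_asset` are callbacks supplied by the surrounding platform.
    """
    if event["type"] == "OWNER_CHANGED":
        # Alert stakeholders when an asset's ownership changes
        notify(f"Ownership of {event['asset']} changed to {event['new_owner']}")
    elif event["type"] == "ASSET_CREATED" and "email" in event.get("columns", []):
        # Auto-tag likely PII based on a simple column-name heuristic
        tag_asset(event["asset"], "PII")
```

The point is less the heuristics than the pattern: once metadata changes flow through an event bus, governance actions become ordinary, testable code.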
This started in August with Chad Sanderson’s newsletter on “The Rise of Data Contracts”. He later followed this up with a two-part technical guide to data contracts with Adrian Kreuziger. He then spoke about data contracts on the Analytics Engineering Podcast — with us! (Shoutout to Chad, Tristan Handy, and Julia Schottenstein for a great chat.)
The core driver of data contracts is that engineers have no incentive to create high-quality data.
Because of the modern data stack, the people who create data have been separated from the people who consume it. As a result, companies end up with GIGO data systems — garbage in, garbage out.
The data contract aims to solve this by creating an agreement between data producers and consumers. Data producers commit to producing data that adheres to certain rules — e.g. a set data schema, SLAs around accuracy or completeness, and policies on how the data can be used and changed. After agreeing on the contract, data consumers can create downstream applications with this data, assured that engineers won’t unexpectedly change the data and break live data assets.
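A minimal sketch of what enforcing such a contract could look like, assuming a hypothetical contract format (expected types plus required, non-null fields). Real implementations typically lean on schema registries or validation tools like jsonschema rather than hand-rolled checks.

```python
CONTRACT = {
    # Hypothetical contract for an `orders` event stream: field -> expected type,
    # plus the fields the producer guarantees are present and never null.
    "schema": {"order_id": str, "amount_cents": int, "currency": str},
    "required": {"order_id", "amount_cents"},
}

def validate_against_contract(record, contract):
    """Return a list of violations; an empty list means the record honors the contract."""
    violations = []
    for field, expected_type in contract["schema"].items():
        if field not in record:
            if field in contract["required"]:
                violations.append(f"missing required field: {field}")
        elif record[field] is not None and not isinstance(record[field], expected_type):
            violations.append(f"{field}: expected {expected_type.__name__}")
        elif record[field] is None and field in contract["required"]:
            violations.append(f"{field} must not be null")
    return violations
```

Run in the producer's CI or at publish time, a check like this is what turns the "agreement" from a document into something that actually blocks breaking changes.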
After Chad Sanderson’s newsletter went live, this conversation blew up. It spread across Twitter and Substack, where the data community argued whether data contracts were an important conversation, frustratingly vague or self-evident, not actually a tech problem, doomed to fail, or obviously a good idea. People hosted Twitter fights, created epic threads, and watched battle royales from a safe distance, popcorn in hand.
While data contracts are an important issue in their own right, they’re part of a larger conversation about how to ensure data quality.
It’s no secret that data is often outdated, incomplete, or incorrect — the data community has been talking about how to fix this for years. First, people said metadata documentation was the solution; then it was data product shipping standards. Now the buzzword is data contracts.
This is not to dismiss data contracts, which may be the solution we’ve been waiting for. But it seems more likely that data contracts will be subsumed in a larger trend around data governance.
In 2023, data governance will start shifting “left”, and data standards will become a first-class citizen in orchestration tools.
For decades, data governance has been an afterthought. It’s often handled by data stewards rather than data producers, with documentation created long after the data itself.
However, we’ve recently seen a shift to move data governance “left”, or closer to data producers. This means that whoever creates the data (usually a developer or engineer) must create documentation and check the data against pre-defined standards before it can go live.
Major tools have recently made changes that support this idea, and we expect to see even more in the coming year:
- dbt’s YAML files and Semantic Layer, where analytics engineers can create READMEs and define metrics while creating a dbt model
- Airflow’s OpenLineage integration, which tracks metadata about jobs and datasets as DAGs execute
- Fivetran’s Metadata API, which provides metadata for data synced by Fivetran connectors
- Atlan’s GitHub extension, which creates a list of downstream assets that will be affected by a pull request
Alex Dean on Twitter: "It's technically not a Data Contract unless it comes from the Côte d'Ata region of France. Otherwise it's just a sparkling schematization."
Also called a “metrics layer” or “business layer”, the semantic layer is an idea that’s been floating around the data world for decades.
The semantic layer is a literal term — it’s the “layer” in a data architecture that uses “semantics” (words) that the business user will understand. Instead of raw tables with column names like “A000_CUST_ID_PROD”, data teams build a semantic layer and rename that column “Customer”. Semantic layers hide complex code from business users while keeping it well-documented and accessible for data teams.
In October 2022, dbt Labs made a big splash at its annual conference by announcing the new Semantic Layer.
The core concept behind dbt’s Semantic Layer: define things once, use them anywhere. Data producers can now define metrics in dbt, then data consumers can query those consistent metrics in downstream tools. Regardless of which BI tool they use, analysts and business users can look up a stat in the middle of a meeting, confident that their answer will be correct.
Making metrics part of data transformation intuitively makes sense. Making them part of dbt — the dominant transformation tool, which is already well-integrated with the modern data stack — is exactly what the semantic layer needed to go from idea to reality.
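To illustrate the define-once, use-anywhere idea, here's a toy metrics registry in Python. This is not dbt's actual API — the registry format and function names are invented for the sketch — but it shows why a single metric definition keeps every downstream tool consistent.

```python
# A toy metrics layer: each metric is defined once, and every downstream
# tool resolves the same SQL. Names and fields are illustrative only.
METRICS = {
    "revenue": {
        "sql": "SUM(amount_cents) / 100.0",
        "table": "analytics.orders",
        "description": "Gross revenue in dollars",
    },
}

def compile_metric(name, group_by=None):
    """Render a consistent query for a metric, regardless of which tool asks."""
    metric = METRICS[name]
    select = ([group_by] if group_by else []) + [f"{metric['sql']} AS {name}"]
    clauses = [f"SELECT {', '.join(select)}", f"FROM {metric['table']}"]
    if group_by:
        clauses.append(f"GROUP BY {group_by}")
    return "\n".join(clauses)
```

Whether a BI dashboard or a notebook asks for "revenue", the generated SQL is identical — which is exactly the guarantee a semantic layer exists to provide.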
Since dbt’s Semantic Layer launched, progress has been fairly measured — in part because this happened less than three months ago, and in part because changing the way that people write metrics will take time.
In 2023, the first set of Semantic Layer implementations will go live.
Many data teams have spent the last couple of months exploring the impact of this new technology — experimenting with the Semantic Layer and thinking through how to change their metrics frameworks.
This process gets easier as more tools in the modern data stack integrate with the Semantic Layer. Seven tools were Semantic Layer–ready at its launch (including us, Hex, Mode, and ThoughtSpot). Eight more tools were Metrics Layer–ready, an intermediate step to integrating with the Semantic Layer.
Robert Yi 🐳 on Twitter: "Okay let's do this: Who's going to win the battle for the semantic layer? (comment if someone else)"
In 2022, some of the main players in reverse ETL (one of last year’s big trends) sought to redefine their category as “data activation”, a new take on the “customer data platform”.
A CDP combines data from all customer touchpoints (e.g. website, email, social media, help center, etc). A company can then segment or analyze that data, build customer profiles, and power personalized marketing. For example, they can create an automated email with a discount code if someone abandons their cart, or advertise to people who have visited a specific page on the website and used the company’s live chat.
CDPs are designed around using data, rather than simply aggregating and storing it. This is where data activation comes in — “activating” data from the warehouse to handle CDP functions.
Data activation in various forms has been around for a few years. However, this idea of data activation as the new CDP took off in 2022.
For example, Arpit Choudhury analyzed the space in April, Sarah Krasnik broke down the debate in July, Priyanka Somrah included it as a data category in August, and Luke Lin called out data activation in his 2023 data predictions last month.
In part, this trend was caused by marketing from former reverse ETL companies, who now brand themselves as data activation products. For example, Hightouch rebranded itself with a big splash in April, dropping three blogs on data activation in five days:
- Data Activation: The Next Step After Analytics by Pedram Navid
- Hightouch: The Data Activation Platform by Kashish Gupta
- What is Data Activation? by Luke Kline
In part, this can also be traced to the larger debate around driving data use cases and value, rather than focusing on data infrastructure or stacks. As Benn Stancil put it, “Why has data technology advanced so much further than the value a data team provides?”
In part, this was also an inevitable result of the modern data stack. Stacks like Snowflake + Hightouch have the same data and functionality as a CDP, but they can be used across a company rather than for only one function.
CDPs made sense in the past. When it was difficult to stand up a data platform, having an out-of-the-box, perfectly customized customer data platform for business users was a big win.
Now, though, the world has changed, and companies can set up a data platform in under 30 minutes — one that not only has customer data, but also all other important company data (e.g. finance, product/users, partners, etc).
At the same time, data work has been consolidating around the modern data stack. Salesforce once tried to handle its own analytics (called Einstein Analytics). Now it has partnered with Snowflake, and Salesforce data can be piped into Snowflake just like any other data source.
The same thing has happened for most SaaS products. While internal analytics was once their upsell, they are now realizing that it makes more sense to move their data into the existing modern data ecosystem. Instead, their upsell is now syncing data to warehouses via APIs.
In this new world, data activation becomes very powerful. The modern data warehouse plus data activation will replace not only the CDP, but also all pre-built, specialized SaaS data platforms.
With the modern data stack, data is now created in specialized SaaS products and piped into storage systems like Snowflake, where it is combined with other data and transformed in the API layer. Data activation is then crucial for piping insights back into the source SaaS systems where business users do their daily work.
For example, Snowflake acquired Streamlit, which allows people to create data apps and pre-built templates on top of Snowflake. Rather than developing their own analytics or relying on CDPs, tools like Salesforce can now let their customers sync data to Snowflake and use a pre-built Salesforce app to analyze the data or take custom actions (like cleaning a lead list with Clearbit) with one click. The result is the customization and user-friendliness of a CDP, combined with the power of modern cloud compute.
Jessica Laughlin on Twitter: "looking for funding for my new startup idea: Reverse Data Activation"
Here’s the TL;DR from Data Mesh Learning Community: “The shortest summary: treat data as a product, not a by-product. By driving data product thinking and applying domain driven design to data, you can unlock significant value from your data. Data needs to be owned by those who know it best.”
The data mesh was everywhere in 2021. In 2022, it started to move from abstract idea to reality.
The data mesh conversation has shifted from “What is it?” to “How can we implement it?” as real user stories grew and the data mesh’s four pillars became less abstract and more actionable.
Meanwhile, companies started to brand themselves around the data mesh. So far, we’ve seen this with Starburst, Databricks, Oracle, Google Cloud, Dremio, Confluent, Denodo, Soda, lakeFS, and K2 View, among others.
Four years after it was created, we’re still in the early phases of the data mesh.
In 2023, the first wave of data mesh “implementations” will go live, with the “data as a product” concept front and center.
This year, we’ll start seeing more and more real data mesh architectures — not the aspirational diagrams that have been floating around data blogs for years, but real architectures from real companies.
The data world will also start to converge on a best-in-class reference architecture and implementation strategy for the data mesh. This will include the following core components:
- Metadata platform that can integrate into developer workflows (e.g. Atlan’s APIs and GitHub integration)
- Data quality and testing (e.g. Great Expectations, Monte Carlo)
- Git-like process for data producers to incorporate testing, metadata management, documentation, etc. (e.g. dbt)
- All built around the same central data warehouse/lakehouse layer (e.g. Snowflake, Databricks)
John Cutler on Twitter: "the opposite of data mesh is data meh: producers do whatever they want, and some poor soul needs to make sense of it all so it can be consumed"
One of our big trends from last year, data observability held its own and continued to grow alongside adjacent ideas like data quality and reliability.
Existing companies got even bigger (e.g. Databand’s acquisition by IBM in July 2022), new companies went mainstream (e.g. Kensu jumped into thought leadership), and new tools launched every month (e.g. Bigeye’s Metadata Metrics).
In a notable change, this space also saw significant open-source growth in 2022.
Datafold launched an open-source diff tool, Acceldata open-sourced its data platform and data observability libraries, and Soda launched both its open-source Soda Core and enterprise Soda Cloud platforms.
One of our open questions in last year’s report was where data observability was heading — towards its own category, or merging with another category like data reliability or active metadata.
Our prediction: data observability and quality will converge into a larger “data reliability” category centered on ensuring high-quality data.
This may seem like a big change, but many of the companies in these categories have changed names multiple times over the years, such as Datafold going from data diffs to a “data reliability platform”.
As these companies compete to define and own the category, we’ll continue to see more confusion in the short term. However, there are early signs that this will start to settle down into one category in the near future.
Sarah Catanzaro on Twitter: "Are we turning a page? I sense a growing recognition among data Twitter that perhaps it's the data and not the tools, people, or titles that's the source of our collective grief?"
It’s an interesting moment to welcome 2023 as data practitioners. While there’s a lot of uncertainty looming in the air (uncertainty is the new certainty!), we’re also a bit relieved.
2021 and 2022 were absurd years in the history of the data stack.
The hype was crazy, new tools were launching every day, data people were constantly being poached by data startups, and VCs were throwing money at every data practitioner who even hinted at building something. The “modern data stack” was finally cool, and the data world had all the money and support and acknowledgment it needed.
At Atlan, we started as a data team ourselves. As people who have been in data for over a decade, this was a wild time. Progress is generally made in decades, not years. But in the last three years, the modern data stack has grown and matured as much as in the decade before.
It was exciting… yet we ended up asking ourselves existential questions more than once. Is this modern data stack thing real, or is it just hype fueled by VC money? Are we living in an echo chamber? Where are the data practitioners in this whole thing?
While this hype and frenzy led to great tooling, it was ultimately bad for the data world.
Confronted by a sea of buzzwords and products, data buyers often ended up confused, spending more time assembling the right stack than actually using it.
Let’s be clear — the goal of the data space is ultimately to help companies leverage data. Tools are important for this. But they’re ultimately an enabler, not the goal.
As this hype starts to die down and the modern data stack starts to stabilize, we have the chance to take the tooling progress we’ve made and translate it into real business value.
We’re at a point where data teams aren’t fighting to set up the right infrastructure. With the modern data stack, setting up a data ecosystem is quicker and easier than ever. Instead, data teams are fighting to prove their worth and deliver more results with less time and fewer resources.
Now that companies can’t throw money around, their decisions need to be targeted and data-driven. This means that data is more important than ever, and data teams are in a unique position to provide real business value. But to make this happen, data teams need to finally figure out this “value” question.
Now that we’ve got the modern data stack down, it’s time to figure out the modern data culture stack. What does a great data team look like? How should it work with business? How can it drive the most impact in the least time?
These are tough questions, and there won’t be any quick fixes. But if the data world can crack the secrets to a better data culture, we can finally create dream data teams — ones that will not just help their companies survive during the next 12–18 months, but propel them to new heights in the coming decades.
Ready for spicy takes on these ideas? On February 24, a panel of data superstars (Bob Muglia, Barr Moses, Benn Stancil, Douglas Laney, and Tristan Handy) will debate the future of data in 2023. Save your spot for the next Great Data Debate.