Must-Know Trends about the Modern Data Stack

Posted at 2022-04-19

Modern Data Stack

In the data world, the modern data stack has been a hot topic in recent years. What is the modern data stack?

From The Modern Data Experience

  • To analytics engineers, it’s a transformational shift in technology and company organization.
  • To startup founders, it’s a revolution in how companies work.
  • To VCs, it’s a $100 billion opportunity.
  • To engineers, it’s a dynamic architectural roadmap.
  • To Gartner, it’s the foundation of a new data and analytics strategy.
  • To thought leaders, it’s a data mesh.
  • To an analyst with an indulgent blog on the internet, it’s a new orientation, a new nomenclature, and a bunch of other esoteric analogies that only someone living deep within their own navel would care about.

From Ananth Packkildurai

MDS is a set of vendor tools that solve niche data problems (lineage, orchestration, quality) with the side effect of creating a disjointed data workflow that makes data folks’ lives more complicated.

From Fivetran

A radically new approach to data integration saves engineering time, allowing engineers and analysts to pursue higher-value activities.

Work Like an Analyst

Data Analysis Democratization

We used to think that only people with a data- prefixed title needed to analyze data. In the real world, however, many low-code/no-code tools now empower people across business and technical roles to run their own analyses on demand. Data analysis is essentially being democratized within organizations, even as these diverse tools complicate the modern data stack.

Analysis is becoming a team sport. Business stakeholders aren’t kicking questions over to their data help desks and waiting to get reports back to review; everyone is contributing together.

Tristan Handy predicted that analytical talent wouldn’t be concentrated in a “select few specialists,” but would “live inside of operational areas of the business.”

Analysis is becoming multi-functional; the democratization of analytical reasoning and technical talent, not just data, is far more foundational than whatever reshuffling happens among the data stack’s middleware.

I’m now realizing that misses the real point: The divisions of labor between analysts and everyone else are fading. Analysis is getting bundled with other functions; the behaviors of analysts and non-analysts are overlapping; analysts are becoming positionless.

The reason to build a “modern data experience” isn’t to unify the disjointed products of a bunch of startups; it’s to serve a world in which far more people want to work like analysts.

More

The Unbundling of Airflow

Heavy users of Airflow can do a vast variety of data-related tasks without leaving the platform: from extract and load scripts to generating reports, from transformations with Python and SQL to syncing data back to BI tools.

Before the fragmentation of the data stack, it wasn’t uncommon to create end-to-end pipelines with Airflow. Organizations used to build almost entire data workflows as custom scripts developed by in-house data engineers. Bigger companies even built their own frameworks inside Airflow, for example frameworks with dbt-like functionality for SQL transformations in order to make it easier for data analysts to write these pipelines.

Today, data practitioners have many tools under their belt and only very rarely do they have to reach for a tool like Airflow. Fivetran and Airbyte took care of the extract and load scripts that one might write with Airflow. dbt came for the data transformations, Census and Hightouch for Reverse ETL. Metrics and experimentation layers are also getting their own focused tooling: metrics with tools like Transform, Metriql, and Supergrain, and experimentation with Eppo. Certain companies relied on Airflow for data science and ML workloads, but with the popularity of MLOps, that layer is also being abstracted out. Open source tools like Feast are unbundling best practices for feature management that used to exist as independent Python scripts.
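Before these layers split off, a single DAG often carried a pipeline from raw extraction all the way to reporting. The sketch below shows what such an end-to-end Airflow DAG might look like; the task bodies and schedule are placeholders, not a real integration.

```python
# Minimal sketch of an end-to-end pipeline kept entirely inside Airflow.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_and_load():
    # e.g. pull rows from an API and load them into the warehouse
    pass


def transform():
    # e.g. run SQL or pandas transformations on the raw tables
    pass


def publish_report():
    # e.g. refresh a dashboard or email a summary
    pass


with DAG(
    dag_id="end_to_end_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract_and_load", python_callable=extract_and_load)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    report_task = PythonOperator(task_id="publish_report", python_callable=publish_report)

    extract_task >> transform_task >> report_task
```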

The Epic Debate: Bundling or Unbundling

The viewpoint of bundling

Most importantly, customers will start demanding less complexity as they make choices about their data stack. This is where bundling will start to win.

We need to merge both the model and task execution unit into one unit. Otherwise, any abstraction we build without the unification will further amplify the disorganization of the data.

The viewpoint of unbundling

The only reality in the data world is diversity — data engineers, analysts, analytics engineers, data scientists, product managers, business analysts, citizen data scientists, and more.
Each of these people have their own favorite and equally diverse data tools, everything from SQL, Looker, and Jupyter to Python, Tableau, dbt, and R. And data projects have their own technical requirements and peculiarities — some need real-time processing while some need speed for ad-hoc analysis, leading to a whole host of data infrastructure technologies (warehouses, lakehouses, and everything in between).

We believe that the key to helping our data stack work together is in activating metadata. One use case for metadata activation could be as simple as notifying downstream consumers of upstream changes.
When a data store changes:

  1. Refresh metadata: Crawl the data store to retrieve its updated metadata.
  2. Detect changes: Compare the new metadata against the previous metadata. Identify any changes that could cause an impact — adding or removing columns, for example.
  3. Find dependencies: Use lineage to find users of the data store. These could include transformation processes, other data stores, BI dashboards, and so on.
  4. Notify consumers: Notify each consumer through their preferred communication channel — Slack, Jira, etc.
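As a rough illustration of that flow, here is a minimal Python sketch of steps 2-4; the lineage mapping and schema snapshots are hard-coded stand-ins for a metadata store and crawler, and the notification is just a print rather than a Slack or Jira call.

```python
# Illustrative sketch of the steps above; not any vendor's API.
import json

# Step 3 input: lineage from a table to its downstream consumers (dashboards, models, ...)
LINEAGE = {
    "analytics.orders": ["revenue_dashboard", "orders_dbt_model"],
}


def detect_changes(old_schema, new_schema):
    """Step 2: compare column sets per table and report additions/removals."""
    changes = {}
    for table in set(old_schema) | set(new_schema):
        added = set(new_schema.get(table, [])) - set(old_schema.get(table, []))
        removed = set(old_schema.get(table, [])) - set(new_schema.get(table, []))
        if added or removed:
            changes[table] = {"added": sorted(added), "removed": sorted(removed)}
    return changes


def notify_consumers(changes):
    """Steps 3-4: find downstream consumers via lineage and send them a message.
    A real implementation would post to Slack or open a Jira ticket here."""
    for table, diff in changes.items():
        for consumer in LINEAGE.get(table, []):
            print(f"notify {consumer}: {table} changed: {json.dumps(diff)}")


# Step 1 would crawl the data store; here two schema snapshots stand in for it.
old = {"analytics.orders": ["id", "amount", "customer_id"]}
new = {"analytics.orders": ["id", "amount", "customer_id", "discount"]}
notify_consumers(detect_changes(old, new))
```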

The viewpoint of a cycle of both

The challenge with a fully bundled stack is that resources are always limited and innovation stalls. This gap will create an opportunity for unbundling, and so I believe we’ll go through cycles of bundling and unbundling.

More

Introducing Software-Defined Assets

One thing worth mentioning is that Dagster advocates a declarative approach by introducing software-defined assets, which bring many advantages to the modern data stack.

Declarative approaches are appealing because they make systems dramatically more debuggable, comprehensible, and automate-able. They do this by making intentions explicit and by offering a principled way of managing change. By explicitly defining what the world should look like — e.g. by specifying the set of servers that should exist — it becomes easy to discover when it does not look like that, reason about why, and reconcile.

Software-defined assets use code to define the data assets that you want to exist. These asset definitions, version-controlled through git and inspectable via tooling, allow anyone in your organization to understand your canonical set of data assets, allow you to reproduce them at any time, and offer a foundation for asset-based orchestration.

An asset definition can be invoked by the orchestrator to materialize that asset, i.e. to run the op on the contents of the upstream assets and then persist the results in storage, like a data warehouse, data lake, or ML model store.
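As a rough sketch of what this looks like in Dagster (the asset names and transformation bodies are made up for illustration), two assets are declared and the downstream one picks up its dependency from the parameter name:

```python
from dagster import asset, materialize


@asset
def raw_orders():
    # Upstream asset: in practice this might pull rows from an operational database
    return [{"id": 1, "amount": 120}, {"id": 2, "amount": 80}]


@asset
def order_summary(raw_orders):
    # Downstream asset: Dagster infers the dependency from the parameter name
    return {"count": len(raw_orders), "total": sum(o["amount"] for o in raw_orders)}


if __name__ == "__main__":
    # Ask the orchestrator to materialize both assets and persist the results
    materialize([raw_orders, order_summary])
```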

In large part due to its embrace of declarative data management, the Modern Data Stack has brought immense quality-of-life improvements to data practitioners. But these improvements come with some glaring gaps:

  • Python and other non-SQL programming languages have largely been left behind. As soon as you want to transform data with Pandas or PySpark, write a custom ingest script, or train an ML model, you’re back to writing imperative code in Airflow.
  • As we discussed at length in our "Rebundling the Data Platform" post last week, segregating ingest, SQL transformation, and ML into different purpose-built tools can mean losing sight of the asset graph that spans all of them. The lack of a shared orchestration layer results in an operationally fragile data platform that fosters a constant state of confusion about what ran, what's supposed to run, and whether things ran in the right order.

The Modern Data Stack has already started to embrace declarative, asset-based principles. Because it defines dependencies and scheduling policies at the asset level instead of at the task level, an asset-based orchestrator is the ideal orchestrator for this stack. Dagster brings in heterogeneous compute and a unified control plane, without requiring practitioners to revert to tasks and imperative programming.

More

The dbt Viewpoint

dbt is the well-known data build tool in the data world. It enables analytics engineers to transform data in their warehouses by simply writing select statements; dbt handles turning these select statements into tables and views.

The viewpoint of dbt is to build a mature analytics workflow.

Analytics

The center of gravity in mature analytics organizations has shifted away from proprietary, end-to-end tools towards more composable solutions made up of:

  • data integration scripts and/or tools,
  • high-performance analytic databases,
  • SQL, R, and/or Python, and
  • visualization tools.

Analytics is Collaborative

Analysis changes as data and businesses evolve, and it’s important to know who changed what, when.

Any code that generates data or analysis should be reviewed and tested.

Your analysis is a software application, and, like every other software application, people are going to have questions about how to use it.
Your code should come packaged with a basic description of how it should be interpreted, and your team should be able to add to that documentation as additional questions arise.

Think of the schema of a data set as its public interface. Create tables, views, or other data sets that expose a consistent schema and can be modified if business logic changes.

Analytic code, however, is often fragile. Changes in underlying data break most analytic code in ways that are hard to predict and to fix.

An automated workflow:

  • models and analysis are downloaded from multiple source control repositories,
  • code is configured for the given environment,
  • code is tested, and
  • code is deployed.
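One illustrative way to script that workflow, assuming the project is a dbt repository under source control; the repository URL and target name below are placeholders.

```python
# Hypothetical automation of the steps above: pull the project, configure it for
# the target environment, test it, then deploy (run) it.
import subprocess

REPO = "https://github.com/example-org/analytics.git"  # placeholder repository


def run(cmd, **kwargs):
    print(f"$ {' '.join(cmd)}")
    subprocess.run(cmd, check=True, **kwargs)


# 1. Models and analysis are downloaded from source control
run(["git", "clone", REPO, "analytics"])

# 2. Code is configured for the given environment (here via a dbt target)
target = "prod"

# 3. Code is tested
run(["dbt", "deps"], cwd="analytics")
run(["dbt", "test", "--target", target], cwd="analytics")

# 4. Code is deployed
run(["dbt", "run", "--target", target], cwd="analytics")
```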

More

The Future of the Modern Data Stack in 2022

Metric Layer and Headless BI

Data Mesh

Its core idea is that companies can become more data-driven by moving from centralized data warehouses and lakes to a “domain-oriented decentralized data ownership and architecture” driven by self-serve data and “federated computational governance”.

The data mesh isn’t a platform or a service that you can buy off the shelf. It’s a design concept with some wonderful ideas like distributed ownership, domain-based design, data discoverability, and data product shipping standards — all of which are worth trying to operationalize in your organization.

Metrics Layer

It’s been called the metrics layer, metrics store, headless BI, and even more names than I can list here.

Airbnb announced that it had been building a home-grown metrics platform called Minerva to solve this issue. Other prominent tech companies soon followed suit, including LinkedIn’s Unified Metrics Platform, Uber’s uMetric, and Spotify’s metrics catalog in their “new experimentation platform”.

A bunch of early-stage startups have launched to compete for this space. Transform is probably the biggest name so far, but Metriql, Lightdash, Supergrain, and Metlo also launched this year. Some bigger names are also pivoting to compete in the metrics layer, such as GoodData’s foray into Headless BI.
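Independent of any particular vendor's API, the core of a metrics layer is a single governed definition that every consumer compiles to the same query. A purely illustrative sketch (the field names and the assumed created_at column are not from any real tool):

```python
# Purely illustrative: not the API of Transform, Metriql, or any other vendor.
from dataclasses import dataclass, field


@dataclass
class Metric:
    name: str
    table: str
    expression: str                 # the aggregation, e.g. "sum(amount)"
    dimensions: list = field(default_factory=list)
    time_grain: str = "day"

    def to_sql(self) -> str:
        # Compile one consistent definition into SQL for any consumer (BI tool, notebook, ...).
        # Assumes the source table has a created_at timestamp column.
        period = f"date_trunc('{self.time_grain}', created_at)"
        select_cols = self.dimensions + [f"{period} as period"]
        group_cols = self.dimensions + [period]
        return (
            f"select {', '.join(select_cols)}, {self.expression} as {self.name} "
            f"from {self.table} group by {', '.join(group_cols)}"
        )


revenue = Metric("revenue", "analytics.orders", "sum(amount)", dimensions=["region"])
print(revenue.to_sql())
```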

Reverse ETL

This concept first started getting attention in February, when Astasia Myers (Founding Enterprise Partner at Quiet Capital) wrote an article about the emergence of reverse ETL.

Hightouch and Census have dominated the reverse ETL discussion this year, but they’re not the only ones in the space. Other notable companies are Grouparoo, HeadsUp, Polytomic, Rudderstack, and Workato.
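The mechanics behind these tools are straightforward to sketch: read an already-modeled table out of the warehouse and push it back into an operational system. The warehouse connection, table, and CRM endpoint below are hypothetical.

```python
# Hypothetical reverse ETL job: `conn` is any DB-API connection to the warehouse,
# and the CRM endpoint and token are placeholders, not a real vendor API.
import requests


def fetch_customer_scores(conn):
    # Read an already-modeled table out of the warehouse
    cur = conn.cursor()
    cur.execute("select customer_id, lifetime_value, churn_risk from analytics.customer_scores")
    return [
        {"customer_id": row[0], "lifetime_value": row[1], "churn_risk": row[2]}
        for row in cur.fetchall()
    ]


def sync_to_crm(rows, api_token):
    # Push each row back into the operational tool, e.g. as CRM contact fields
    for row in rows:
        requests.patch(
            f"https://crm.example.com/api/contacts/{row['customer_id']}",  # placeholder endpoint
            headers={"Authorization": f"Bearer {api_token}"},
            json={"lifetime_value": row["lifetime_value"], "churn_risk": row["churn_risk"]},
        )
```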

Metadata & Data Catalogs

This idea got amplified by a huge move Gartner made this year — scrapping its Magic Quadrant for Metadata Management Solutions and replacing it with the Market Guide for Active Metadata. In doing this, they introduced “active metadata” as a new category in the data space.
What’s the difference? Old-school data catalogs collect metadata and bring them into a siloed “passive” tool, aka the traditional data catalog. Active metadata platforms act as two-way platforms — they not only bring metadata together into a single store like a metadata lake, but also leverage “reverse metadata” to make metadata available in daily workflows.

Third-gen catalogs will leverage metadata to improve existing tools like Looker, dbt, and Slack, finally making the dream of an intelligent data management system a reality.

Data Teams as Product Teams

In 2021, Emilie Schario from Amplify Partners, Taylor Murphy from Meltano, and Eric Weber from Stitch Fix talked about a way to break data teams out of this trap: rethinking data teams as product teams. They first explained this idea with a blog on Locally Optimistic, followed by great talks at conferences like MDSCON, dbt Coalesce, and Future Data.

Data Observability

Data downtime has been a part of normal life on a data team for years. But now, with many companies relying on data for literally every aspect of their operations, it’s a huge deal when data stops working.

Yet everyone was just reacting to issues as they cropped up, rather than proactively preventing them. This is where data observability — the idea of “monitoring, tracking, and triaging of incidents to prevent downtime” — came in.
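A single observability check can be sketched in a few lines. The example below monitors table freshness against a lag threshold; the warehouse connection, the updated_at column, and the print-based alerting are assumptions for illustration.

```python
# Hypothetical freshness monitor: alert when a table has not been updated recently.
# `conn` is any DB-API style connection to the warehouse.
from datetime import datetime, timedelta, timezone


def check_freshness(conn, table, max_lag=timedelta(hours=6)):
    cur = conn.cursor()
    cur.execute(f"select max(updated_at) from {table}")  # assumes an updated_at column
    last_update = cur.fetchone()[0]
    lag = datetime.now(timezone.utc) - last_update
    if lag > max_lag:
        # In practice this would page the table's owner or post to Slack
        print(f"DATA DOWNTIME: {table} is {lag} behind (threshold {max_lag})")
    return lag
```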

The space went from being non-existent to hosting a bunch of companies, with a collective $200m of funding raised in 18 months. This includes Acceldata, Anomalo, Bigeye, Databand, Datafold, Metaplane, Monte Carlo, and Soda. People even started creating lists of new “data observability companies” to help keep track of the space.

Ideally, if you have all your metadata in one open platform, you should be able to leverage it for a variety of use cases (like data cataloging, observability, lineage and more).

Emerging Architectures for Modern Data Infrastructure

Modern Business Intelligence

What’s new:

  • There has been a surge of interest in the metrics layer, a system providing a standard set of definitions on top of the data warehouse. This has been hotly debated, including what capabilities it should have, which vendor(s) should own it, and what spec it should follow. So far, we’ve seen several credible pure-play products (like Transform and Supergrain), plus expansion into this category by dbt.
  • Reverse ETL vendors have grown meaningfully, particularly Hightouch and Census. The purpose of these products is to update operational systems, like CRM or ERP, with outputs and insights derived from the data warehouse.
  • Data teams are showing stronger interest in new applications to augment their standard dashboards, especially data workspaces (like Hex). Broadly speaking, new apps are likely the result of increasing standardization in cloud data warehouses — once data is cleanly structured and easy to access, data teams naturally want to do more with it.
  • Data discovery and observability companies have proliferated and raised substantial amounts of capital (especially Monte Carlo and Bigeye). While the benefits of these products are clear — i.e. more reliable data pipelines and better collaboration — adoption is still relatively early, as customers discover relevant use cases and budgets. (Technical note: although there are several credible new vendors in data discovery — e.g. Select Star, Metaphor, Stemma, Secoda, Castor — we have excluded seed-stage companies from the diagram in general.)

Multimodal Data Processing

What’s new:

  • There is growing recognition and clarity for the lakehouse architecture. We’ve seen this approach supported by a wide range of vendors (including AWS, Databricks, Google Cloud, Starburst, and Dremio) and data warehouse pioneers. The fundamental value of the lakehouse is to pair a robust storage layer with an array of powerful data processing engines like Spark, Presto, Druid/Clickhouse, Python libraries, etc.
  • The storage layer itself is getting an upgrade. While technologies like Delta, Iceberg, and Hudi are not new, they are seeing accelerated adoption and are being built into commercial products. Some of these technologies (particularly Iceberg) also interoperate with cloud data warehouses like Snowflake. If heterogeneity is here to stay, this is likely to become a key part of the multimodal data stack (see the sketch after this list).
  • There may be an uptick in adoption taking place for stream processing (i.e., real-time analytical data processing). While first-generation technologies like Flink still haven’t gone mainstream, new entrants with simpler programming models (like Materialize and Upsolver) are gaining early adoption, and, anecdotally, usage of stream processing products from incumbents Databricks and Confluent has also started to accelerate.
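To make the lakehouse pairing concrete, here is a small PySpark sketch that reads a Delta table from object storage, aggregates it, and writes the result back; the bucket paths are placeholders and a Spark session configured with the delta-spark package is assumed.

```python
# Minimal lakehouse sketch: a table format (Delta here) on object storage,
# read and written through a general-purpose engine (Spark).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lakehouse-sketch").getOrCreate()

# Read raw events stored as a Delta table on object storage (placeholder path)
events = spark.read.format("delta").load("s3://example-bucket/lake/events")

# A simple aggregation over the same data the warehouse engines could also query
daily_counts = events.groupBy(F.to_date("event_time").alias("day")).count()

# Write the derived table back to the lake in the same open format
daily_counts.write.format("delta").mode("overwrite").save("s3://example-bucket/lake/daily_counts")
```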

Artificial Intelligence and Machine Learning

What’s new:

  • The ML industry is consolidating around a data-centric approach, emphasizing sophisticated data management over incremental modeling improvements. This has several implications:
    • Rapid growth for data labeling (e.g. Scale and Labelbox) and growing interest in closed-loop data engines, largely modeled on Tesla’s Autopilot data pipelines.
    • Increased adoption for feature stores (e.g. Tecton), for both batch and real-time use cases, as a means to develop production-grade ML data in a collaborative way.
    • Revived interest in low-code ML solutions (like Continual and MindsDB) that at least partially automate the ML modeling process. These newer solutions focus on bringing new users (i.e. analysts and software developers) into the ML market.
  • Use of pre-trained models is becoming the default, especially in NLP, and providing tailwinds to companies like OpenAI and Hugging Face (see the example after this list). There are still meaningful problems to solve here around fine-tuning, cost, and scaling.
  • Operations tools for ML (sometimes called MLops) are becoming more mature, built around ML monitoring as the most in-demand use case and immediate budget. Meanwhile, a raft of new operational tools — including, notably, validation and auditing — are appearing, with the ultimate market still to be determined.
  • There is increased focus on how developers can seamlessly integrate ML models into applications, including through pre-built APIs (e.g. OpenAI), vector databases (e.g. Pinecone), and more opinionated frameworks.
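To illustrate the pre-trained-by-default pattern mentioned above, here is a short example using the Hugging Face transformers library; the model checkpoint is left to the library's default and the printed output is only indicative.

```python
# A sentiment model is downloaded and used with no training step at all.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # uses the library's default checkpoint
print(classifier("The modern data stack keeps getting easier to work with."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99}]
```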

More
