Data strategy has long been a core element of how businesses operate—but it still manages to burden and even bedevil companies today. Businesses leverage more and more technology each year, and each of these technologies provides rich opportunities to collect and take action on new data. The need for a comprehensive and forward-looking data strategy is clear. When you research data strategies, you are presented with a myriad of options, from data warehouse and data lake to data fabric and data mesh. So how do you know which option will work best for your business and your plans?
Historically, when you think about enterprise data, words like “agile” and “flexible” don’t come to mind. It has never been easier for data collection to spiral out of control, with so many disparate systems available to collect copious amounts of data, compounded by the relative ease of adding new systems to your business ecosystem.
These issues of scale lead to a lack of data cohesion, and by extension, the business processes and business intelligence that rely on this data suffer. This is where your data strategy can help.
Strong data strategies can be game-changing for businesses. They enable you to wrangle sprawling data landscapes across modern and legacy systems efficiently and effectively.
Your data strategy sets the foundation that your analytics and workflows will be built on for years to come.
To inventory some of the current options for data strategy, let’s take a closer look at data warehouse, data lake, data fabric, and data mesh.
The oldest (and still popular) way businesses try to consolidate data is by regularly lifting data out of each system it lives in and loading it all into one new system. That new system is, in effect, the data warehouse or data lake. So what is the difference between a data warehouse and a data lake? Well, before you can understand that, it helps to know the difference between ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform). Each term describes a process of migrating data and preparing it for use, so let’s briefly define the three steps:

- Extract: Pull the data out of the source system in its native format.
- Transform: Clean, convert, and restructure the data to fit the schema the target requires.
- Load: Write the data into the target system, whether warehouse or lake.
Both data warehouses and data lakes start with extraction, but that is where their processes diverge. A data warehouse leverages a defined structure, so the different data entities and relationships are codified directly in the data warehouse. For that reason, the extracted data from the source system needs to be transformed and processed so that it can be loaded into this structured format. A benefit of this structure is that activating the data is more streamlined, since all the work has already been done to mold the data into a usable format.
In contrast, a data lake skips right to loading the raw source system data into the data lake. There is no requirement to define structure or relationships between data when you load it. This inherently makes data lakes more flexible, since there is much less pre-work involved in getting new data into the lake. However, that work is not gone forever. The onus is now on data engineers to build sophisticated data pipelines that pull the disjointed data out of the data lake and transform it into a format that can be used by the business.
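To make the contrast concrete, here is a minimal sketch of the two flows in Python. The source rows, column names, and the in-memory warehouse and lake are hypothetical stand-ins, not any particular product’s API.

```python
# Hypothetical stand-ins for a source system, a warehouse, and a lake.
source_rows = [
    {"cust_id": "17", "signup": "2023-06-01", "spend": "1200.50"},
    {"cust_id": "18", "signup": "2023-06-03", "spend": "80.00"},
]

warehouse = []  # structured store: schema is enforced before load
lake = []       # raw store: data lands exactly as extracted


def transform(row):
    """Shape a raw source row into the warehouse's defined schema."""
    return {
        "customer_id": int(row["cust_id"]),
        "signup_date": row["signup"],          # already ISO 8601
        "lifetime_spend": float(row["spend"]),
    }


def etl_to_warehouse(rows):
    """ETL: extract, transform into the target schema, then load."""
    for row in rows:                      # extract
        warehouse.append(transform(row))  # transform, then load


def elt_to_lake(rows):
    """ELT: extract and load the raw data; transform later, per use case."""
    lake.extend(rows)  # extract and load, with no up-front schema work


etl_to_warehouse(source_rows)
elt_to_lake(source_rows)

# Downstream, a data engineer builds a pipeline over the lake on demand:
high_spenders = [transform(r) for r in lake if float(r["spend"]) > 100]
```

The warehouse pays the transformation cost once, up front; the lake defers it to every pipeline that later reads the raw data.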
Data warehouses work well with defined, orderly business information. Typically, this information is already structured in concept, so the project becomes engineering that conceptual model into the data warehouse, along with the processes that transform and load the source data.
Data lakes work better for housing data that may have unclear business potential or relationships or is at a scale where not all of the data would be useful for analysis. In those cases, businesses opt to just get the data into the data lake and have it available for data engineers to later build a pipeline that can produce a usable format for a given use case.
Several challenges arise with both solutions. Both introduce operational overhead with added development, maintenance, and upkeep. We may have gotten the data out of the siloed systems, but in order to do so, we had to engineer data structures and transformations to neatly warehouse the data. Or alternatively, we had to engineer sophisticated data pipelines to take loosely structured data and process it into a usable format.
Another drawback of this strategy is that it introduces a new source-of-truth system, one abstracted away from the originating data by complex transformation logic. That distance puts data integrity at risk.
Lastly, with data warehouses and lakes, you commonly have to forsake access to real-time data, given the complexity of transforming and transferring the data. As you scale your business and your systems, the complexity, technical debt, and risk of failure these data strategies pose will only become more of a problem.
Instead of lifting the data out of source systems and storing it somewhere else, why not just connect to the sources of data directly? Well, that’s easier said than done. Your ERP and CRM systems may have a great deal of conceptual overlap, but often they are supported by different technologies and have no native way of connecting their data structures.
In the past, businesses would go all in on a single technology vendor just to address this connected data gap, which inevitably meant making certain sacrifices. And even with everything under one provider, the data still fails to connect to the other modern and legacy systems in your enterprise.
This is where strategies like data fabric and data mesh come in and deliver value. Data fabric and data mesh are architectural approaches that allow you to keep data in your source systems, access it in real time, and connect it across different systems. Both strategies have similarities, but also important differences.
Data mesh as a concept came about with recent revolutions in software architecture. The industry has trended toward breaking up monolithic services into independent microservices, which can make development more agile. However, this introduced a need to orchestrate, manage, and connect information and actions across microservices. Creating API integrations between these microservices lets them stay connected and work together. Scaling this concept up to the enterprise, entire systems can be integrated with one another to achieve an enterprise data mesh.
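To give a rough sense of what those point-to-point integrations involve, here is a sketch that stitches customer records from a CRM’s REST API together with order data from an ERP’s API. The endpoints, field names, and authentication schemes below are invented for illustration; each real pairing of systems needs its own version of this glue code.

```python
import requests  # third-party HTTP client (pip install requests)

CRM_API = "https://crm.example.com/api/v2"  # hypothetical endpoints
ERP_API = "https://erp.example.com/rest"


def fetch_crm_customers(token):
    """Call the CRM's (hypothetical) customer endpoint."""
    resp = requests.get(
        f"{CRM_API}/customers",
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()


def fetch_erp_orders(api_key, customer_id):
    """Call the ERP's (hypothetical) orders endpoint, which authenticates
    and paginates completely differently from the CRM."""
    resp = requests.get(
        f"{ERP_API}/orders",
        params={"customerId": customer_id},
        headers={"X-Api-Key": api_key},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["items"]


def customers_with_orders(token, api_key):
    """Hand-stitch the two systems together for one access pattern."""
    joined = []
    for customer in fetch_crm_customers(token):
        orders = fetch_erp_orders(api_key, customer["id"])
        joined.append({**customer, "orders": orders})
    return joined
```

Every pair of systems, and every new access pattern, gets its own version of this code.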
The catch with the data mesh approach is twofold. One, you are trading sophisticated data engineering work for sophisticated software engineering work. To implement and leverage these APIs, you need to have the right skills, the right information about how the integrations work, and the right tools for each integration. Despite the effectiveness of the data mesh architecture, only specialists can make use of it. In other words, data mesh is a high-code approach requiring developer expertise and time.
The second catch with data mesh relates to centralized governance. With data warehouses and data lakes, you can get a full view of your replicated data landscape in one system. With a data mesh, the API integrations are distributed across systems, so you can only see the patterns that people have already built with it.
Data fabric offers compelling ways to overcome both of these challenges.
A data fabric includes a virtualization layer, a concept that you may also see referred to as a “logical data warehouse.” This means that with a data fabric, disparate system data is virtualized into a centralized platform, complete with the ability to connect, relate, and extend data. You might also think of data fabric as an abstraction layer for managing your data. A key point to remember about using a data fabric: the data stays where it is; it does not move out of the source systems.
With data fabric, we do not need to hook into the system-to-system API calls directly in order to access data—the APIs are abstracted away. This abstraction lets us take advantage of the data in different systems without needing to know what the source system is or how to connect to it. The data may be on-premises or it may be in a cloud service like AWS as part of your hybrid cloud strategy.
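As a loose sketch of that abstraction (the class and connector names here are invented, not any vendor’s API), a virtualized entity can map a unified schema onto live connectors for each source system, so a consumer queries the entity without knowing where the rows actually live:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Connector:
    """Knows how to fetch rows from one source system, in place, and map
    them into the shared virtualized schema. (Illustrative only.)"""
    name: str
    fetch: Callable[[Dict], List[dict]]  # filters -> rows in unified schema


class VirtualEntity:
    """A unified view (say, 'Customer') stitched over many live sources."""

    def __init__(self, connectors: List[Connector]):
        self.connectors = connectors

    def query(self, **filters) -> List[dict]:
        # Federate the query: each source is read in place, in real time.
        # Nothing is replicated into a central store.
        rows: List[dict] = []
        for connector in self.connectors:
            rows.extend(connector.fetch(filters))
        return rows


# The consumer never sees whether a row came from the on-premises ERP or
# the cloud CRM; the fabric resolves that behind the scenes.
customers = VirtualEntity([
    Connector("erp_on_prem", fetch=lambda f: []),  # stub fetchers for the sketch
    Connector("crm_cloud", fetch=lambda f: []),
])
active_customers = customers.query(status="active")
```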
Whereas data mesh requires software specialists, data fabric opens data modeling to the line-of-business people on your teams, not just developers. Non-technical employees can use low-code tools to do data modeling work themselves, which leads to increased speed and agility.
What about the governance challenges? As noted earlier, data mesh poses challenges related to observability and maintenance because of its distributed nature. In contrast, data fabric is centralized. With data fabric keeping all your data in one virtualized data model, you get a complete, unified view of all your different systems. Even if certain patterns have not been used before, relating the data in the virtualized model allows for new modes of data access to be implemented easily and in a governable way.
Here’s a hint for discussing data fabric with others: The term “data fabric” can refer either to the architecture layer where the data virtualization happens or to the toolset that you use there.
The use of a data virtualization layer is a strong value add on its own. But this value increases greatly when you marry your virtualized data model with your business applications on a process automation platform with low-code capabilities and record-level security. For example, using low-code security rules, you can reference data in your CRM to enforce whether specific rows of data from your ERP should be accessible. You can also calculate custom data fields, like SLAs, by referencing customer data and case data, even if they aren’t located in the same system. Features like these allow you to maximize your business potential without forsaking your existing systems or technologies. This approach also builds in flexibility for the future.
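As a hedged sketch of what such cross-system rules might look like (the entities, fields, SLA tiers, and stub CRM below are invented for illustration, not a specific platform’s syntax):

```python
from dataclasses import dataclass, field


@dataclass
class User:
    id: str


@dataclass
class CrmStub:
    """Stand-in for a live CRM connection. (Illustrative only.)"""
    accounts: dict = field(default_factory=dict)

    def get_account(self, customer_id):
        return self.accounts[customer_id]


def can_view_erp_row(user, erp_row, crm):
    """Record-level security rule: gate an ERP row on CRM data. Only
    managers assigned to the account in the CRM may see that
    customer's rows from the ERP."""
    account = crm.get_account(erp_row["customer_id"])
    return user.id in account["assigned_manager_ids"]


def sla_hours_remaining(case_row, crm, elapsed_hours):
    """Computed field spanning two systems: the SLA tier lives in the
    CRM, while the case clock lives in the case system."""
    tier = crm.get_account(case_row["customer_id"])["support_tier"]
    budget = {"gold": 4, "silver": 24, "bronze": 72}[tier]
    return budget - elapsed_hours


crm = CrmStub({"C-1": {"assigned_manager_ids": {"u7"}, "support_tier": "gold"}})
print(can_view_erp_row(User("u7"), {"customer_id": "C-1"}, crm))  # True
print(sla_hours_remaining({"customer_id": "C-1"}, crm, 1.5))      # 2.5
```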
When it comes to enterprise systems, change is not only constant, but accelerating. As organizations apply digital transformation to more and more facets of their businesses, the technology strategies they use will need to be more flexible, scalable, and maintainable than ever before. Agility and speed are competitive mandates.
While data warehouses, data lakes, and data meshes have served well in the past, data fabric will be what carries companies into the future. By combining virtualized data, business applications, and low-code data modeling into a single platform, companies can turn their technology landscape into a differentiator rather than a burden.