This article was contributed by Kumar Goswami, CEO and cofounder of Komprise.
For decades, managing data essentially meant collecting, storing, and occasionally accessing it. That has changed in recent years, as businesses look for the critical information that can be pulled from the massive amounts of data being generated, accessed, and stored in myriad locations, from corporate datacenters to the cloud and the edge. As a result, data analytics, helped by modern technologies such as artificial intelligence (AI) and machine learning, has become a must-have capability, and its importance will only grow in 2022. Enterprises need to rapidly parse through data, much of it unstructured, to find the information that will drive business decisions. They also need to create a modern data environment in which to make that happen.
Below are a few trends in data management that will come to the fore in 2022.
Data lakes get more organized, but the unstructured data gap still exists
There are two approaches to enterprise data analytics. The first is taking data from business applications such as CRM and ERP and importing it into a data warehouse to feed BI tools. Now those data warehouses are moving to the cloud, with technologies like Snowflake. This approach is well understood, as the data has a consistent schema.
The second approach is to take any raw data and import it directly into a data lake, without requiring any pre-processing. This is appealing because any type of data can be funneled into a data lake, which is why Amazon S3 has become a massive data lake. The trouble is that some data is easier to process than other data. For instance, log files, genomics data, audio, video, image files and the like don’t fit neatly into data warehouses because they lack a consistent structure, which makes it hard to search across the data. Because of this, data lakes end up becoming data swamps, where it is too hard to search, extract and analyze what you need.
A major trend that will continue in 2022 is the emergence of data lakehouses, made popular by Databricks, which create data lakes from semi-structured data that does have some semantic consistency. For example, an Excel file is like a database even though it isn’t one, so data lakehouses leverage the consistent schema of semi-structured data. While this works for .csv files, Parquet files, and other semi-structured data, it still does not address the problem of unstructured data, since this data has no obvious common structure. You need some way of indexing and inferring a common structure for unstructured data, so it can be optimized for data analytics. This optimization of unstructured data for analytics is a big area for innovation, especially since at least 80% of the world’s data today is unstructured.
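As a rough illustration of what indexing unstructured data can mean at its simplest, the sketch below walks a directory tree and records the same few metadata fields for every file, whatever its format, so that the collection becomes searchable. It is a minimal sketch, not a description of any vendor's product: the `FileRecord` fields and the `build_index` and `find` helpers are hypothetical names chosen for this example, and a real system would go further by extracting content-level tags.

```python
from dataclasses import dataclass
from pathlib import Path
import mimetypes


@dataclass(frozen=True)
class FileRecord:
    """A uniform metadata row for one file, regardless of file type (hypothetical schema)."""
    path: str
    extension: str
    mime_type: str
    size_bytes: int
    modified_epoch: float


def build_index(root):
    """Walk `root` recursively and return one FileRecord per regular file."""
    records = []
    for p in sorted(Path(root).rglob("*")):
        if not p.is_file():
            continue
        stat = p.stat()
        # Guess the MIME type from the file name only; contents are not inspected.
        mime, _ = mimetypes.guess_type(p.name)
        records.append(
            FileRecord(
                path=str(p),
                extension=p.suffix.lower(),
                mime_type=mime or "unknown",
                size_bytes=stat.st_size,
                modified_epoch=stat.st_mtime,
            )
        )
    return records


def find(records, extension=None, min_size=0):
    """Filter the index by extension and minimum size in bytes."""
    return [
        r
        for r in records
        if (extension is None or r.extension == extension)
        and r.size_bytes >= min_size
    ]
```

Once every file has the same small set of fields, a query such as `find(index, extension=".wav", min_size=1_000_000)` works across logs, audio, and images alike, which is the kind of common structure the paragraph above calls for.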
Citizen science will be an influential, related 2022 trend
In an effort to democratize data science, cloud providers will be developing and releasing more machine learning applications and other building-block tools such as domain-specific machine learning workflows. This is a significant trend because, over time, the amount of code individuals need to write will decrease. This will open up machine learning to many more job roles: some of these citizen scientists will be within central IT, and some will live within lines of business. Amazon SageMaker Canvas is just one example of the low-code/no-code tools that we’re going to see more of in 2022. Citizen science is still nascent, but it is where the market is heading. Data platforms and data management solutions that provide consumer-like simplicity for users to search, extract and use data will gain prominence.
‘Right data’ analytics will surpass Big Data analytics as a key 2022 trend
Big Data is almost too big and is creating data swamps that are hard to leverage. Precisely finding the right data in place, no matter where it was created, and ingesting it for data analytics is a game-changer because it saves time and manual effort while delivering more relevant analysis. So, instead of Big Data, a new trend will be the development of so-called “right data” analytics.
Data analytics ‘in place’ will dominate
Some prognosticators say that the cloud data lake will be the ultimate place where data will be collected and processed for different research activities. While cloud data lakes will assuredly gain traction, data is piling up everywhere: on the edge, in the cloud, and in on-premises storage. In some cases, this calls for processing and analyzing data where it is rather than moving it to a central location, because doing so is faster and cheaper. How can you not only search for data at the edge, but also process a lot of it locally, before even sending it to the cloud? You might still use cloud-based analytics tools for larger, more complex projects. We will see more “edge clouds”, where the compute comes to the edge of the datacenter instead of the data going to the cloud.
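One way to picture processing data where it lives is a small routine that runs at the edge, reduces raw readings to a compact summary, and forwards only that summary. The sketch below is illustrative only: the `summarize_readings` function, the threshold, and the shape of the summary are assumptions made for this example rather than part of any specific platform.

```python
from statistics import mean


def summarize_readings(readings, threshold):
    """Reduce raw edge readings to a small summary suitable for upload.

    Only the count, mean, maximum, and any readings above `threshold`
    are kept; the raw readings themselves stay on the edge device.
    """
    if not readings:
        raise ValueError("readings must not be empty")
    return {
        "count": len(readings),
        "mean": mean(readings),
        "max": max(readings),
        "anomalies": [r for r in readings if r > threshold],
    }
```

For example, `summarize_readings([1.0, 2.0, 9.0], threshold=5.0)` yields a four-field summary with a single anomaly, `9.0`. With thousands of readings per device, sending the summary instead of the raw data is what makes local processing cheaper than centralizing everything first.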
Storage-agnostic data management will become a critical component of the modern data fabric
A data fabric is an architecture that provides visibility of data and the ability to move, replicate and access data across hybrid storage and cloud resources. Through near real-time analytics, it puts data owners in control of where their data lives across clouds and storage so that data can reside in the right place at the right time. IT and storage managers will choose data fabric architectures to unlock data from storage and enable data-centric vs. storage-centric management. For example, instead of storing all medical images on the same NAS, storage pros can use analytics and user feedback to segment these files, such as by copying medical images for access by machine learning in a clinical study or moving critical data to immutable cloud storage to defend against ransomware.
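The medical-imaging example can be read as a set of placement rules driven by file metadata rather than by which storage system a file happens to sit on. Below is a minimal sketch of that idea, assuming files already carry tags produced by analytics or user feedback. The tag names (`critical`, `clinical_study`) and the action labels are hypothetical placeholders, not the vocabulary of any real data fabric product.

```python
from dataclasses import dataclass, field


@dataclass
class FileInfo:
    """A file plus the tags assigned to it by analytics or users (hypothetical)."""
    path: str
    tags: set = field(default_factory=set)


def plan_action(f):
    """Pick a placement action for one file based on its tags alone."""
    if "critical" in f.tags:
        # Protect against ransomware by moving to immutable storage.
        return "move_to_immutable_cloud"
    if "clinical_study" in f.tags:
        # Make a copy available to machine learning without touching the original.
        return "copy_to_ml_workspace"
    return "leave_in_place"


def plan_all(files):
    """Group file paths by the action chosen for them."""
    plan = {}
    for f in files:
        plan.setdefault(plan_action(f), []).append(f.path)
    return plan
```

The rules never mention a NAS, a bucket, or a vendor. In this sketch, that separation between what the data is and where it is stored is what data-centric management amounts to.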
Multicloud will evolve with different data strategies
Many organizations today have a hybrid cloud environment in which the bulk of data is stored and backed up in private datacenters across multiple vendor systems. As unstructured (file) data has grown exponentially, the cloud is being used as a secondary or tertiary storage tier. It can be difficult to see across the silos to manage costs, ensure performance and manage risk. As a result, IT leaders realize that extracting value from data across clouds and on-premises environments is a formidable challenge. Multicloud strategies work best when organizations use different clouds for different use cases and data sets. However, this brings about another issue: if you later need to move data from one cloud to another, doing so is very expensive. A newer concept is to pull compute toward data that lives in one place. That central place could be a colocation center with direct links to cloud providers. Multicloud will evolve with different strategies: sometimes compute comes to your data, and sometimes the data resides in multiple clouds.
Enterprises continue to come under increasing pressure to adopt data management strategies that will enable them to derive useful information from the data tsunami to drive critical business decisions. Analytics will be central to this effort, as well as creating open and standards-based data fabrics that enable organizations to bring all this data under control for analysis and action.
Kumar Goswami is the CEO and cofounder of Komprise.