The new metrics for unstructured data management

The rate of data growth worldwide in the past few years has been greater than in the previous two decades. Data is predicted to more than double again over the next few years — reaching 175 zettabytes in 2025, according to IDC.

Most of this data is not structured: it includes documents, video, images, instrument and sensor data, text, chats and more. Unstructured data is harder to find, move and manage because it doesn’t live in rows and columns in a database but is dispersed across countless applications and storage locations inside and outside the enterprise.

The explosion of data and the diversity of data types today bring a host of new challenges for enterprise IT departments and data storage professionals: escalating storage and backup costs, management complexity, security risks, and an opportunity gap caused by poor visibility into the data itself.

To solve these issues, we need new, smart analytics and metrics. These must go beyond legacy storage metrics to focus on understanding data and on involving application owners, departments and business stakeholders in data management decisions. They should also include measures that track and improve energy consumption to meet broader sustainability goals, which are becoming critical in this age of cyclical energy shortages and climate change.

First, let’s review what storage metrics IT departments have traditionally tracked:

Legacy storage IT metrics

Over the last 20-plus years, IT professionals in charge of data storage have tracked a few key metrics, primarily related to hardware performance. These include:

  • Latency, IOPS and network throughput
  • Uptime and downtime per year
  • RPO: Recovery point objective (the maximum amount of data loss, measured in time, that an organization can tolerate; a short worked example follows this list)
  • RTO: Recovery time objective (the time required to restore services after downtime)
  • Backup window: Average time to perform a backup
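As a back-of-the-envelope illustration of the uptime and RPO figures above, the Python sketch below converts an availability target into an annual downtime budget and treats the backup interval as a rough proxy for worst-case data loss. The targets and the nightly backup interval are illustrative assumptions, not recommendations.

```python
# Hypothetical worked example of the uptime and RPO arithmetic above.
# The availability targets and backup interval are illustrative, not benchmarks.

HOURS_PER_YEAR = 365 * 24


def allowed_downtime_hours(availability_pct: float) -> float:
    """Convert an uptime target (e.g. 99.9%) into a yearly downtime budget in hours."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def worst_case_data_loss_hours(backup_interval_hours: float) -> float:
    """With periodic backups, the worst-case data loss (RPO exposure)
    is roughly one full backup interval."""
    return backup_interval_hours


if __name__ == "__main__":
    print(f"99.9% uptime  -> {allowed_downtime_hours(99.9):.2f} h of downtime per year")
    print(f"99.99% uptime -> {allowed_downtime_hours(99.99):.2f} h of downtime per year")
    print(f"Nightly backups -> up to {worst_case_data_loss_hours(24):.0f} h of data at risk (RPO)")
```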

The new metrics: Data-centric versus storage-centric 

The traditional IT infrastructure metrics above are table stakes today for any enterprise IT organization. In today’s world, where data is at the center of decisions, there is a host of new data-centric measures to understand and report. Departments and business unit leaders are increasingly responsible for monitoring their own data usage, and often for paying for it. Discussions with IT can become contentious: IT is trying to conserve spend and free up capacity, while business leaders are uneasy about archiving or deleting their own data. The following metrics help bridge the gap:

  • Top data owners/users: This can show trends in usage and indicate any policy violations, such as individual users storing excessive video files or PII files being stored in the wrong directory. 
  • Common file types: A research team collecting data from certain applications or instruments may not know how much they have or where it’s all stored. The ability to see data by file extension can inform future research initiatives. This could be as simple as finding all the log files, trace files or extracts from a given application or instrument and taking action on them. 
  • Storage costs for chargeback or showback: Even if a department doesn’t participate in a chargeback model, stakeholders should understand costs and be able to drill down into metrics. This will enable them to identify areas where low-cost storage or data tiering to archival storage can be applied to reduce spend.
  • Data growth rates: Overall trending information keeps IT and business heads on the same page so they can collaborate on new ways to manage explosive data volumes. Stakeholders can drill down into which groups and projects are growing data the fastest and ensure that data creation/storage is appropriate according to its overall business priority.
  • Age of data and access patterns: Most organizations have a large percentage of “cold data” that hasn’t been accessed in a year or more. Metrics showing the percentage of cold versus warm versus hot data are critical to ensure that data lives in the right place at the right time according to its business value (a minimal sketch of gathering these measures follows this list).
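As a minimal sketch of how these data-centric measures might be gathered, the Python script below walks a single file tree and totals capacity by owner, file extension, and hot/warm/cold age bucket. The root path, the 30-day and one-year cut-offs, and the use of POSIX file ownership (the pwd module) are all assumptions; a production tool would index many storage systems and clouds rather than one mount point.

```python
# Minimal sketch: aggregate capacity by owner, file type and age bucket.
# Paths and thresholds are assumptions for illustration only.
import os
import pwd  # POSIX-only; maps file owner UIDs to user names
import time
from collections import defaultdict

HOT_DAYS, WARM_DAYS = 30, 365  # assumed hot/warm/cold cut-offs


def scan(root: str):
    """Walk one file tree and total bytes by owner, extension and age bucket."""
    now = time.time()
    by_owner = defaultdict(int)
    by_ext = defaultdict(int)
    by_temperature = defaultdict(int)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # skip unreadable or vanished files
            age_days = (now - st.st_atime) / 86400
            bucket = ("hot" if age_days < HOT_DAYS
                      else "warm" if age_days < WARM_DAYS else "cold")
            try:
                owner = pwd.getpwuid(st.st_uid).pw_name
            except KeyError:
                owner = str(st.st_uid)  # orphaned UID with no local account
            ext = os.path.splitext(name)[1].lower() or "<none>"
            by_owner[owner] += st.st_size
            by_ext[ext] += st.st_size
            by_temperature[bucket] += st.st_size
    return by_owner, by_ext, by_temperature


if __name__ == "__main__":
    owners, exts, temps = scan("/data/projects")  # hypothetical mount point
    top_owners = sorted(owners.items(), key=lambda kv: kv[1], reverse=True)[:5]
    print("Top data owners (bytes):", top_owners)
    print("Bytes by file type:", dict(exts))
    print("Hot/warm/cold split (bytes):", dict(temps))
```

Note that last-access times can be unreliable on filesystems mounted with noatime, so access-pattern metrics gathered this way should be treated as approximate.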

Visibility into data-centric metrics, not just storage-centric ones, helps IT and departments make better decisions together. However, these metrics have historically been difficult to gather because of the prevalence of data silos in enterprises, with data spread across many applications and storage environments, from on-premises to the edge and cloud.

Gathering these metrics requires a way to find and index data across vendor boundaries, including cloud providers, from a single pane of glass. Collating data across all of your storage providers manually is possible, but it is labor-intensive and error-prone. Independent data management solutions can help achieve these deeper and broader analytics goals.

The new metrics: Sustainable data management

The global energy crisis, worsened by the war in Ukraine and the surge in demand from the post-pandemic economic recovery, is fueling corporate sustainability programs as well as investment in new green technologies worldwide. Managing data responsibly is no small part of this overall initiative. Most organizations have hundreds of terabytes of data that could be deleted but is hidden or not understood well enough to manage appropriately. Storing rarely used and zombie data on top-performing Tier 1 storage (whether on-premises or in the cloud) is not only expensive but also consumes the most energy.

Data centers must reduce their climate footprints if we are to mitigate climate change. The sustainability-related data management metrics below can help measure and reduce the energy consumed by data storage:

  • Last access time and creation time: Data access and age metrics can inform decisions about moving data to a lower-carbon storage location such as cloud object storage.
  • Duplicate data reduced: Deleting data that is not needed naturally lowers the storage footprint and energy usage. Often, especially in research organizations, datasets are replicated for different experiments and tests but never deleted (see the sketch after this list).
  • Data stored by vendor: Legacy storage technology (RAID, SAN, tape) is generally more wasteful, which is why SSD and all-flash storage have been growing quickly. Newer storage technologies are much faster and more efficient than spinning disks, reducing power consumption. Understanding the percentage of data stored on legacy solutions is a starting point for deciding how and when to upgrade to more modern technology, including cloud storage.
  • Easability: This is a measurement of the effort required to perform a function. Any technology that is easier and more efficient to manage is greener. It requires less manpower and fewer data center resources, and has more features for automation. For instance, one storage architect can now manage 50PB of data and up, versus 8PB or less using older technologies.
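The sketch below, again with assumed paths and thresholds, illustrates two of these measures: it groups files by content hash to surface exact duplicates and flags anything untouched for more than a year as a candidate for a colder, lower-energy storage tier.

```python
# Minimal sketch: find exact duplicates and long-untouched "cold" files.
# The root path and the one-year threshold are assumptions for illustration.
import hashlib
import os
import time
from collections import defaultdict

COLD_AFTER_DAYS = 365  # assumed threshold for archive/tiering candidates


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file's contents in chunks so large files don't exhaust memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def audit(root: str):
    """Group files by content hash and flag files not accessed in a year."""
    now = time.time()
    paths_by_hash = defaultdict(list)
    cold_candidates = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
                paths_by_hash[sha256_of(path)].append(path)
            except OSError:
                continue  # skip unreadable or vanished files
            if (now - st.st_atime) / 86400 > COLD_AFTER_DAYS:
                cold_candidates.append(path)
    duplicates = {h: ps for h, ps in paths_by_hash.items() if len(ps) > 1}
    return duplicates, cold_candidates


if __name__ == "__main__":
    dupes, cold = audit("/data/research")  # hypothetical path
    dup_bytes = sum(
        os.path.getsize(p) for paths in dupes.values() for p in paths[1:]
    )
    print(f"{len(dupes)} duplicate groups, ~{dup_bytes / 1e9:.1f} GB reclaimable")
    print(f"{len(cold)} files untouched for over a year (archive candidates)")
```

Hashing every file is expensive at scale; a common refinement is to group files by size first and only hash the size collisions.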

The new data management

It is true: Investing in new initiatives to expand metrics programs requires time, resources and money. So why do it?

For one thing, having better and more extensive metrics on data can inform cost-effective and sustainable data management decisions — easily cutting spending and energy usage by 50% or more.

But there is more: Your users (data consumers) will also benefit from having detailed insights into their data. Understanding data and being able to quickly search on data characteristics such as file type or metadata tags (like a project keyword) can drastically reduce the amount of time spent searching for data. An estimated 80% of the time spent conducting AI and data mining projects is spent finding the right data and moving it to the right place.

In critical sectors like healthcare, agriculture, government, utilities and manufacturing, there is always a need for faster insights to solve hard problems like creating a new treatment for a chronic condition; improving electric car batteries or wind turbine propulsion; or adjusting soil nutrients to produce a larger yield of crops.

In today’s data-driven economy, basic storage metrics are no longer enough to be competitive and meet vital marketplace and operational goals.

Randy Hopkins is VP of global systems engineering and enablement at Komprise.

