Metadata is a concept that has mostly avoided the “data limelight” surrounding Big Data, Predictive Analytics, and so on. It’s easy to understand why: it isn’t sexy, and people aren’t exactly solving all of the world’s ills through clever application of metadata. Why, then, should you give metadata a second thought? Maybe we’re just old-fashioned, but we always stress to our clients that getting the foundational aspects of data right is what enables all the flashy, sexy, newsworthy stuff to work. Metadata is one of several foundational practices that enable and strengthen a data strategy.
At its most basic, metadata is “data about data.” Often metadata is what makes unstructured data searchable, sortable, or comparable. Consider the photos on your phone. Now think about all the data about them that aren’t part of the photograph: geo-location, date taken, identified faces, device and setting information. This information helps us categorize and contextualize the photographs. The same is true for enterprise metadata describing data in other systems: author, date created, date edited, source system or other lineage information, data type, field length, etc.
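To make “data about data” concrete, here is a minimal Python sketch of a metadata record built from the attributes listed above. The class name `FieldMetadata`, the helper functions, and every catalog entry are hypothetical and chosen only for illustration; they are not taken from any particular tool.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FieldMetadata:
    """Metadata describing one field in a source system (illustrative only)."""
    name: str
    source_system: str
    data_type: str
    field_length: int
    author: str
    date_created: date
    date_edited: date


# Hypothetical catalog entries; these systems and fields are made up.
CATALOG = [
    FieldMetadata("customer_email", "crm", "varchar", 255, "asmith",
                  date(2019, 3, 4), date(2023, 6, 1)),
    FieldMetadata("order_total", "billing", "decimal", 12, "bjones",
                  date(2020, 7, 19), date(2021, 1, 8)),
    FieldMetadata("customer_phone", "crm", "varchar", 20, "asmith",
                  date(2019, 3, 4), date(2022, 11, 30)),
]


def from_source(catalog, source_system):
    """Search: keep only the fields sourced from the given system."""
    return [f for f in catalog if f.source_system == source_system]


def newest_edits_first(catalog):
    """Sort: order fields by last edit date, most recent first."""
    return sorted(catalog, key=lambda f: f.date_edited, reverse=True)


crm_fields = [f.name for f in from_source(CATALOG, "crm")]
edit_order = [f.name for f in newest_edits_first(CATALOG)]
```

None of the search or sort logic touches the underlying data itself. It works entirely on the descriptive attributes, which is the same reason a phone can group photos by place or date without inspecting the pixels.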
Providing contextual information around data to data analysts, data architects, and DBAs can significantly reduce the time spent on impact analysis. Consider a situation where a change order comes through to edit a column in a table. With a well-curated metadata paradigm, the data lineage can be reviewed instantly: What system is that data sourced from? What ETL processes change or massage that data? What uses the data downstream? That means time saved and added protection from unforeseen impacts.
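One way to picture a lineage review is as a walk over a graph of “feeds into” relationships. The Python sketch below assumes lineage is stored as a simple dictionary mapping each asset to the assets it feeds; the asset names, the `LINEAGE` structure, and both functions are hypothetical, and a real metadata tool would likely store and query this differently.

```python
from collections import deque

# Hypothetical lineage: each key feeds the assets in its list.
LINEAGE = {
    "crm.customers.email": ["etl.clean_customers"],
    "etl.clean_customers": ["warehouse.dim_customer.email"],
    "warehouse.dim_customer.email": [
        "report.churn_dashboard",
        "etl.build_marketing_list",
    ],
    "etl.build_marketing_list": ["mart.marketing_list.email"],
}


def downstream(lineage, start):
    """Return everything reachable from `start`, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:  # guards against cycles and repeats
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def upstream(lineage, target):
    """Return everything that feeds `target`, by walking inverted edges."""
    inverted = {}
    for parent, children in lineage.items():
        for child in children:
            inverted.setdefault(child, []).append(parent)
    return downstream(inverted, target)


column = "warehouse.dim_customer.email"
impacted = downstream(LINEAGE, column)  # what uses the data downstream?
sources = upstream(LINEAGE, column)     # where is the data sourced from?
```

Here `sources` answers the first two questions (the source system and the ETL step in between), while `impacted` lists what a change to the column could break.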
Context in less structured environments is important too. If you’ve got a data lake, performing ad hoc analysis will take much less time with curated metadata. A data scientist who can spend less time crunching data is a happy data scientist.
Technical metadata offers several key benefits:
- Providing context for important data
- Tracking data lineage for impact analysis
- Limiting redundant data rework
- Simplifying integrations
- Assisting analysts in finding information
There are several paradigms for building a metadata architecture, but most share many common components. These components include:
A metadata sourcing and integration layer. This component is frequently a combination of automated and user-generated metadata, and may be achieved via a specialized application, ETL processes, or other methods. The output of this layer is the creation, sourcing, and integration of the metadata.
A metadata repository. The repository is responsible for storing the metadata. The two major paradigms for repositories are centralized and de-centralized. Frequently, a de-centralized metadata repository will act more like a “registry” where it only serves to track the location of metadata (which would in turn be managed in other systems). A centralized repository will store integrated metadata, with more clearly defined relationships.
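The difference between the two repository styles can be sketched in a few lines of Python. Both classes below are hypothetical illustrations, not the API of any real product: the centralized version keeps a copy of the metadata, while the registry version only records how to reach the system that owns it.

```python
class CentralizedRepository:
    """Stores integrated metadata itself (illustrative sketch)."""

    def __init__(self):
        self._records = {}

    def put(self, asset, metadata):
        # Keep our own copy so later changes at the source do not leak in.
        self._records[asset] = dict(metadata)

    def get(self, asset):
        return self._records[asset]


class MetadataRegistry:
    """Tracks only where metadata lives; the owning system manages it."""

    def __init__(self):
        self._locations = {}

    def register(self, asset, fetch):
        # `fetch` is a callable standing in for a call to the owning system.
        self._locations[asset] = fetch

    def get(self, asset):
        return self._locations[asset]()


# A dict standing in for metadata managed inside a hypothetical CRM system.
crm_metadata = {"data_type": "varchar", "field_length": 255}

central = CentralizedRepository()
central.put("crm.customers.email", crm_metadata)

registry = MetadataRegistry()
registry.register("crm.customers.email", lambda: crm_metadata)

# The owning system changes its metadata after both repositories are set up.
crm_metadata["field_length"] = 320

central_length = central.get("crm.customers.email")["field_length"]
registry_length = registry.get("crm.customers.email")["field_length"]
```

In this sketch the registry reflects the source system's change immediately, while the centralized copy stays as it was until it is re-integrated. That trade-off between freshness and a single, consistently modeled store is one thing to weigh when choosing between the two.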
A metadata management interface. This is the interface through which data stewards manage metadata, its associated business rules, and other administrative information.
A metadata delivery layer. This is the end result of the metadata architecture. It gives end users what they need to drive a decision support system and perform impact analysis.
Investing in metadata can be a difficult sell to stakeholders. The benefits, however, include less time spent on development and less downstream impact from unforeseen dependencies. If your organization struggles with complex data models, or invests far too much time integrating data in an ad hoc fashion, consider doing an assessment to understand how building a metadata management capability could benefit your organization.