What is Metadata? (Data about data)

Kashif Rabbani
16 min read · Jul 21, 2020
Photo by Clay Banks on Unsplash

Abstract

This document provides a basic review of metadata. It starts with the motivation for and definition of metadata; the definition builds on some background about maps and on the history of library catalogs. The review then widens to the big picture, focusing on three core features that metadata reflects. After that, I explain a few important everyday metadata terms, a typology of metadata standards, and the main metadata types in current use. Finally, I discuss domain-specific metadata standards in five domains, the use of metadata, and a takeaway message. Most of the concepts explained in this report are based on [3][4].

Keywords: Metadata · Metadata Standards · Metadata Typology · Metadata Terminologies · Metadata Types · Metadata Domains · Use of Metadata

1- Introduction

Metadata is a term widely used in data science nowadays, yet it is often misunderstood. Philip Bagley [1] coined the term in November 1968, but the idea behind it goes back thousands of years to the first libraries.

The first catalog, created for the Library of Alexandria around 245 BC, was called the Pinakes (Πίνακες in Ancient Greek). It was invented to solve the critical problem of quickly finding a book of interest. As an analogy, working with it resembled scrolling through a VHS tape. The attributes used in this catalog are the same as those used in today's libraries, e.g., title, genre, and author.

The second invention in library catalog development was the codex, also called the shelf list (a catalog in book form). The third and most revolutionary invention was the card catalog, introduced at the time of the French Revolution. The card catalog atomized the shelf list in two dimensions: 1) records for individual items and 2) headers/categories shared by those items. Seen this way, breaking data into records and shared categories essentially invents the spreadsheet. This two-dimensional atomization later led to the invention of databases.

1.1 Driving Towards Metadata

Let's build some basis for a technical definition of metadata. According to Alfred Korzybski (an American scholar recognized as a founder of general semantics),

The map is not the territory.

We encounter different types of maps in daily life, for example road maps (such as Google Maps), topographical maps, and nautical charts. Each type serves a specific purpose, and they are generally not interchangeable. What they have in common is that they reduce the richness and complexity of the physical world to the details needed in a specific situation. In that sense, maps serve as a language for reducing the complexity of daily life. For example, we do not need topography (information about the shape and features of land surfaces) when planning a road trip; we only need weather, traffic, and road information. The map is therefore a separate, simpler object than the territory it describes. Metadata is a map in the same sense: it is a way to simplify the complexity of an object.

When metadata does its job well, it fades into the background. As an elementary example, every piece of information we recall while retracing our steps to find lost house keys is metadata.

1.2 Defining Metadata

A short and well-known definition of metadata is “data about data.” However, “data” means something different to everyone, and the word “about” is itself unclear, so the definition needs to be made technically precise. Jeffrey Pomerantz, in his book “Metadata” (MIT Press) [4], gives the following definition:

Metadata is a Statement about a Potentially Informative Object.
– Jeffrey Pomerantz, Metadata (MIT Press)

Metadata reflects three features of an information object: its content, its context, and its structure. What does the object contain? What are the “W” aspects (such as who, when, and where) of the object's creation? And how is the object structured? Metadata answers these questions at any level of aggregation (a single item, a list, or a whole database).

1.3 Metadata Terminologies

The information object is known as the resource: to make a statement, we need a resource to say something about. What we say about the resource is known as the description. A metadata schema defines a set of rules that state what kinds of statements may be made about a resource and how to make them. The first metadata schema designed to describe any kind of resource was Dublin Core, which is covered in detail later under descriptive metadata. An element in a metadata schema is a category of statements about the resource, and a value is the piece of data assigned to an element. The most used term is the metadata record, which is the set of statements about a single resource. Two types of vocabularies are used in metadata standards: an uncontrolled vocabulary is an unbounded set of terms that may be used as values for an element, while a controlled vocabulary is an organized, finite set. Figure 1 shows the flow from resource to description.

Fig. 1: Metadata Terminologies
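To make these terms concrete, here is a minimal Python sketch of how a resource, a schema, elements, values, and a record could relate to each other. The class names, the toy schema, and the genre vocabulary are assumptions made up for illustration; they do not come from any particular metadata standard.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Element:
    """A category of statements that a schema allows, e.g. 'genre'."""
    name: str
    # None stands for an uncontrolled vocabulary: any value is accepted.
    controlled_vocabulary: Optional[frozenset] = None


@dataclass
class Schema:
    """A set of rules: which elements exist and which values they accept."""
    name: str
    elements: dict

    def allows(self, element_name: str, value: str) -> bool:
        element = self.elements.get(element_name)
        if element is None:
            return False  # the schema defines no such element
        vocabulary = element.controlled_vocabulary
        return vocabulary is None or value in vocabulary


@dataclass
class Record:
    """A metadata record: all statements made about a single resource."""
    resource: str
    schema: Schema
    statements: list = field(default_factory=list)

    def describe(self, element_name: str, value: str) -> None:
        if not self.schema.allows(element_name, value):
            raise ValueError(
                f"{element_name}={value!r} is not allowed by {self.schema.name}"
            )
        self.statements.append((element_name, value))


# Hypothetical schema: 'title' is uncontrolled, 'genre' is controlled.
toy_schema = Schema("toy-library-schema", {
    "title": Element("title"),
    "genre": Element("genre", frozenset({"fiction", "poetry", "history"})),
})

record = Record("book-001", toy_schema)
record.describe("title", "The Odyssey")
record.describe("genre", "poetry")
```

Describing the same resource with a genre outside the controlled vocabulary, or with an element the schema does not define, raises a `ValueError` in this sketch, which is the practical difference between a controlled and an uncontrolled vocabulary.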

2- Metadata Standards

There are hundreds of metadata standards for different domains. This report does not aim to overwhelm readers with them; instead, a typology of metadata standards illustrates what kinds of standards exist and why they matter.

Standards are like toothbrushes, everyone agrees that they’re a good idea, but nobody wants to use anyone else’s.
– Attributed to Murtha Baca, Getty Research Institute

Data Structure Standards are based on sets of metadata elements and schemas. These standards are containers of data about the information object, e.g., Dublin Core Metadata Element Set (DCMES), MARC, EAD, CDWA, VRA, etc.

Data Value Standards are based on controlled vocabularies. Such standards represent terms/values used to populate data structure standards or sets of metadata elements. E.g., LCSH, LC Thesaurus for Graphic Materials (TGM), ULAN, TGN, ICONCLASS, etc.

Data Content Standards contain cataloging rules and codes. These standards provide guidelines for the format and syntax of the data values used to populate metadata elements, e.g., Anglo-American Cataloguing Rules (AACR), RDA, ISBD, DACS.

Data Format and Technical Interchange Standards are in machine-readable form. Each is a manifestation of a specific data structure standard, encoded for machine processing. There is a long list of examples; the most common are MARC21, MARCXML, EAD XML DTD, METS, MODS, CDWA Lite XML schema, Simple Dublin Core XML schema, Qualified Dublin Core XML schema, and VRA Core 4.0 XML schema.

3- Metadata Types

Perhaps the most famous and widely used type of metadata is descriptive metadata, but it is not the only one. Different communities look at metadata from different angles and so arrive at new types of metadata or metadata standards. We discuss eight types below: descriptive, administrative (with its rights, preservation, and technical sub-types), structural, provenance, and meta-metadata.

3.1 Descriptive Metadata

The very first metadata standard, known as “Dublin Core,” is categorized as descriptive metadata. In November 1993, the National Center for Supercomputing Applications (NCSA) released the first web application able to display images and text together. It was a major step for the World Wide Web (WWW), and within the next two years, by early 1995, HTTP, FTP, and Telnet had taken over the market for transferring data. In March 1995, the Online Computer Library Center (OCLC), based in Dublin, Ohio, and NCSA held an invitation-only workshop whose main agenda was “metadata for the web.” The search engines we know today, such as Google, did not exist yet. The goal of the workshop was to reach consensus on a core set of metadata elements for describing web and network resources, and the discussion centred on the importance of descriptive metadata for the success of web search tools. The result was fifteen elements, known as the Dublin Core elements and shown in Figure 2. These elements can be extended by other metadata standards. Each element is a statement about a resource; e.g., the Creator element states who is primarily responsible for making the potentially informative object.

Let's write metadata about Apple Inc.: Title: Apple Inc; Creator: Steve Jobs; Creator: Steve Wozniak; Creator: Ronald Wayne; Date: April 1976; Description: Personal Computers Manufacturer.

Fig. 2: Dublin Core Metadata Elements
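As a rough sketch, the Apple record above could be encoded in machine-readable form with Python's standard `xml.etree.ElementTree` module. The fifteen element names and the namespace URI are those of the Dublin Core element set; the `metadata` wrapper element and the `to_xml` helper are assumptions chosen for illustration, not a prescribed encoding.

```python
import xml.etree.ElementTree as ET

# Namespace of the fifteen-element Dublin Core Metadata Element Set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

DC_ELEMENTS = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}

# A record is a list of (element, value) statements; elements may repeat.
apple_record = [
    ("title", "Apple Inc"),
    ("creator", "Steve Jobs"),
    ("creator", "Steve Wozniak"),
    ("creator", "Ronald Wayne"),
    ("date", "April 1976"),
    ("description", "Personal Computers Manufacturer"),
]


def to_xml(record):
    """Serialize a record as XML; the 'metadata' wrapper is a hypothetical choice."""
    root = ET.Element("metadata")
    for element, value in record:
        if element not in DC_ELEMENTS:
            raise ValueError(f"{element!r} is not a Dublin Core element")
        child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


xml_text = to_xml(apple_record)
print(xml_text)
```

Keeping the record as a list of pairs rather than a dictionary lets an element such as Creator appear several times, which the Apple example needs.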

Dublin Core Analysis: Because the Dublin Core elements were defined as a core, it is essential to analyze how successful that core has been. The audience expects a core to be adaptable and extensible at little cost, where cost includes money, time, and risk. Figure 3 shows the adoption of household technologies over the last century to give an idea of typical adoption rates. To assess the success of Dublin Core, recall the objective of the 1995 OCLC workshop in Dublin: the importance of metadata for the success of web search engines. Successful search engines like Google and Yahoo instead rose by using full-text search and by taking advantage of the network structure of the web and other web features. Measured against its original purpose, Dublin Core therefore seems to fall short, because search engines did not build their foundations on it. Should we declare Dublin Core a failure? No: the first initiative to implement the RDF data model came out of Dublin Core. Well-known projects that use RDF data models include the Digital Public Library of America, Europeana, and DBpedia.

DBpedia aims to extract structured information from Wikipedia. This information is stored as RDF and published on the World Wide Web. It allows Wikipedia resources to be queried semantically for their properties, their relationships, and their links to other RDF ontologies. It is also known as one of the best efforts in decentralized Linked Data.
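The idea behind querying RDF semantically can be sketched in a few lines of plain Python: statements are stored as subject, predicate, object triples, and a query is a pattern with wildcards. The `ex:` identifiers and the `match` helper below are hypothetical and only illustrate the principle; they are not real DBpedia identifiers, and a real system would use a query language such as SPARQL.

```python
# Each statement is a (subject, predicate, object) triple.
triples = [
    ("ex:Mona_Lisa", "ex:creator", "ex:Leonardo_da_Vinci"),
    ("ex:Mona_Lisa", "ex:type", "ex:Painting"),
    ("ex:Leonardo_da_Vinci", "ex:birthPlace", "ex:Vinci"),
    ("ex:The_Last_Supper", "ex:creator", "ex:Leonardo_da_Vinci"),
]


def match(store, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Which resources were created by Leonardo da Vinci?
works = [s for s, _, _ in match(triples, p="ex:creator", o="ex:Leonardo_da_Vinci")]

# Follow a relationship: where was the creator of the Mona Lisa born?
creator = match(triples, s="ex:Mona_Lisa", p="ex:creator")[0][2]
birthplace = match(triples, s=creator, p="ex:birthPlace")[0][2]
```

The second query shows why relationships matter: the birthplace is never stated about the painting itself, yet it can be reached by following the link from the painting to its creator.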

Europeana was started to preserve European cultural heritage in digital form. Leonardo da Vinci's famous Mona Lisa is one example of what can be found through Europeana. More than 3000 institutions have contributed to Europeana, which lets users explore Europe's cultural and scientific heritage.

Fig. 3: Household adoption of technologies over the last century [4]

The nice thing about standards is that there are so many of them to choose from. — Admiral Grace Hopper

3.2 Administrative Metadata

Administrative metadata provides information about the complete lifecycle of a resource, and this information is used to administer the resource. It is a large umbrella covering three main types of metadata: preservation, rights, and technical metadata. Managing a resource requires every piece of information to be stored and analyzed in a way that is both useful and extensible. The following subsections discuss these three types.

Rights Metadata: It provides information about the access control rules and regulations that apply to a resource. Digital resources often face copyright issues, so a schema is needed to capture data about the rights attached to a resource; recall the “rights” element of Dublin Core. The Dublin Core standard has been extended with three more elements: 1) Access Rights: the policies and rights governing access to the resource, 2) Rights Holder: an individual or an organization, and 3) License: a legal document.

Preservation Metadata: Keeping a resource alive and usable throughout its life cycle requires supporting information, which preservation metadata provides. The PREMIS (PREservation Metadata: Implementation Strategies) schema is the most fully developed metadata schema for supporting the preservation of a resource. In other words, preservation metadata is the information a repository (an e-resource collection) uses to support the process of digital preservation. For example, if the process description is to store a specific type of medicine in an environment with 25 percent relative humidity, its PREMIS diagram maps to the architecture shown in Figure 4.

Fig. 4: PREMIS Component Diagram [4]

Technical Metadata: It addresses the system-level technical details of how a resource functions. The most common example is digital photography. Modern smartphones and digital cameras automatically generate rich metadata records and embed them in each captured photograph (image file). Exchangeable image file format, also known as “Exif,” is one of the well-known metadata schemas used by most modern digital devices. Figure 5 shows the Exif metadata schema as used by a Canon EOS camera.

Fig. 5: Exif Metadata Schema [4]

3.3 Structural Metadata

We are all used to watching digital videos, and structural metadata plays a significant role behind the digital curtain. MPEG-21 is an ISO standard that provides an open framework for applications to work with multimedia files. The heart of MPEG-21 is the digital item: a structured digital object, e.g., a movie that includes videos, audio tracks, and images organized in a specific way. In this case, the movie is the resource, and structural metadata captures the information about its organization. Besides the video items themselves, MPEG-21 provides information about the correct playlist order.
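The role of structural metadata can be illustrated with a small Python sketch in which each part of a movie carries its kind and its position, so that the correct play order can be recovered however the files happen to be stored. The `Component` class, the file names, and the `playlist` helper are made up for this example and do not follow the actual MPEG-21 syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    file_name: str
    kind: str      # "video", "audio", or "image"
    sequence: int  # position in the intended playback order


# The parts of one movie, deliberately listed out of order.
movie_parts = [
    Component("part2.mp4", "video", 2),
    Component("poster.jpg", "image", 0),
    Component("part1.mp4", "video", 1),
    Component("soundtrack.mp3", "audio", 1),
]


def playlist(parts, kind):
    """Return the file names of one kind in their intended order."""
    ordered = sorted(parts, key=lambda component: component.sequence)
    return [c.file_name for c in ordered if c.kind == kind]


video_order = playlist(movie_parts, "video")
```

The files themselves say nothing about their order; it is the `sequence` value, a piece of structural metadata, that makes the movie playable as intended.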

3.4 Provenance Metadata

It is impossible to track the end-to-end history of a resource without information about its related entities. Provenance metadata provides a mechanism to record data about those entities and about their relationships with the resource. In this sense, provenance metadata is a way to determine the position of a resource in a social network and to give it context. For example, a wiki stores every edit made to any of its pages, which lets wiki users walk through the historical timeline of a page along with information about the editors (at least their IP addresses) and their comments.
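A wiki-style edit history can be sketched as a list of revision records, from which the timeline of a page and its editors can be reconstructed. The `Revision` fields, the sample entries, and the helper functions are assumptions for illustration only; the IP address comes from a range reserved for documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    timestamp: str  # ISO 8601 in UTC, so string order equals time order
    editor: str     # user name, or an IP address for anonymous edits
    comment: str


# Provenance of one hypothetical page, stored in arbitrary order.
history = [
    Revision("2020-07-02T09:15:00Z", "203.0.113.7", "fix typo"),
    Revision("2020-07-01T18:30:00Z", "alice", "create page"),
    Revision("2020-07-03T12:00:00Z", "bob", "add references"),
]


def timeline(revisions):
    """Return the revisions from oldest to newest."""
    return sorted(revisions, key=lambda revision: revision.timestamp)


def editors(revisions):
    """Return distinct editors in order of their first contribution."""
    seen = []
    for revision in timeline(revisions):
        if revision.editor not in seen:
            seen.append(revision.editor)
    return seen


first_edit = timeline(history)[0]
page_editors = editors(history)
```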

3.5 Meta-Metadata

The Metadata Encoding and Transmission Standard (METS) started in the early 2000s in response to an enormous increase in data from digital resources such as libraries, museums, archives, and cultural heritage collections. This growth brought an exponential increase in metadata schemas and standards for those resources. Popular repositories that came into existence include arXiv.org, Fedora, EPrints, and DSpace; a few of them are still well known and up to date. This variety made it hard to reproduce the content and functionality of data across systems. To solve this problem, METS provides a standard structure for metadata about resources and enables data exchange among different repositories.

METS creates documents for metadata records. A METS document is a mechanism for expressing the relationships that exist between a digital library object and its pieces of content. A METS document has seven parts: the header, descriptive metadata, administrative metadata, file section, structural map, structural links, and behavior.

4- Domain-Specific Metadata

Metadata is everywhere, but some of the most visible areas are healthcare, the environment, geospatial data, education, the music industry, and the automobile industry. Below we look at five domains in detail: music, education, transcripts, publishing, and geospatial data.

4.1 Music Industry

We all love music. The music industry focuses on releasing new and unique kinds of music using the latest research tools and technology. Pandora*, a popular online music service, makes extensive use of metadata. Descriptive metadata is currently an active area of development in the classical music industry. The Music Genome Project is the heart of the Pandora service. It consists of around 450 features for describing a piece of music; these features are the elements of a metadata schema. Pandora has hired a team of musicians who listen to every song Pandora licenses and map the characteristics of each song onto the features of the Music Genome Project. Some of the features are key, tempo, beats per minute, and the gender of the vocalist. Evolution in the music industry is driven mostly by genre and technology, and metadata is the best way to keep track of this evolution.

* https://www.pandora.com/

4.2 Education

Education is a broad field, and plenty of learning resources are available online to support learning pathways. Metadata comes into the picture when learning objects need to be standardized. In 2002, the Institute of Electrical and Electronics Engineers (IEEE) announced the Learning Object Metadata (LOM) standard for describing learning objects.

Teaching is the other side of learning, and learning objects support both around a single learning objective. Because most of these learning resources are digital, it is easy to standardize their distribution under one body. LOM defines a set of categories, each containing a specific set of elements. As a result of this initiative, many education systems adopted LOM, e.g., learning management systems (LMS) used in K-12². LOM categories include the Educational category, which comprises elements such as TypicalAgeRange and TypicalLearningTime, and the Rights category, which comprises the Copyright element.

4.3 Transcripts

Since the heading does not say much about this domain, let us dig into it. Educational institutes issue transcripts, degrees, and certificates to students, but not every institute in a state is linked with the others. A reliable way to verify transcripts without physical mail was therefore badly needed. Parchment³ is a company that uses metadata to develop schemas that represent students' degree programs and courses in a well-structured way. This area has only recently been standardized in higher education. Parchment facilitates the verification of transcripts across institutes and companies by enabling easy import and export of students' transcripts and credentials.

4.4 Publishing

Publishing and descriptive metadata have been interrelated for many decades. Traditionally, this metadata consisted only of publisher details, publication date, ISBN, and so on. With the arrival of ebooks and online self-publishing platforms, it has attracted wider attention. Amazon Kindle Direct Publishing and Lulu are a few of the modern self-publishing platforms. It has been observed that the quality and richness of the metadata attached to a publication are critical, because they determine whether readers discover the title or not.

2 https://en.wikipedia.org/wiki/K%E2%80%9312

3 https://www.parchment.com/

4.5 Geospatial Metadata

Maps are in everyone's pocket nowadays, and most businesses also use maps to visualize different aspects of their projects. Geospatial data makes extensive use of metadata. Geospatial metadata describes maps, Geographic Information System (GIS) files, imagery, and other location-based resources. The metadata is part of the dataset, and it provides context for the data.

Metadata contains information about the data’s origin, custodianship, copyrights, and reuse. Metadata is now widely used in spatial data communities for sharing/transferring the information. Geographic metadata is responsible for making users aware of geographic data’s limitations, suitability, indexing, and restrictions.

Geospatial Metadata Standards: The first geographic metadata standard, ISO 19115, was released by ISO in 2003. It was later endorsed by the Federal Geographic Data Committee (FGDC) [2] in 2010. ANZLIC (the Spatial Information Council) provides geospatial metadata standards for Australia and New Zealand.

Geospatial Metadata Tools: There are a few well-known names in the Geographic Information Systems (GIS) space, e.g., ArcGIS, pycsw, and OSGeo. ArcGIS is the most famous and the most widely adopted in industry because it enables users to create and use maps, compile geographic data, and share and manage geographic information. pycsw and OSGeo are used as frameworks for managing and creating geospatial data.

5- Use of Metadata

Because metadata does its job well in the background, we need to look at it from another perspective too. Our phone call details, online purchases, money transactions, and web surfing all create a lot of metadata. We are often unaware that the same metadata that can be used to detect fraud can also be used against us. Similarly, the decisions that large companies and institutions such as Amazon, banks, armies, and agencies take on the basis of metadata can affect people's lives abruptly.

General Michael Hayden, former director of the NSA and the CIA, said “We kill people based on metadata” in a panel debate at Johns Hopkins University in 2014. The metadata used by the NSA and the CIA is about individuals and their networks. These agencies collect phone call data directly from carriers, analyze it in combination with metadata obtained from other sources, and make decisions about individuals. Figure 6 illustrates this process, from collecting metadata to making a decision.

Fig. 6: Metadata use by Authorities

5.1 Data Exhaust

With the advent of smart technologies and apps, we produce a great deal of data in our day-to-day activities. This data can be collected and used by the authorities (as explained earlier) or by providers, e.g., Amazon. Data exhaust is a by-product of other processes: unlike metadata, which is created deliberately, it is produced incidentally as a result of other activities. Figure 7 illustrates an example. A famous e-commerce company once sent a flyer for baby-care items to a customer, which revealed the pregnancy of the minor customer to her parents before she had told them. The decision to send it was based entirely on an analysis of the customer's purchase pattern.

Fig. 7: Data Exhaust Example

5.2 Paradata

This term is mostly used for metadata about learning resources, in both education and research. In the context of education, paradata is about educational resources. In the context of research methodology, it is mostly used to create metadata records and schemas for the large, sometimes confidential, datasets used in extensive experiments, for example, metadata records about the origin of a dataset and the timeline of its collection and use.

6- Conclusion

This report has discussed a few aspects of metadata, but it cannot capture every detail given the scope of the topic. Metadata is thriving in every domain, and it has good and bad sides depending on how it is used. The author of this report regards metadata as a parasite and as a matter of perspective: we often do not know whether it is there, whether it is harmful, what type it is, or what its usage and characteristics are.

References

1. History of information (1968), http://www.historyofinformation.com/detail.php?entryid=4241

2. FGDC (2010), https://www.fgdc.gov/metadata

3. LePage, A.: Introduction to metadata.

4. Pomerantz, J.: Metadata. MIT Press (2015)

