Metadata Basics

A simple definition of metadata is the data you provide about your data. Good metadata is essential because it keeps your research data understandable in the future, both for you and for other researchers who want to use your data and interpret your findings. Your field of study, repository, or grant funder may suggest or require that a specific metadata standard be used when sharing research data. You can learn more about metadata standards for different disciplines using the Disciplinary Metadata Guide (via the Digital Curation Centre) or the Metadata Standards Catalog (a community-driven project managed by the Research Data Alliance).

If your discipline does not have any metadata standards, another way to provide metadata is to write your own README files.

What is a README file?

A README file, as its name suggests, asks the user to read it first in order to make sense of the associated files, project folders, data, or code. READMEs are saved in plain text formats (.txt or .md) so that anybody can open them without proprietary software. A README should be created at each folder level, alongside the other research files, and should describe:

the content within the folders
how content relates to other content
how to interpret data files or code
how the data was generated, processed, or transformed
any restrictions to view or access the content
other instructions to better understand the research data files
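
The recommendation to keep a README at each folder level can be checked automatically. The following is a minimal sketch, not a prescribed tool: the function name `folders_missing_readme` and the accepted file names (`README.txt`, `README.md`, compared case-insensitively) are assumptions made for illustration, and hidden folders such as `.git` are skipped.

```python
from pathlib import Path

# Assumed README file names (lowercased); adjust to your own naming convention.
README_NAMES = {"readme.txt", "readme.md"}


def folders_missing_readme(root):
    """Return the folders under `root` (including `root`) that contain no README."""
    root = Path(root)
    subfolders = sorted(p for p in root.rglob("*") if p.is_dir())
    missing = []
    for folder in [root, *subfolders]:
        # Skip hidden folders (for example .git) and anything inside them.
        if any(part.startswith(".") for part in folder.relative_to(root).parts):
            continue
        file_names = {p.name.lower() for p in folder.iterdir() if p.is_file()}
        if not file_names & README_NAMES:
            missing.append(folder)
    return missing
```

Calling `folders_missing_readme("my_project")` (a hypothetical folder name) returns a list of paths that still need a README, which you can print or review before sharing the project.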

Use a README template

There are many templates you can find online to download and adapt for your research project. Download a template below, then read about good practices and minimum requirements when creating a general README file.

README Template
Download a README metadata template. This template was adapted from Cornell Data Services (https://data.research.cornell.edu/data-management/sharing/readme/).

If you have never written a README file or it has been a long time since you created your last one, read the following for general best practices to get started.

Create READMEs for logical "clusters" of related files or data. This can mean:
- One document for a dataset that has multiple, related, similarly formatted files, or files that are logically grouped together for use
- One document for a single data file depending on the size or complexity of the data
- Multiple READMEs for a larger dataset.
Name the README so it is easily associated with the file(s) it describes.
A README should be a plain text file, but it should still be formatted clearly and consistently.
- Use spatial formatting so information is easy to understand (e.g. separate important pieces of information with blank lines, rather than having all the information in one long paragraph).
- Format multiple README files identically, presenting the information in the same order and using the same terminology.
- Use standardized date formats like the international standard notation of YYYY-MM-DD or YYYY-MM-DDThh:mm:ss.

*Adapted from Princeton Research Data Service and Cornell Data Services.*
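
As an illustration of the standardized date formats recommended above, the sketch below uses Python's standard `datetime` module to produce YYYY-MM-DD and YYYY-MM-DDThh:mm:ss strings. The dates and the day/month/year input format are made-up example values, not taken from any real dataset.

```python
from datetime import date, datetime

# A collection date written as YYYY-MM-DD.
collected_on = date(2024, 3, 5)
print(collected_on.isoformat())  # 2024-03-05

# A timestamp written as YYYY-MM-DDThh:mm:ss.
measured_at = datetime(2024, 3, 5, 14, 30, 0)
print(measured_at.isoformat(timespec="seconds"))  # 2024-03-05T14:30:00

# Converting an ambiguous day/month/year string into the standard notation.
# The input format "%d/%m/%Y" is an assumption about how the date was recorded.
parsed = datetime.strptime("05/03/2024", "%d/%m/%Y").date()
print(parsed.isoformat())  # 2024-03-05
```

Writing dates this way avoids the day/month ambiguity of notations such as 05/03/2024 and lets file listings sort chronologically.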

The following content is recommended when writing a README file for optimal sharing and data reuse. Fill out as much of this metadata as applies to your research.

Minimum recommended content for data re-use is in bold.

General information

Dataset title
Name, institution, address, email, and ORCID for
- Principal investigator (or person responsible for collecting the data)
- Associate or co-investigators
- Contact person for questions
Date of data collection (single date or a range)
Information about geographic location of data collection
Keywords used to describe the data topic
Language information
Information about funding sources that supported the collection of the data

Data and file overview

For each filename, a short description of what data it contains
- NOTE: When working with a large number of files, a short description about what each collection of similar files contains may be best
Format of the file if not obvious from the file name
If the dataset includes multiple files that relate to one another, the relationship between the files, or a description of the file structure that holds them (possible terminology might include "dataset" or "study" or "data package")
Date that the file was created
Date(s) that the file(s) were updated (versioned) and the nature of the update(s), if applicable
Information about related data that were collected but are not in the described dataset
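
A first draft of the per-file overview can be generated from the folder itself and then completed by hand. This is a rough sketch under stated assumptions: the function name `file_inventory` and the tab-separated line layout are invented for illustration, the format is guessed from the file extension, and the date shown is the last-modified time reported by the file system (in UTC), because a true creation date is not reliably available on every platform.

```python
from datetime import datetime, timezone
from pathlib import Path


def file_inventory(folder):
    """Return one draft README line per file in `folder`, sorted by file name."""
    lines = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        # Last-modified time in UTC, written as YYYY-MM-DD.
        modified = datetime.fromtimestamp(
            path.stat().st_mtime, tz=timezone.utc
        ).date().isoformat()
        # Guess the format from the extension; "unknown" when there is none.
        file_format = path.suffix.lstrip(".").lower() or "unknown"
        lines.append(
            f"{path.name}\tformat: {file_format}\t"
            f"last modified: {modified}\tdescription: TODO"
        )
    return lines
```

The `description: TODO` placeholder is deliberate: the short description of what each file contains has to come from the researcher, not from the file system.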

Sharing and access information

Licenses or restrictions placed on the data
Links to publications that cite or use the data
Links to other publicly accessible locations of the data
Recommended citation for the data

Methodological information

Description of methods for data collection or generation (include links or references to publications or other documentation containing experimental design or protocols used)
Description of methods used for data processing (describe how the data were generated from the raw or collected data)
Any software or instrument-specific information needed to understand or interpret the data, including software and hardware version numbers
Standards and calibration information, if appropriate
Description of any quality-assurance procedures performed on the data
Definitions of codes or symbols used to flag low-quality or questionable values, or outliers, that people should be aware of
People involved with sample collection, processing, analysis and/or submission

Data-specific information (Repeat this section as needed for each dataset or file, as appropriate)

Number of variables, and number of cases or rows
Variable list, including full names and definitions (spell out abbreviated words) of column headings for tabular data
Units of measurement
Definitions for codes or symbols used to record missing data
Specialized formats or other abbreviations used

*Adapted from Princeton Research Data Service and Cornell Data Services.*
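
For tabular data, several of the data-specific items (number of variables, number of rows, the variable list, and how often missing-data codes occur) can be drafted with a short script. The sketch below uses only Python's standard `csv` module; the function name `summarize_csv` and the missing-data codes (an empty cell and `NA`) are assumptions for illustration, so replace them with the codes your dataset actually uses.

```python
import csv


def summarize_csv(path, missing_codes=("", "NA")):
    """Summarize a CSV file whose first row holds the column headings."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        variables = list(reader.fieldnames or [])
        missing = {name: 0 for name in variables}
        row_count = 0
        for row in reader:
            row_count += 1
            for name in variables:
                # Short rows yield None for absent cells; treat them as empty.
                if (row[name] or "").strip() in missing_codes:
                    missing[name] += 1
    return {
        "n_variables": len(variables),
        "n_rows": row_count,
        "variables": variables,
        "missing": missing,
    }
```

The summary only lists column headings as they appear in the file. Full variable names, definitions, and units of measurement still need to be written out by hand in the README.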