Gibson D. Lewis Library Libguides

Research Data Management

Metadata Basics

A simple definition of metadata is the data you provide about your data. Good metadata makes research data understandable in the future, both for you and for other researchers who want to use your data and interpret your findings. Different fields of study, repositories, or grant funders may suggest or require that a specific metadata standard be used when sharing research data. You can learn more about metadata standards for different disciplines using the Disciplinary Metadata Guide (via the Digital Curation Centre) or the Metadata Standards Catalog (a community-driven project managed by the Research Data Alliance).

If your discipline does not have any metadata standards, another way to provide metadata is to write your own README files.

What is a README file?

A README file, as its name suggests, asks the user to read it first in order to make sense of the associated files, project folders, data, or code. READMEs are saved in plain text formats (.txt or .md) so that anyone can open them without proprietary software. A README should be created for each folder level alongside other research files and should describe:

  • the content within the folders
  • how content relates to other content
  • how to interpret data files or code
  • how the data was generated, processed, or transformed
  • any restrictions to view or access the content
  • other instructions to better understand the research data files
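
As a starting point for describing the content within a folder, the file list can be drafted automatically and then completed by hand. The following is a minimal sketch in Python, not a prescribed method: the function name `inventory_lines`, the line layout, and the `TODO` placeholder are assumptions made for illustration.

```python
from datetime import datetime, timezone
from pathlib import Path


def inventory_lines(folder):
    """Return one plain-text line per file in `folder` (not recursive).

    Existing README files are skipped, since they describe the other files.
    The layout of each line is an illustrative assumption, not a standard.
    """
    lines = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file() or path.name.lower().startswith("readme"):
            continue
        # Last-modified time, converted to UTC so the date is reproducible.
        modified = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
        lines.append(
            f"{path.name}  |  format: {path.suffix or 'none'}  |  "
            f"last modified: {modified.date().isoformat()}  |  description: TODO"
        )
    return lines
```

Calling `"\n".join(inventory_lines("my_project/data"))` (a hypothetical folder path) gives a block of text to paste into the README. Each `TODO` still needs a description written by someone who knows the data.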

Use a README template

There are many templates you can find online to download and adapt for your research project. Download a template below, then read about good practices and minimum requirements when creating a general README file.

Write a README

If you have never written a README file or it has been a long time since you created your last one, read the following for general best practices to get started.

  • Create READMEs using logical "clusters" of related files or data. This can mean:
    • One document for a dataset that has multiple, related, similarly formatted files, or files that are logically grouped together for use
    • One document for a single data file depending on the size or complexity of the data
    • Multiple READMEs for a larger dataset.
  • Name the README so it is easily associated with the file(s) it describes.
  • README documents should be plain text files, but they should still be clearly formatted.
    • Use spatial formatting so information is easy to understand (e.g. separate important pieces of information with blank lines, rather than having all the information in one long paragraph).
    • Format multiple README files identically, presenting the information in the same order and using the same terminology.
    • Use standardized date formats like the international standard notation of YYYY-MM-DD or YYYY-MM-DDThh:mm:ss.

*Adapted from Princeton Research Data Service and Cornell Data Services.*
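
The formatting practices above (blank lines between pieces of information, an identical layout across READMEs, and standardized dates) can be sketched in Python. This is one possible approach under stated assumptions, not a required tool: the function names `format_section` and `build_readme` and the heading style are hypothetical.

```python
from datetime import date, datetime


def format_section(title, entries):
    """Render one README section: a heading, an underline, then 'label: value' lines."""
    lines = [title.upper(), "-" * len(title)]
    for label, value in entries:
        # datetime is checked first because it is a subclass of date.
        if isinstance(value, datetime):
            value = value.isoformat(timespec="seconds")  # YYYY-MM-DDThh:mm:ss
        elif isinstance(value, date):
            value = value.isoformat()  # YYYY-MM-DD
        lines.append(f"{label}: {value}")
    return "\n".join(lines)


def build_readme(sections):
    """Join sections with a blank line so each one stands apart in plain text."""
    return "\n\n".join(format_section(t, e) for t, e in sections) + "\n"
```

Because every README built this way passes through the same two functions, the sections appear in the order given and the dates always use the same notation.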

The following content is recommended when writing a README file for optimal sharing and data reuse. It is important to fill out as much metadata as possible depending on your research.

Minimum recommended content for data re-use is in bold.

General information
  • Dataset title
  • Name, institution, address, email, and ORCID iD for
    • Principal investigator (or person responsible for collecting the data)
    • Associate or co-investigators
    • Contact person for questions
  • Date of data collection (single date or a range)
  • Information about geographic location of data collection
  • Keywords used to describe the data topic
  • Language information
  • Information about funding sources that supported the collection of the data
Data and file overview
  • For each filename, a short description of what data it contains
    • NOTE: When working with a large number of files, a short description about what each collection of similar files contains may be best
  • Format of the file if not obvious from the file name
  • If the dataset includes multiple files that relate to one another, the relationship between the files, or a description of the file structure that holds them (possible terminology might include "dataset" or "study" or "data package")
  • Date that the file was created
  • Date(s) that the file(s) was updated (versioned) and the nature of the update(s), if applicable
  • Information about related data collected but that is not in the described dataset
Sharing and access information
  • Licenses or restrictions placed on the data
  • Links to publications that cite or use the data
  • Links to other publicly accessible locations of the data
  • Recommended citation for the data
Methodological information
  • Description of methods for data collection or generation (include links or references to publications or other documentation containing experimental design or protocols used)
  • Description of methods used for data processing (describe how the data were generated from the raw or collected data)
  • Any software or instrument-specific information needed to understand or interpret the data, including software and hardware version numbers
  • Standards and calibration information, if appropriate
  • Description of any quality-assurance procedures performed on the data
  • Definitions of codes or symbols used to flag low-quality, questionable, or outlying values that people should be aware of
  • People involved with sample collection, processing, analysis and/or submission
Data-specific information (Repeat this section as needed for each dataset or file, as appropriate)
  • Number of variables, and number of cases or rows
  • Variable list, including full names and definitions (spell out abbreviated words) of column headings for tabular data
  • Units of measurement
  • Definitions for codes or symbols used to record missing data
  • Specialized formats or other abbreviations used

*Adapted from Princeton Research Data Service and Cornell Data Services.*
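
For tabular data, some of the data-specific information listed above (number of variables, number of rows, the variable list, and how often missing-data codes occur) can be counted rather than typed by hand. The sketch below uses Python's standard `csv` module. The function name `summarize_table` and the missing-data codes (an empty cell or `NA`) are assumptions for illustration, so replace them with the codes your dataset actually uses.

```python
import csv

# Assumption: these strings mark missing data; adjust to match your dataset.
MISSING_CODES = {"", "NA"}


def summarize_table(text_stream, missing_codes=MISSING_CODES):
    """Summarize a CSV file whose first row holds the column headings."""
    reader = csv.DictReader(text_stream)
    variables = list(reader.fieldnames or [])
    missing = {name: 0 for name in variables}
    row_count = 0
    for row in reader:
        row_count += 1
        for name in variables:
            value = row.get(name)
            # A short row leaves trailing columns as None; count those as missing.
            if value is None or value.strip() in missing_codes:
                missing[name] += 1
    return {
        "variables": variables,
        "variable_count": len(variables),
        "row_count": row_count,
        "missing": missing,
    }
```

A typical call would be `summarize_table(open("survey.csv", newline="", encoding="utf-8"))`, where `survey.csv` is a hypothetical file name. The counts only cover what can be computed. Full variable names, definitions, and units of measurement still have to be written by the researcher.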