Managing Research Data for Public Access: FAQs and Definitions

What is a data management plan?

"A data management plan (DMP) is a written document that describes the data you expect to acquire or generate during the course of a research project, how you will manage, describe, analyze, and store those data, and what mechanisms you will use at the end of your project to share and preserve your data." —Stanford Libraries: Data management plans

Do I really have to keep and share all of my data?

The short answer is "no," but you are required to retain, share and make accessible data that validates your research findings. You should also consider preserving/sharing data that:

  • Captures a one-time event.
  • Will be costly, difficult or impossible to replicate.
  • Has long-term value.

What is "research data"?

This guide uses the terms "data" and "research data" interchangeably. The definition of research data used by U.S. DOT is adapted from OMB Circular A-110:

Research data is defined as the recorded factual material commonly accepted in the scientific community as necessary to validate research findings, but not any of the following: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues. This "recorded" material excludes physical objects (e.g., laboratory samples). Research data also do not include:

(A) Trade secrets, commercial information, materials necessary to be held confidential by a researcher until they are published, or similar information which is protected under law; and

(B) Personnel and medical information and similar information the disclosure of which would constitute a clearly unwarranted invasion of personal privacy, such as information that could be used to identify a particular person in a research study.

What is metadata?

Metadata, commonly called "data about data," is information that describes data. Good metadata enables others to understand and reuse data that they themselves did not create. A minimum amount of metadata should be agreed upon and implemented before starting data collection. Data collection and documentation are easier if you know what you need to collect and how to record it. This also helps maintain data consistency and quality.

There are many different ways to record and share metadata. Some of the most common methods are:

  • A data dictionary is an effective and concise way to describe the elements or variables that make up your dataset. (For more information, see DataOne's page on Data Dictionaries.)
  • Metadata schemas are formal frameworks for recording and describing data. These are often used in projects that involve large, collaborative data gathering or data sharing efforts. U.S. DOT uses the Project Open Data Metadata Schema.
  • Readme files come in a variety of styles. They provide explanations of additional information, often with elements of a data dictionary or metadata schema mixed in. They are the least formal way to document data.
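As an illustration of the first method, a data dictionary can itself be kept in a structured form, so that it can be shared alongside the dataset and checked against it. The following is a minimal sketch in Python, not a prescribed format: the dataset, the variable names, and the four fields (variable, type, units, description) are all hypothetical and chosen only for the example.

```python
import csv
import io

# Hypothetical data dictionary for an invented traffic-count dataset.
# The field names below are an assumption for this example, not a standard.
FIELDS = ["variable", "type", "units", "description"]

DATA_DICTIONARY = [
    {"variable": "site_id", "type": "string", "units": "",
     "description": "Identifier of the counting station"},
    {"variable": "count_date", "type": "date", "units": "YYYY-MM-DD",
     "description": "Date on which vehicles were counted"},
    {"variable": "vehicle_count", "type": "integer", "units": "vehicles per day",
     "description": "Total vehicles observed at the station"},
]


def dictionary_to_csv(entries):
    """Serialize the data dictionary entries to CSV text for sharing."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


def undocumented_columns(dataset_header, entries):
    """Return dataset columns that have no entry in the data dictionary."""
    documented = {entry["variable"] for entry in entries}
    return [column for column in dataset_header if column not in documented]


if __name__ == "__main__":
    print(dictionary_to_csv(DATA_DICTIONARY))
    # "lane" is deliberately missing from the dictionary above.
    print(undocumented_columns(["site_id", "count_date", "lane"], DATA_DICTIONARY))
```

A check like `undocumented_columns` is one way to confirm, before data collection ends, that every variable in the dataset has been described.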

What is a data repository?

Data repositories are devoted to keeping data accessible, safe and secure. They use special software, metadata, workflows and networks to meet these goals. Data repositories also help guarantee authenticity by providing control mechanisms and change logs. For these reasons, repositories are ideal for research data sharing, distribution and preservation.

Data repositories often have limits and restrictions governing which data they accept. Most have rules covering data formats and size limits, and require that data be documented. Some accept data from any research area, while others will only accept research from specific domains (such as biology or social sciences). The latter are known as disciplinary data repositories. Another type of specialized repository is the institutional data repository, which focuses on collecting the outputs of a specific organization, such as a university or federal agency.

See Submit to a Repository for more information and resources for locating data repositories.

What is "machine-readable data"?

Machine-readable data is data that can be read and processed by a computer. By comparison, human-readable data can only be read (and understood) by a human. It is important to understand that charts, graphs and most tables are not machine-readable, but the data they were generated from probably is.

Examples of human-readable data include books, PDF representations of data (charts, graphs, tables, etc.), and datasets which have not been structured to be read by computers.

Examples of machine-readable data include data that has been encoded with a markup language (HTML, XML, etc.), datasets that have been structured to be read by computers, and data that is encoded for machine processing and is not human-readable.
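To make the distinction concrete, the short Python sketch below reads a small structured dataset and computes a total without any human interpretation of the values. The CSV content, the column names, and the function name are invented for illustration. The same figures printed in a chart or a scanned table could not be processed this way without first being re-entered or extracted.

```python
import csv
import io

# Hypothetical machine-readable dataset: comma-separated values with a header row.
CSV_TEXT = "site_id,vehicle_count\nA1,1200\nB2,950\n"


def total_count(csv_text, column):
    """Sum an integer column of a CSV dataset supplied as text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(int(row[column]) for row in reader)


if __name__ == "__main__":
    print(total_count(CSV_TEXT, "vehicle_count"))
```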

What does "digitally accessible" mean?

Making data digitally accessible is part of making data machine-readable. There is no clear definition of this term, but it is generally understood that:

  1. The data must be stored in a digital format (i.e., on a computer).
  2. The data should be available online, be machine-readable, and shared as freely as possible.