Research Data Management: Data Packages

The Research Data Management Portal is designed to provide guidance, best practices, and resources on the steps within the research data lifecycle and on how those steps relate to the requirements of established data management practices.

What is in a Data Package?

Components of a Data Package (Dataset, README, Data Dictionary, DCAT-US Metadata, Data Management Plan, Code Book)

Data Package

What is a Data Package?

A “Data Package” is the dataset, the data management plan (DMP), and all other documentation needed to contextualize the dataset for any and all users and re-users.

Types of Data Packages:

  • Submission Information Package (SIP): Items that have been submitted by the depositor.

  • Archival Information Package (AIP): A package that contains data that will be stored within a digital archive.

  • Dissemination Information Package (DIP): A package created from the Archival Information Package (AIP) to distribute digital content to users.

Breaking Down A Data Package

Best data file format for long-term preservation of tabular data => .CSV

  • Open, Non-Proprietary File Format
  • Backward and Forward Compatible
  • Long-Term Preservable

Best Practice Tip: In order for data to be machine readable and interoperable, it should have only two types of rows: a single header row, followed by rows that contain only data.
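The two-row-type rule can be checked automatically before a file is deposited. The following is a minimal sketch, not an NTL tool: the function name check_tabular_shape and the sample ferry data are made up for illustration. It assumes the header is the first row, and it flags title rows, rows with the wrong number of fields, and blank or duplicate column names.

```python
import csv
import io


def check_tabular_shape(csv_text):
    """Return a list of problems found; an empty list means none were detected."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["file is empty"]

    problems = []
    header = rows[0]
    if any(not name.strip() for name in header):
        problems.append("header row has a blank column name")
    if len(set(header)) != len(header):
        problems.append("header row has duplicate column names")

    # Every remaining row must have exactly one value per header column.
    for row_number, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"row {row_number} has {len(row)} fields, expected {len(header)}"
            )
    return problems


# Hypothetical sample data: one clean file, one with a title line above the header.
good = "operator,year,vessels\nAcme Ferries,2016,4\nBay Lines,2016,7\n"
bad = "Ferry census 2016\noperator,year,vessels\nAcme Ferries,2016,4\n"

print(check_tabular_shape(good))
print(check_tabular_shape(bad))
```

A title line, a units row, or a totals row would all show up here as rows whose field count does not match the header. That kind of context belongs in the README rather than in the data file.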

The purpose of the README document is to provide the contextual information that could not be included in the data file because it would break the data.

  • NTL README Template Contents:
    • General Information
    • Sharing Access Information
    • Data and Related File Overview
    • Methodological Information
    • Data-Specific Information and Data Dictionary
    • Appendices

Best Practice Tip: In order for you and future users to know which of the dozens of “README.txt” files on your desktop go with which datasets, use clarifying, human-readable file names, such as:

bts_osp_national_census_ferry_operators_2016_README_2017_10_26.txt
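A naming pattern like the one above can be generated rather than typed by hand. This is a small sketch under assumptions: the helper name readme_filename is hypothetical, and the pattern (lowercase dataset name, then README, then the date the README was written) is inferred from the single example above rather than from a published NTL rule.

```python
from datetime import date


def readme_filename(dataset_name, written_on):
    """Build a README file name from a dataset name and the README's date."""
    # Lowercase the name and join its words with underscores.
    slug = "_".join(dataset_name.lower().split())
    return f"{slug}_README_{written_on:%Y_%m_%d}.txt"


print(readme_filename("bts osp national census ferry operators 2016", date(2017, 10, 26)))
# prints: bts_osp_national_census_ferry_operators_2016_README_2017_10_26.txt
```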

Data dictionaries store and communicate metadata about data in a database, a system, or data used by applications. Data dictionary contents can vary but typically include some or all of the following:

  • A listing of data objects (names and definitions)
  • Detailed properties of data elements (data type, size, nullability, optionality, indexes)
  • Entity-relationship (ER) and other system-level diagrams
  • Reference data (classification and descriptive domains)
  • Missing data and quality-indicator codes
  • Business rules, such as for validation of a schema or data quality

Data dictionaries are useful for a number of reasons:

  • Assist in avoiding data inconsistencies across a project
  • Help define conventions that are to be used across a project
  • Provide consistency in the collection and use of data across multiple members of a research team
  • Make data easier to analyze
  • Enforce the use of Data Standards
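A first draft of a data dictionary can be generated from the dataset itself and then completed by hand. The sketch below is illustrative only: the function names, the output field names (name, definition, data_type, nullable, max_length), and the choice of "" and "NA" as missing-data codes are all assumptions, not an NTL format. The type guess is deliberately crude, and definitions still have to be written by someone who knows the data.

```python
import csv
import io


def guess_type(values):
    """Guess a simple type label from a list of non-missing string values."""
    def all_parse(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if not values:
        return "unknown"
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "number"
    return "string"


def draft_data_dictionary(csv_text, missing_codes=("", "NA")):
    """Return one draft dictionary entry per column of the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    entries = []
    for name in reader.fieldnames or []:
        values = [row[name] or "" for row in rows]
        present = [v for v in values if v not in missing_codes]
        entries.append({
            "name": name,
            "definition": "",  # to be filled in by the researcher
            "data_type": guess_type(present),
            "nullable": len(present) < len(values),
            "max_length": max((len(v) for v in present), default=0),
        })
    return entries


# Hypothetical sample data with one missing value coded as "NA".
sample = "operator,year,vessels\nAcme Ferries,2016,4\nBay Lines,2016,NA\n"
for entry in draft_data_dictionary(sample):
    print(entry)
```

Reference data, business rules, and the meaning of each missing-data code cannot be inferred from the file and need to be added manually.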

The DCAT-US Metadata Schema (.json) is a machine-readable format that is required by both DOT's open data policy and data.gov.

A researcher should also include any other metadata schemas that are relevant to the data or field of study.

For more information on how to create the DCAT-US metadata file, and for NTL's template, see the next section on this page.
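To give a sense of what a DCAT-US record looks like, here is a minimal sketch that builds one dataset entry and writes it out as JSON. The field names follow our understanding of the fields the DCAT-US v1.1 schema requires for every dataset; federal agencies also supply bureauCode and programCode, which are omitted here. All values are placeholders invented for illustration, and NTL's template may ask for more fields than are shown.

```python
import json

# Fields understood to be required for every dataset in DCAT-US v1.1.
REQUIRED_FIELDS = (
    "title", "description", "keyword", "modified",
    "publisher", "contactPoint", "identifier", "accessLevel",
)

# Placeholder values only; none of these describe a real dataset.
record = {
    "title": "National Census of Ferry Operators 2016",
    "description": "Example description of what the dataset contains.",
    "keyword": ["ferries", "water transportation"],
    "modified": "2017-10-26",
    "publisher": {"@type": "org:Organization", "name": "Example Agency"},
    "contactPoint": {
        "@type": "vcard:Contact",
        "fn": "Example Curator",
        "hasEmail": "mailto:curator@example.gov",
    },
    "identifier": "https://example.gov/datasets/ferry-operators-2016",
    "accessLevel": "public",
}


def missing_fields(candidate):
    """List required fields that are absent or empty in the record."""
    return [field for field in REQUIRED_FIELDS if not candidate.get(field)]


print(missing_fields(record))
print(json.dumps(record, indent=2))
```

Keywords are a natural place to use terms from a controlled vocabulary rather than free text.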

Importance of Metadata

Metadata is an inextricable part of managing records or digital information in any format. The use of metadata supports methods to identify, authenticate, describe, locate, and manage resources in a precise and consistent way. This precision and consistency in turn allows information providers to meet research, business, accountability, and archival requirements. Additionally, ensuring complete metadata, and updating as needed, is critical to maintaining data quality.

Types of Metadata

There are many different kinds of metadata. In the world of digital objects, metadata is usually divided into 3 to 5 categories: 

  • Administrative metadata is an umbrella term referring to the information needed to manage a resource or that relates to its creation. Within the administrative metadata sphere is:
    • Rights metadata: such as a Creative Commons license, which details the intellectual property rights attached to the content.
    • Technical metadata: information about digital files necessary to decode and render them, such as file type.
    • Preservation metadata: supporting the long-term management and future migration or emulation of digital files, for example, a checksum or hash.
  • Descriptive metadata is information about the content of a resource that aids in finding or understanding it. [It can include elements such as title, abstract, author, and keywords.]
  • Structural metadata describes the relationships of parts of resources to one another; examples include pages in a sequence, a table of contents with pointers to the beginnings of milestone sections, and connecting different resolutions or bit depth representations of identical content.

From: Riley, Jenn. Understanding Metadata: What is Metadata, and What is it For?: A Primer. National Information Standards Organization (NISO). 2017. https://www.niso.org/publications/understanding-metadata-2017 . See page 6.
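As an example of preservation metadata, a checksum can be computed for each file in a data package and recorded alongside it. The sketch below uses SHA-256 from Python's standard library; the function name sha256_of_file is our own, and the choice of SHA-256 is an assumption, since this guide does not name a particular algorithm.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 checksum of a file as a hexadecimal string."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large data files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Recomputing the checksum later and comparing it with the recorded value shows whether the file has changed since it was deposited.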

Best Practices for Metadata

  • Gather all information together, especially if multiple people have information that you need.
  • Use information that is already developed.
  • Choose a descriptive title for your data that incorporates who, what, where, when, and scale.
  • Choose keywords wisely: Consider all of the possible interpretations of your word choices and use a thesaurus or domain-specific controlled vocabulary to add descriptive terms you may not have otherwise selected. 
  • Include as many details as you can in the metadata record for future users of the data.
  • Whenever you change your metadata record, update the metadata date (date stamp) so that metadata repositories will know which version of the record is most recent.

Transportation Research Thesaurus (TRT)

The Transportation Research Thesaurus (TRT) is a tool to provide a common and consistent language between producers and users of transportation information. The TRT covers all modes and aspects of transportation. 

The TRT is a controlled vocabulary that NTL uses when cataloging and creating metadata for both data and reports. 

To learn more about the TRT and explore the thesaurus, visit https://trt.trb.org/

The data management plan (DMP) is a knowledge management document for the data lifecycle. The DMP should be a living document that is created prior to the start of the project and updated throughout to accurately reflect and track the research as it progresses through the data lifecycle. For more information, see the DMP-specific page within this LibGuide.

Include code books, scripts used during analysis, auxiliary tables, and any other supporting files that were created or used to collect, process, clean, or analyze the data.

Best Practice Tip: Be as complete as possible. The goals of a robust data package include fully documenting your processes so that a naïve user can replicate your results, and conveying the full context of the data so that users can decide intelligently whether your data meets their reuse needs.

Using DCAT-US Generator

Create Your Own DCAT-US Metadata File

To the right is an NTL-created tool to assist researchers with the creation of the DCAT-US Version 1.1 metadata file, which is a required part of every data package (as mentioned above). Using the tool is simple: fill out the form completely and click "Generate JSON" at the bottom. This will create a complete DCAT-US metadata file that is ready for submission.

For questions or errors when using the tool, contact NTLDataCurator@dot.gov

DCAT-US Generator