
Research Data Management: Data Packages

The Research Data Management Portal is designed to provide guidance, best practices, and resources on the steps of the research data lifecycle and how they correspond to the requirements of established data management practices.

What is in a Data Package?

Components of a Data Package (Dataset, README, Data Dictionary, DCAT-US Metadata, Data Management Plan, Code Book)

Data Package

What is a Data Package?

A “Data Package” is the dataset, the data management plan (DMP), and all other documentation needed to contextualize the dataset for any and all users and re-users.

Types of Data Packages:

  • Submission Information Package (SIP): Items that have been submitted by the depositor.

  • Archival Information Package (AIP): A package that contains data that will be stored within a digital archive.

  • Dissemination Information Package (DIP): A package created from the Archival Information Package (AIP) to distribute digital content to users.

Breaking Down A Data Package

The best data file format for long-term preservation of tabular data is .CSV, because it is:

  • Open, Non-Proprietary File Format
  • Backward and Forward Compatible
  • Long-Term Preservable

Best Practice Tip: In order for data to be machine readable and interoperable, it should have only two types of rows: a single header row, with all other rows containing data.
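
For example, the short Python sketch below writes a .CSV with exactly one header row followed only by data rows. The column names, values, and file name are placeholders for illustration, not part of any NTL specification.

  import csv

  # Illustrative tabular data: one header row, all other rows are data.
  rows = [
      {"station_id": "A01", "date": "2016-07-01", "daily_riders": 1250},
      {"station_id": "A02", "date": "2016-07-01", "daily_riders": 980},
  ]

  with open("ferry_ridership_2016.csv", "w", newline="", encoding="utf-8") as f:
      writer = csv.DictWriter(f, fieldnames=["station_id", "date", "daily_riders"])
      writer.writeheader()    # the single header row
      writer.writerows(rows)  # every remaining row is data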

The purpose of the README document is to provide all of the contextual information that could not be included in the data file itself because doing so would break the data.

  • NTL README Template Contents:
    • General Information
    • Sharing Access Information
    • Data and Related File Overview
    • Methodological Information
    • Data-Specific Information and Data Dictionary
    • Appendices

Best Practice Tip: In order for you and future users to know which of the dozens of “README.txt” files on your desktop go with which datasets, use clarifying, human-readable file names, such as:

bts_osp_national_census_ferry_operators_2016_README_2017_10_26.txt
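
If you generate file names programmatically, a small helper can keep them consistent. The sketch below follows the agency_office_dataset_year_README_date pattern shown above; the function and its inputs are hypothetical, not an NTL tool.

  from datetime import date

  def readme_filename(agency, office, dataset_title, data_year):
      """Build a descriptive README file name (hypothetical helper)."""
      slug = "_".join(dataset_title.lower().split())
      stamp = date.today().strftime("%Y_%m_%d")
      return f"{agency}_{office}_{slug}_{data_year}_README_{stamp}.txt"

  print(readme_filename("bts", "osp", "national census ferry operators", 2016))
  # e.g., bts_osp_national_census_ferry_operators_2016_README_<today's date>.txt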

Data dictionaries store and communicate metadata about data in a database, a system, or data used by applications. Data dictionary contents can vary but typically include some or all of the following:

  • A listing of data objects (names and definitions)
  • Detailed properties of data elements (data type, size, nullability, optionality, indexes)
  • Entity-relationship (ER) and other system-level diagrams
  • Reference data (classification and descriptive domains)
  • Missing data and quality-indicator codes
  • Business rules, such as for validation of a schema or data quality

Data dictionaries are useful for a number of reasons. They:

  • Assist in avoiding data inconsistencies across a project
  • Help define conventions that are to be used across a project
  • Provide consistency in the collection and use of data across multiple members of a research team
  • Make data easier to analyze
  • Enforce the use of Data Standards
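
A simple variable-level data dictionary can be kept as its own .CSV file alongside the dataset. The sketch below is illustrative only; the column layout, variable names, and missing-data codes are examples, not a required NTL format.

  import csv

  # Illustrative variable-level data dictionary, saved as its own CSV.
  entries = [
      {"variable": "station_id", "type": "string",
       "description": "Unique ferry station identifier",
       "allowed_values": "A01-Z99", "missing_code": ""},
      {"variable": "daily_riders", "type": "integer",
       "description": "Passenger boardings per day",
       "allowed_values": ">= 0", "missing_code": "-999"},
  ]

  with open("ferry_ridership_2016_data_dictionary.csv", "w",
            newline="", encoding="utf-8") as f:
      writer = csv.DictWriter(f, fieldnames=list(entries[0].keys()))
      writer.writeheader()
      writer.writerows(entries)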

The DCAT-US Metadata Scheme (.json) is a machine-readable format that is required by both DOT's open data policy and data.gov.
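
As an illustration, the Python sketch below writes a single dataset entry in the general shape of a DCAT-US data.json record. All field values are placeholders, and the exact set of required fields should be confirmed against the current DCAT-US schema and DOT guidance before publishing.

  import json

  # Placeholder DCAT-US-style dataset entry; verify required fields
  # against the current DCAT-US schema before publishing.
  dataset = {
      "@type": "dcat:Dataset",
      "title": "National Census of Ferry Operators, 2016",
      "description": "Operational and ridership data from U.S. ferry operators.",
      "keyword": ["ferries", "ridership", "water transportation"],
      "modified": "2017-10-26",
      "publisher": {"@type": "org:Organization",
                    "name": "Bureau of Transportation Statistics"},
      "contactPoint": {"@type": "vcard:Contact", "fn": "Data Steward",
                       "hasEmail": "mailto:data.steward@example.gov"},
      "identifier": "bts-ferry-operators-2016",
      "accessLevel": "public",
  }

  with open("data.json", "w", encoding="utf-8") as f:
      json.dump({"dataset": [dataset]}, f, indent=2)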

A researcher should also include any other metadata schemes that are relevant to the data or field of study.


Importance of Metadata

Metadata is an inextricable part of managing records or digital information in any format. The use of metadata supports methods to identify, authenticate, describe, locate, and manage resources in a precise and consistent way. This precision and consistency in turn allows information providers to meet research, business, accountability, and archival requirements. Additionally, ensuring complete metadata, and updating as needed, is critical to maintaining data quality.

Types of Metadata

There are many different kinds of metadata. In the world of digital objects, metadata is usually divided into 3 to 5 categories: 

  • Administrative metadata, including:
    • Rights metadata (i.e., intellectual property rights and use information)
    • Technical metadata (i.e., technical details about the object and its instantiation, such as its file format, file size, and how to open, access, and use it)
    • Preservation metadata (i.e., a log of the series of actions taken against an object in order to ensure its longevity and viability)
  • Descriptive metadata describes a resource, its content, its identifying characteristics and its "aboutness"
  • Structural metadata describes how the pieces of a single object fit together and how an object exists in relationship to other objects

Best Practices for Metadata

  • Gather all information together, especially if multiple people have information that you need.
  • Use information that is already developed.
  • Choose a descriptive title for your data that incorporates who, what, where, when, and scale.
  • Choose keywords wisely: Consider all of the possible interpretations of your word choices and use a thesaurus or domain-specific controlled vocabulary to add descriptive terms you may not have otherwise selected. 
  • Include as many details as you can in the metadata record for future users of the data.
  • Whenever you change your metadata record, update the metadata date (date stamp) so that metadata repositories will know which version of the record is most recent.
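
On the last point, a metadata date stamp can be refreshed automatically whenever the record is edited. The sketch below assumes the placeholder data.json layout from the DCAT-US example above.

  import json
  from datetime import date

  # Refresh the "modified" date stamp on every record in the catalog
  # so repositories can tell which version is most recent.
  with open("data.json", "r", encoding="utf-8") as f:
      catalog = json.load(f)

  for record in catalog.get("dataset", []):
      record["modified"] = date.today().isoformat()

  with open("data.json", "w", encoding="utf-8") as f:
      json.dump(catalog, f, indent=2)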

Transportation Research Thesaurus (TRT)

The Transportation Research Thesaurus (TRT) is a tool to provide a common and consistent language between producers and users of transportation information. The TRT covers all modes and aspects of transportation. 

The TRT is a controlled vocabulary that NTL uses when cataloging and creating metadata for both data and reports. 

To learn more about the TRT and to explore the thesaurus, visit https://trt.trb.org/

The Data Management Plan (DMP) is a knowledge management document for the data lifecycle. The DMP should be a living document that is created prior to the start of the project and updated throughout to accurately reflect and track the research as it progresses through the data lifecycle. For more information, see the DMP-specific page within this LibGuide.

Include code books, scripts used during analysis, auxiliary tables, and any other supporting files that were created or used to collect, process, clean, or analyze the data.

Best Practice Tip: Be as complete as possible. The goals of a robust data package include fully documenting your processes so that a naïve user can replicate your results and understand the full context of the data, allowing them to decide intelligently whether your data meets their reuse needs.
