
Research Data Management: Data Packages

The Research Data Management Portal is designed to provide guidance, best practices, and resources on the steps of the research data lifecycle and how they correspond to the requirements of established data management practices.

What is in a Data Package?

Components of a Data Package (Dataset, README, Data Dictionary, DCAT-US Metadata, Data Management Plan, Code Book)

Data Package

What is a Data Package?

A “Data Package” is the dataset, the data management plan (DMP), and all other documentation needed to contextualize the dataset for any and all users and re-users.

Types of Data Packages:

  • Submission Information Package (SIP): Items that have been submitted by the depositor.

  • Archival Information Package (AIP): A package that contains data that will be stored within a digital archive.

  • Dissemination Information Package (DIP): A package created from the Archival Information Package (AIP) to distribute digital content to users.

Breaking Down A Data Package

The best data file format for long-term preservation of tabular data is .CSV, because it is:

  • Open, Non-Proprietary File Format
  • Backward and Forward Compatible
  • Long-Term Preservable

Best Practice Tip: In order for data to be machine readable and interoperable, it should have only two types of rows: a single header row, with all other rows containing data.
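
For example, the short Python sketch below writes a .CSV with exactly one header row followed only by data rows. The column names, values, and file name are placeholders for illustration, not part of any NTL specification.

  import csv

  # Illustrative tabular data: one header row, all other rows are data.
  rows = [
      {"station_id": "A01", "date": "2016-07-01", "daily_riders": 1250},
      {"station_id": "A02", "date": "2016-07-01", "daily_riders": 980},
  ]

  with open("ferry_ridership_2016.csv", "w", newline="", encoding="utf-8") as f:
      writer = csv.DictWriter(f, fieldnames=["station_id", "date", "daily_riders"])
      writer.writeheader()    # the single header row
      writer.writerows(rows)  # every remaining row is data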

The purpose of the README document is to provide all of the contextual information that could not be included in the data file itself because doing so would break the data.

  • NTL README Template Contents:
    • General Information
    • Sharing Access Information
    • Data and Related File Overview
    • Methodological Information
    • Data-Specific Information and Data Dictionary
    • Appendices

Best Practice Tip: In order for you and future users to know which of the dozens of “README.txt” files on your desktop go with which datasets, use clarifying, human-readable file names, such as:

bts_osp_national_census_ferry_operators_2016_README_2017_10_26.txt
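
If you generate file names programmatically, a small helper can keep them consistent. The sketch below follows the agency_office_dataset_year_README_date pattern shown above; the function and its inputs are hypothetical, not an NTL tool.

  from datetime import date

  def readme_filename(agency, office, dataset_title, data_year):
      """Build a descriptive README file name (hypothetical helper)."""
      slug = "_".join(dataset_title.lower().split())
      stamp = date.today().strftime("%Y_%m_%d")
      return f"{agency}_{office}_{slug}_{data_year}_README_{stamp}.txt"

  print(readme_filename("bts", "osp", "national census ferry operators", 2016))
  # e.g., bts_osp_national_census_ferry_operators_2016_README_<today's date>.txt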

Data dictionaries store and communicate metadata about data in a database, a system, or data used by applications. Data dictionary contents can vary but typically include some or all of the following:

  • A listing of data objects (names and definitions)
  • Detailed properties of data elements (data type, size, nullability, optionality, indexes)
  • Entity-relationship (ER) and other system-level diagrams
  • Reference data (classification and descriptive domains)
  • Missing data and quality-indicator codes
  • Business rules, such as for validation of a schema or data quality

Data dictionaries are useful for a number of reasons. They:

  • Assist in avoiding data inconsistencies across a project
  • Help define conventions that are to be used across a project
  • Provide consistency in the collection and use of data across multiple members of a research team
  • Make data easier to analyze
  • Enforce the use of Data Standards
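
A simple variable-level data dictionary can be kept as its own .CSV file alongside the dataset. The sketch below is illustrative only; the column layout, variable names, and missing-data codes are examples, not a required NTL format.

  import csv

  # Illustrative variable-level data dictionary, saved as its own CSV.
  entries = [
      {"variable": "station_id", "type": "string",
       "description": "Unique ferry station identifier",
       "allowed_values": "A01-Z99", "missing_code": ""},
      {"variable": "daily_riders", "type": "integer",
       "description": "Passenger boardings per day",
       "allowed_values": ">= 0", "missing_code": "-999"},
  ]

  with open("ferry_ridership_2016_data_dictionary.csv", "w",
            newline="", encoding="utf-8") as f:
      writer = csv.DictWriter(f, fieldnames=list(entries[0].keys()))
      writer.writeheader()
      writer.writerows(entries)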

The DCAT-US Metadata Scheme (.json) is a machine-readable format that is required by both DOT's open data policy and data.gov.
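
As an illustration, the Python sketch below writes a single dataset entry in the general shape of a DCAT-US data.json record. All field values are placeholders, and the exact set of required fields should be confirmed against the current DCAT-US schema and DOT guidance before publishing.

  import json

  # Placeholder DCAT-US-style dataset entry; verify required fields
  # against the current DCAT-US schema before publishing.
  dataset = {
      "@type": "dcat:Dataset",
      "title": "National Census of Ferry Operators, 2016",
      "description": "Operational and ridership data from U.S. ferry operators.",
      "keyword": ["ferries", "ridership", "water transportation"],
      "modified": "2017-10-26",
      "publisher": {"@type": "org:Organization",
                    "name": "Bureau of Transportation Statistics"},
      "contactPoint": {"@type": "vcard:Contact", "fn": "Data Steward",
                       "hasEmail": "mailto:data.steward@example.gov"},
      "identifier": "bts-ferry-operators-2016",
      "accessLevel": "public",
  }

  with open("data.json", "w", encoding="utf-8") as f:
      json.dump({"dataset": [dataset]}, f, indent=2)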

A researcher should also include any other metadata schemes that are relevant to the data or field of study.


Importance of Metadata

Metadata is an inextricable part of managing records or digital information in any format. The use of metadata supports methods to identify, authenticate, describe, locate, and manage resources in a precise and consistent way. This precision and consistency in turn allows information providers to meet research, business, accountability, and archival requirements. Additionally, ensuring complete metadata, and updating as needed, is critical to maintaining data quality.

Types of Metadata

There are many different kinds of metadata. In the world of digital objects, metadata is usually divided into 3 to 5 categories: 

  • Administrative metadata, including:
    • Rights metadata (i.e., intellectual property rights and use information)
    • Technical metadata (i.e., technical details about the object and its instantiation, such as its file format, file size, and how to open, access, and use it)
    • Preservation metadata (i.e., a log of the series of actions taken against an object in order to ensure its longevity and viability)
  • Descriptive metadata describes a resource, its content, its identifying characteristics and its "aboutness"
  • Structural metadata describes how the pieces of a single object fit together and how an object exists in relationship to other objects

Best Practices for Metadata

  • Gather all information together, especially if multiple people have information that you need.
  • Use information that is already developed.
  • Choose a descriptive title for your data that incorporates who, what, where, when, and scale.
  • Choose keywords wisely: Consider all of the possible interpretations of your word choices and use a thesaurus or domain-specific controlled vocabulary to add descriptive terms you may not have otherwise selected. 
  • Include as many details as you can in the metadata record for future users of the data.
  • Whenever you change your metadata record, update the metadata date (date stamp) so that metadata repositories will know which version of the record is most recent.
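
On the last point, a metadata date stamp can be refreshed automatically whenever the record is edited. The sketch below assumes the placeholder data.json layout from the DCAT-US example above.

  import json
  from datetime import date

  # Refresh the "modified" date stamp on every record in the catalog
  # so repositories can tell which version is most recent.
  with open("data.json", "r", encoding="utf-8") as f:
      catalog = json.load(f)

  for record in catalog.get("dataset", []):
      record["modified"] = date.today().isoformat()

  with open("data.json", "w", encoding="utf-8") as f:
      json.dump(catalog, f, indent=2)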

Transportation Research Thesaurus (TRT)

The Transportation Research Thesaurus (TRT) is a tool to provide a common and consistent language between producers and users of transportation information. The TRT covers all modes and aspects of transportation. 

The TRT is a controlled vocabulary that NTL uses when cataloging and creating metadata for both data and reports. 

To learn more about the TRT and to explore the thesaurus, visit https://trt.trb.org/

The Data Management Plan (DMP) is a knowledge management document for the data lifecycle. The DMP should be a living document that is created prior to the start of the project and updated throughout to accurately reflect and track the research as it progresses through the data lifecycle. For more information, see the DMP-specific page within this LibGuide.

Include code books, scripts used during analysis, auxiliary tables, and any other supporting files that were created or used to collect, process, clean, or analyze the data.

Best Practice Tip: Be as complete as possible. The goals of a robust data package include fully documenting your processes so that a naïve user can replicate your results and understand the full context of the data, allowing them to decide intelligently whether your data meets their reuse needs.
