
Managing Research Data for Public Access: 2. During the Research Project


Planning for public access throughout your research will help ensure that the data archiving process goes smoothly at the end. Using your data management plan as your guide, choose file formats and naming conventions that will make it easy to organize your data and share it with others. As you begin work, take the time to document and describe your data parameters, and use consistent formatting throughout your files. Finally, keep your data safe with a backup system. As you work:

  • Step 1: Use stable file formats.
  • Step 2: Plan file names. 
  • Step 3: Describe the data.
  • Step 4: Organize data consistently.
  • Step 5: Perform quality assurance.
  • Step 6: Preserve raw data.
  • Step 7: Back up and protect your data.

Step-by-Step Guide

Step 1: Use Stable File Formats

Using platform-independent and nonproprietary formats whenever practical will maximize the future utility of your data. Use text (ASCII) file formats for tabular data, such as .txt or .csv (comma-separated values) formats. 

Some preferred file formats for different content types include:

  • Containers: TAR, GZIP, ZIP
  • Databases: XML, CSV
  • Geospatial: SHP, DBF, GeoTIFF, NetCDF
  • Moving images: MOV, MPEG, AVI, MXF
  • Sounds: WAVE, AIFF, MP3, MXF
  • Statistics: ASCII, DTA, POR, SAS, SAV
  • Still images: TIFF, JPEG 2000, PDF, PNG, GIF, BMP
  • Tabular data: CSV
  • Text: XML, PDF/A, HTML, ASCII, UTF-8
  • Web archive: WARC
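To illustrate the tabular-data recommendation above, here is a minimal sketch of writing data directly to plain-text CSV using only the Python standard library. The file name and field names are hypothetical examples, not part of the guide:

```python
import csv

# Hypothetical measurements; in practice these would come from your
# instrument or analysis pipeline.
rows = [
    {"site": "A1", "date": "20240115", "temp_c": 4.2},
    {"site": "A2", "date": "20240115", "temp_c": 3.8},
]

# Write the table as plain-text CSV: a stable, platform-independent,
# nonproprietary format that any tool can read decades from now.
with open("site_temperatures.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["site", "date", "temp_c"])
    writer.writeheader()
    writer.writerows(rows)
```

Saving to CSV from the start (rather than exporting from a proprietary format later) avoids conversion surprises at archiving time.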

Sources: Stanford University Libraries; National Transportation Library; USGS. Image credit: USGS.

Step 2: Plan File Names

Develop naming conventions and a folder hierarchy structure early. File names should:

  • Be unique.
  • Use logical and efficient naming conventions.
  • Reflect file contents using keywords such as location, variables and conditions.
  • Be between 25 and 60 characters.
  • Avoid using capitals, special characters and spaces.
  • Use YYYYMMDD date format.
  • Use underscores between components.
  • Differentiate raw data from other files.

Sample file names:

  • YYYYMMDD_location_vehicle_count_raw.xlsx
  • bigfoot_agro_2000_gpp.tiff
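The conventions above can be encoded once in a small helper so every file name comes out the same way. This is an illustrative sketch, not part of the source guide; the function name and components are assumptions:

```python
import datetime
import re

def make_file_name(date, location, description, is_raw, ext):
    """Compose a file name following the conventions above:
    YYYYMMDD date first, lowercase, underscores between components,
    no spaces or special characters, raw data flagged explicitly."""
    stamp = date.strftime("%Y%m%d")
    parts = [stamp, location, description] + (["raw"] if is_raw else [])
    name = "_".join(parts).lower().replace(" ", "_")
    # Strip anything that is not a lowercase letter, digit, or underscore.
    name = re.sub(r"[^a-z0-9_]", "", name)
    return f"{name}.{ext}"

# e.g. make_file_name(datetime.date(2024, 1, 15),
#                     "Lake Erie", "vehicle count", True, "csv")
# yields "20240115_lake_erie_vehicle_count_raw.csv"
```

Generating names programmatically keeps them unique and consistent even when many collaborators create files.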

Learn more: Data Best Practices and Case Studies, Stanford University Libraries, last updated November 2021.


Sources: Stanford University Libraries; University of California, Davis; USGS. Image credit: USGS.

Step 3: Describe the Data

Create data documentation (such as a parameter table) as you begin work rather than waiting until your project is complete. 

  • Use commonly accepted parameter names, descriptions and units.
  • Be consistent.
  • Explicitly state units.
  • Choose a format for each parameter, explain the format in the metadata, and use that format throughout the file.
  • Use standard data formats (for example, ISO standard date format—YYYYMMDD).
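A parameter table like the one described above can itself be kept as a small CSV alongside the data. The parameter names, descriptions, and units below are hypothetical examples of the structure, not values from the guide:

```python
import csv

# Hypothetical parameter table: one row per column in the data file,
# with commonly accepted names, explicit units, and a stated format.
parameters = [
    {"name": "site_id",  "description": "Monitoring site identifier",
     "units": "none", "format": "text"},
    {"name": "obs_date", "description": "Observation date",
     "units": "none", "format": "YYYYMMDD (ISO standard date)"},
    {"name": "temp_c",   "description": "Air temperature",
     "units": "degrees Celsius", "format": "decimal, one place"},
]

with open("parameter_table.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "description", "units", "format"])
    writer.writeheader()
    writer.writerows(parameters)
```

Writing the documentation as you define each parameter, rather than reconstructing it later, keeps descriptions and units accurate.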

Sample Parameter Table (USGS)


Source: USGS. Image credit: USGS.

Step 4: Organize Data Consistently

Keep data organization consistent throughout your files.

  • Don't change or rearrange columns.
  • Include header rows; column headings should describe the content of each column.
  • In the first row, list the file name, dataset title, author, date, and names of companion files.

Spreadsheet best practices

  • One data type per cell.
  • One data point per cell.
  • Use clear variable names.
  • Use data validation—predefined, consistent categories.
  • Avoid using formatting that may not be maintained when a spreadsheet is converted to a .csv file. Don't use:
    • Commas or special characters (@, %, ^, etc.)
    • Colored text or cell shading.
    • Embedded comments.
  • Avoid empty cells, rows or columns. If there is no data for a cell, indicate why.
  • Avoid merged cells, missing headers, or multiple header rows.

An application like OpenRefine (formerly Google Refine) can help you locate and clean up inconsistent data.
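Simple consistency checks can also be scripted. The sketch below flags empty cells and values outside a predefined category list, in the spirit of the data-validation advice above; the column names and categories are assumptions for illustration:

```python
import csv

# Predefined, consistent categories for a hypothetical "status" column.
VALID_STATUS = {"active", "inactive", "retired"}

def check_rows(path):
    """Return (row number, column, problem) tuples for empty cells
    and values outside the predefined categories."""
    problems = []
    with open(path, newline="", encoding="utf-8") as f:
        # Data starts on row 2; row 1 is the header row.
        for i, row in enumerate(csv.DictReader(f), start=2):
            for col, value in row.items():
                if value is None or value.strip() == "":
                    problems.append((i, col, "empty cell"))
            if row.get("status") and row["status"] not in VALID_STATUS:
                problems.append((i, "status",
                                 f"unknown category {row['status']!r}"))
    return problems
```

A check like this catches the inconsistencies (stray capitalization, blank cells) that OpenRefine is designed to clean up, before they accumulate.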

Learn more: Manage Spreadsheets, Stanford University Libraries, last updated November 2021.


Sources: Stanford University Libraries; University of California, Davis; USGS. Image credit: USGS.

Step 5: Perform Quality Assurance

To ensure data integrity, perform frequent checks on your data to identify any errors.

  • Ensure data are properly delimited and line up in the correct columns.
  • Check for missing values in key parameters.
  • Scan for impossible and anomalous values.
  • Perform and review statistical summaries.
  • Map location data and assess any errors.
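Several of these checks can be automated. The sketch below scans a hypothetical temperature series for missing and impossible values and computes a statistical summary; the variable, sentinel value, and plausibility bounds are illustrative assumptions:

```python
import statistics

# Hypothetical readings in degrees Celsius; None marks a missing value,
# and -999.0 is a common sentinel that should be caught as impossible.
temps = [4.2, 3.8, None, 21.5, -999.0, 5.1]

PLAUSIBLE = (-60.0, 60.0)  # assumed physical bounds for this variable

missing = [i for i, t in enumerate(temps) if t is None]
anomalous = [i for i, t in enumerate(temps)
             if t is not None and not (PLAUSIBLE[0] <= t <= PLAUSIBLE[1])]
clean = [t for t in temps
         if t is not None and PLAUSIBLE[0] <= t <= PLAUSIBLE[1]]

print("missing rows:", missing)
print("anomalous rows:", anomalous)
print("mean:", statistics.mean(clean), "stdev:", statistics.stdev(clean))
```

Reviewing the summary statistics (and, for spatial data, a quick map of coordinates) makes outliers and unit mistakes easy to spot early.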


Source: USGS. Image credit: USGS.

Step 6: Preserve Raw Data ("Keep Raw Data Raw")

To preserve your data and its integrity, save a read-only copy of your raw data files with no transformations, interpolation or analyses. Use a scripted language such as R, SAS or MATLAB to process data in a separate file (located in a separate directory). These scripts:

  • Serve as a record of data processing. 
  • Can be easily and quickly revised and rerun in the event of data loss or requests for edits.
  • Allow future researchers to follow up or reproduce your processing.
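The same pattern works in Python: mark the raw file read-only, and keep every transformation in a script that writes its output to a separate directory. All paths and the unit conversion below are hypothetical; the demo creates its own raw file so it runs self-contained:

```python
import csv
import os
import stat

RAW = "raw/20240115_site_temps_raw.csv"          # hypothetical raw file
OUT = "processed/20240115_site_temps_degf.csv"   # derived file, separate directory

os.makedirs("raw", exist_ok=True)
os.makedirs("processed", exist_ok=True)

# Demo only: create a raw file so the sketch is self-contained.
with open(RAW, "w", newline="", encoding="utf-8") as f:
    f.write("site,temp_c\nA1,4.2\nA2,3.8\n")

# Keep raw data raw: make the file read-only so it cannot be
# modified by accident.
os.chmod(RAW, stat.S_IREAD | stat.S_IRGRP | stat.S_IROTH)

# All transformations live in this script; rerunning it regenerates
# OUT from RAW, serving as a record of the processing.
with open(RAW, newline="", encoding="utf-8") as src, \
     open(OUT, "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=["site", "temp_f"])
    writer.writeheader()
    for row in reader:
        temp_f = round(float(row["temp_c"]) * 9 / 5 + 32, 1)
        writer.writerow({"site": row["site"], "temp_f": temp_f})
```

Because the derived file is never edited by hand, it can always be regenerated from the protected raw copy.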

Source: USGS

Step 7: Back Up and Protect Your Data

As you work, create backup copies of your data often.

  • For the best protection from loss, create three copies of each file: the original, an on-site (external) backup, and an off-site backup (such as via cloud services).
  • Choose a backup frequency based on need and risk.

To ensure that you can recover from a data loss, periodically test your ability to recover your data.
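One way to make that recovery test concrete is to verify each backup copy against a checksum of the original. This is an illustrative sketch (the function names and paths are assumptions), not a replacement for an institutional backup system:

```python
import hashlib
import shutil
from pathlib import Path

def sha256(path):
    """Checksum used to confirm a backup matches the original."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def backup(src, dest_dir):
    """Copy src into dest_dir and verify the copy is bit-for-bit
    identical before trusting it."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / Path(src).name
    shutil.copy2(src, dest)  # copy2 also preserves file timestamps
    if sha256(src) != sha256(dest):
        raise IOError(f"backup of {src} failed verification")
    return dest
```

Running a verification like this on a schedule catches silently corrupted backups before you need them.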

Check with the Information Technology (IT) department in your organization for advice on the best backup systems for your needs.


Sources: Iowa State University Library; Stanford University Libraries; USGS