
Product  · 6 min read

Data Types and Formats in Observational Research

Understanding Data Storage Formats for Diverse Sensor Modalities and Other Data Sources in Observational Research.

Data Types & Formats: The Raw Material of Observational Research

The diverse array of sensors and modalities employed in an observational research lab generates a wide variety of data types, each with its own characteristics and optimal storage formats. Understanding these formats is crucial for efficient data management, processing, and analysis. The choice of format impacts file size, ease of parsing, and compatibility with various analytical tools.

Video Data (mp4): Visual Records of Behavior

Video data, typically stored in formats like MP4, forms the primary visual record of observed behaviors. MP4 (MPEG-4 Part 14) is a widely used container format that can store video, audio, and other data. Its popularity stems from broad playback support and from the efficient codecs it usually carries (such as H.264), which allow relatively small file sizes while maintaining good visual quality. However, even compressed video files can be very large, especially at high resolutions and frame rates, so they require significant storage capacity. Key considerations for video data include:

  • Resolution: The number of pixels (e.g., 1920x1080 for Full HD) directly impacts visual detail and file size.
  • Frame Rate (fps): The number of frames per second (e.g., 30 fps, 60 fps, 120 fps) determines the temporal resolution of the video, crucial for capturing rapid movements.
  • Compression Codec: The algorithm used to compress the video (e.g., H.264, H.265) affects file size and playback compatibility.
  • Metadata: Information embedded within the video file, such as recording date, time, camera settings, and synchronized event markers.
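To make the effect of resolution, frame rate, and compression on storage more concrete, the following minimal sketch estimates file sizes from first principles. The function names and the 8 Mbit/s bitrate are illustrative assumptions, not values from a specific camera or codec setting; real encoded sizes depend heavily on content and encoder configuration.

```python
def raw_video_bytes(width, height, fps, seconds, bytes_per_pixel=3):
    """Uncompressed size: every pixel of every frame is stored (3 bytes = 8-bit RGB)."""
    return width * height * bytes_per_pixel * fps * seconds


def encoded_video_bytes(bitrate_mbps, seconds):
    """Compressed size from an average bitrate given in megabits per second."""
    return int(bitrate_mbps * 1_000_000 / 8 * seconds)


one_hour = 3600  # seconds

# One hour of Full HD at 30 fps, uncompressed.
raw = raw_video_bytes(1920, 1080, 30, one_hour)

# The same hour at an assumed average bitrate of 8 Mbit/s (hypothetical example value).
enc = encoded_video_bytes(8, one_hour)

print(f"raw:     {raw / 1e9:.1f} GB")
print(f"encoded: {enc / 1e9:.1f} GB")
```

Under these assumptions, one hour of uncompressed Full HD video needs roughly 672 GB, while the encoded version needs about 3.6 GB. Doubling the frame rate or the pixel count doubles the raw figure, which is why codec choice matters so much for multi-camera labs.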

Audio Data (mp3, wav): The Soundscape of Interaction

Audio Data, often captured alongside video or independently, provides the auditory context of an observational study. Common formats include MP3 and WAV.

  • WAV (Waveform Audio File Format): A standard for uncompressed audio, offering high fidelity but resulting in very large file sizes. WAV files are ideal for applications where audio quality is paramount and subsequent detailed analysis (e.g., voice analysis, acoustic event detection) is required.
  • MP3 (MPEG-1 Audio Layer 3): A popular compressed audio format that significantly reduces file size by discarding some audio information deemed imperceptible to the human ear. While efficient for storage and distribution, MP3 is generally not preferred for raw research data where subtle acoustic details might be important for analysis.

Audio data is critical for analyzing verbal communication, vocalizations, and environmental sounds, providing complementary information to visual observations.
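The size of uncompressed WAV audio follows directly from its sample rate, bit depth, and channel count. The sketch below uses Python's standard `wave` module to write one second of silence into memory and read the parameters back. The chosen settings (44.1 kHz, 16-bit, mono) are assumptions for illustration only.

```python
import io
import wave


def wav_payload_bytes(sample_rate, bits_per_sample, channels, seconds):
    """Size of the raw PCM samples, excluding the small file header."""
    return sample_rate * (bits_per_sample // 8) * channels * seconds


# Write one second of 16-bit mono silence into an in-memory buffer.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as writer:
    writer.setnchannels(1)       # mono
    writer.setsampwidth(2)       # 2 bytes = 16 bits per sample
    writer.setframerate(44100)   # samples per second
    writer.writeframes(b"\x00\x00" * 44100)

# Read the parameters back, as one would for a recorded file.
buffer.seek(0)
with wave.open(buffer, "rb") as reader:
    n_frames = reader.getnframes()
    duration = n_frames / reader.getframerate()
    payload = n_frames * reader.getsampwidth() * reader.getnchannels()

print(f"duration: {duration:.1f} s, payload: {payload} bytes")
```

By the same arithmetic, one hour of 16-bit stereo audio at 44.1 kHz comes to about 635 MB of sample data, which illustrates why WAV archives grow quickly compared with MP3.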

Sensor and Event Data (CSV, EDF, HDF5): Structured Numerical Streams, Versatile for Event Logs and Annotations

Raw numerical data from biophysical sensors, eye trackers, and other digital sensors are typically stored in structured formats that facilitate programmatic access and analysis. Common formats include CSV, EDF, and HDF5.

  • CSV (Comma Separated Values): A simple, plain-text format where data values are separated by commas (or other delimiters like tabs). CSV files are human-readable and compatible with almost all data analysis software and programming languages. Beyond raw sensor data, comma- or tab-delimited files are frequently used for storing event logs, behavioral annotations, and other structured metadata. Their simplicity and widespread compatibility make them an excellent choice for exporting coded behaviors from video analysis software, storing experimental parameters, or logging system events. Each row typically represents an event or a data point, with columns representing different attributes (e.g., timestamp, event type, duration, participant ID). CSV files are suitable for small to medium-sized datasets, but become inefficient for very large, complex, or hierarchical data.

  • TSV (Tab Separated Values): Although the term CSV has become established, CSV files are often problematic because the separator actually used is not clear from the file name.

    • In countries that write decimal numbers with a period, a comma is usually used as the separator. Decimal numbers can then be stored without being confused with the separator. Problems arise when thousands separators are stored as well, because in these countries the thousands separator is often also a “,”, which corrupts the column structure.
    • In countries that write decimal numbers with a comma, a semicolon is used as the data separator instead. Unfortunately, such files are also called CSV, which regularly leads to confusion and technical problems.
    • It is therefore recommended to use the tab character as the separator and to give the file the extension “.tsv” (Tab Separated Values).

CSV: Comma, Semicolon, or Tab?

The separator in CSV files is:
Comma “,”: In countries that separate decimal numbers with “.” (e.g., 123.456).
Semicolon “;”: In countries that separate decimal numbers with “,” (e.g., 123,456).
It therefore makes sense in an international research context to use the tab character (character code 9) as a separator.
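The separator problem and the tab-based alternative can be sketched with Python's standard `csv` module. The column names and values below are made up for illustration; the point is only how the delimiter interacts with commas inside the data.

```python
import csv
import io

# A semicolon-separated file with decimal commas, read with the default
# comma delimiter: the value "12,5" is split into two fields.
ambiguous = "time;value\n1;12,5\n"
wrong = list(csv.reader(io.StringIO(ambiguous)))
print(wrong)

# Hypothetical annotation rows; one value contains a comma.
rows = [
    {"timestamp_s": "12.50", "event": "gaze, left", "participant": "P01"},
    {"timestamp_s": "13.75", "event": "speech", "participant": "P01"},
]

# Write the rows tab-separated (character code 9).
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["timestamp_s", "event", "participant"],
    delimiter="\t",
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(rows)
tsv_text = buffer.getvalue()

# Read the same text back with the same delimiter.
parsed = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
print(parsed == rows)
```

Because tabs rarely occur inside data values, commas in numbers or free-text annotations pass through unchanged, and the reader only needs to know one separator regardless of locale.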

  • EDF (European Data Format): A standard file format for physiological time series data, particularly common in EEG and polysomnography. EDF files are designed to store multi-channel physiological recordings with precise timing information and metadata. They are robust and widely supported by specialized physiological data analysis software.

  • HDF5 (Hierarchical Data Format 5): A powerful and flexible file format designed to store and organize large amounts of heterogeneous data. HDF5 can store numerical datasets, images, and other data types in a hierarchical structure, making it ideal for complex multimodal datasets. It supports compression, parallel I/O, and is widely used in scientific computing for its efficiency and scalability.
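To illustrate how EDF keeps timing information and metadata alongside the signals, the sketch below parses the fixed 256-byte ASCII block at the start of an EDF header. The field widths follow the published EDF specification as commonly documented and should be checked against it before relying on them. The header content here is synthetic, and the function and variable names are assumptions; for real recordings a maintained EDF reader is the safer choice.

```python
# Fixed-width ASCII fields at the start of an EDF file (name, width in bytes).
EDF_FIXED_FIELDS = [
    ("version", 8),
    ("patient_id", 80),
    ("recording_id", 80),
    ("start_date", 8),         # dd.mm.yy
    ("start_time", 8),         # hh.mm.ss
    ("header_bytes", 8),
    ("reserved", 44),
    ("n_records", 8),
    ("record_duration_s", 8),
    ("n_signals", 4),
]


def parse_edf_fixed_header(raw):
    """Split the first 256 bytes of an EDF file into named fields."""
    if len(raw) < 256:
        raise ValueError("EDF fixed header must be at least 256 bytes")
    info, offset = {}, 0
    for name, width in EDF_FIXED_FIELDS:
        info[name] = raw[offset:offset + width].decode("ascii").strip()
        offset += width
    for key in ("header_bytes", "n_records", "n_signals"):
        info[key] = int(info[key])
    info["record_duration_s"] = float(info["record_duration_s"])
    return info


def field(value, width):
    """Left-justify a value and pad it with spaces to a fixed width."""
    return str(value).ljust(width).encode("ascii")


# Synthetic header for a two-channel, 60-second recording (illustrative only).
header = (
    field("0", 8) + field("X X X X", 80) + field("Startdate X X X X", 80)
    + field("01.01.24", 8) + field("10.30.00", 8) + field(256 + 2 * 256, 8)
    + field("", 44) + field(60, 8) + field(1, 8) + field(2, 4)
)

info = parse_edf_fixed_header(header)
total_seconds = info["n_records"] * info["record_duration_s"]
print(info["n_signals"], "signals,", total_seconds, "seconds")
```

The fixed block is followed by a per-signal section (labels, units, sampling details) and then the data records themselves, which this sketch does not cover.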

Proprietary Binary Formats: Device-Specific Data

Many specialized research instruments (e.g., some eye trackers, high-end EEG systems) store their raw data in Proprietary Binary Formats. These formats are optimized for the specific device and its software, often offering high efficiency and preserving unique data characteristics. However, they typically require the manufacturer’s software or specific SDKs (Software Development Kits) for reading and processing, which can limit interoperability and long-term accessibility. Researchers often need to convert these proprietary formats into more open, standardized formats for broader analysis and sharing.

Event Logs: The Timeline of Experimentation

Event logs are critical for reconstructing the timeline of an experiment. They typically record discrete events, their timestamps, and associated metadata. Events can include stimulus presentations, participant responses, system errors, or manual annotations made by researchers. Well-structured event logs are essential for synchronizing different data streams, segmenting data into trials or conditions, and performing event-related analyses. They serve as the definitive record of what happened during an experimental session, providing the temporal anchors for all other data modalities. Ideally, event logs should be captured as digital files rather than handwritten notes, so that they can be processed further as text or CSV/TSV files.
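As a minimal sketch of how an event log can segment another data stream into trials, the example below reads a tab-separated log and finds the sensor samples that fall inside each trial window. The column names, event labels, and sample times are hypothetical, and the sample timestamps are assumed to be sorted and on the same clock as the log.

```python
import csv
import io
from bisect import bisect_left

# Hypothetical tab-separated event log with one row per event.
event_log = (
    "timestamp_s\tevent\ttrial\n"
    "1.0\tstimulus_on\t1\n"
    "2.0\tstimulus_off\t1\n"
    "3.0\tstimulus_on\t2\n"
    "4.5\tstimulus_off\t2\n"
)

# Hypothetical sensor sample times: one sample every 0.5 s from 0.0 to 5.0.
sample_times = [i * 0.5 for i in range(11)]


def load_trials(text):
    """Return {trial: (start_s, end_s)} from stimulus_on/stimulus_off rows."""
    events = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        per_trial = events.setdefault(int(row["trial"]), {})
        per_trial[row["event"]] = float(row["timestamp_s"])
    return {
        trial: (e["stimulus_on"], e["stimulus_off"])
        for trial, e in events.items()
    }


def samples_in_window(times, start, end):
    """Indices of samples with start <= t < end, found by binary search."""
    return list(range(bisect_left(times, start), bisect_left(times, end)))


trials = load_trials(event_log)
segments = {
    trial: samples_in_window(sample_times, start, end)
    for trial, (start, end) in trials.items()
}
print(segments)
```

The half-open window (start included, end excluded) keeps adjacent trials from sharing a sample; whether that convention suits a given study is a design decision.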

Mangold Observation Labs

Mangold Observation Labs are a comprehensive turnkey solution for conducting behavioral research and observation.
