Data Types and Formats in Observational Research

A practical guide to data formats in observational research: MP4 video, WAV/MP3 audio, CSV/TSV event logs, EDF physiological data, and HDF5 - what each format is best suited for and why TSV beats CSV.

Data Types & Formats: The Raw Material of Observational Research

The diverse array of sensors and modalities employed in an observational research lab generates a wide variety of data types, each with its own characteristics and optimal storage formats. Understanding these formats is crucial for efficient data management, processing, and analysis. The choice of format impacts file size, ease of parsing, and compatibility with various analytical tools.

Video Data (mp4): Visual Records of Behavior

Video Data, typically stored in formats like MP4, forms the primary visual record of observed behaviors. MP4 (MPEG-4 Part 14) is a widely used container format that can store video, audio, and other data. Its popularity stems from its efficiency in compression, allowing for relatively small file sizes while maintaining good visual quality. However, even compressed video files can be very large, especially with high resolutions and frame rates, necessitating significant storage capacity. Key considerations for video data include:

  • Resolution: The number of pixels (e.g., 1920x1080 for Full HD) directly impacts visual detail and file size.
  • Frame Rate (fps): The number of frames per second (e.g., 30 fps, 60 fps, 120 fps) determines the temporal resolution of the video, crucial for capturing rapid movements.
  • Compression Codec: The algorithm used to compress the video (e.g., H.264, H.265) affects file size and playback compatibility.
  • Metadata: Information embedded within the video file, such as recording date, time, camera settings, and synchronized event markers.
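To get a feel for why compression and storage planning matter, the uncompressed data rate can be estimated directly from resolution and frame rate. The following is a rough sketch only: the function name is illustrative, and the default of 3 bytes per pixel (24-bit RGB) is an assumption, since cameras often use other pixel formats.

```python
def raw_video_bytes_per_second(width: int, height: int, fps: float,
                               bytes_per_pixel: int = 3) -> float:
    """Uncompressed data rate: pixels per frame x bytes per pixel x frames per second."""
    return width * height * bytes_per_pixel * fps


# Full HD at 30 fps, assuming 24-bit RGB frames.
rate = raw_video_bytes_per_second(1920, 1080, 30)
print(f"{rate / 1e6:.1f} MB/s uncompressed")
print(f"{rate * 60 / 1e9:.2f} GB per minute uncompressed")
```

Doubling the frame rate or moving to a higher resolution scales this figure proportionally, which is why the codec choice has such a large effect on storage requirements.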

Audio Data (mp3, wav): The Soundscape of Interaction

Audio Data, often captured alongside video or independently, provides the auditory context of an observational study. Common formats include MP3 and WAV.

  • WAV (Waveform Audio File Format): A standard for uncompressed audio, offering high fidelity but resulting in very large file sizes. WAV files are ideal for applications where audio quality is paramount and subsequent detailed analysis (e.g., voice analysis, acoustic event detection) is required.
  • MP3 (MPEG-1 Audio Layer 3): A popular compressed audio format that significantly reduces file size by discarding some audio information deemed imperceptible to the human ear. While efficient for storage and distribution, MP3 is generally not preferred for raw research data where subtle acoustic details might be important for analysis.

Audio data is critical for analyzing verbal communication, vocalizations, and environmental sounds, providing complementary information to visual observations.
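Because WAV stores uncompressed samples, its size follows directly from sample rate, bit depth, and channel count. As a minimal sketch using Python's standard `wave` module, the example below writes one second of a synthetic tone to an in-memory WAV file and reads its parameters back. The sample rate, tone frequency, and variable names are assumptions chosen for illustration.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 44_100  # samples per second (assumed for this example)
DURATION_S = 1.0

# Synthesize one second of a 440 Hz tone as 16-bit mono PCM.
n_samples = int(SAMPLE_RATE * DURATION_S)
samples = [
    int(32767 * 0.5 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE))
    for i in range(n_samples)
]
pcm = struct.pack(f"<{n_samples}h", *samples)

# Write the samples to an in-memory WAV file.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav_out:
    wav_out.setnchannels(1)   # mono
    wav_out.setsampwidth(2)   # 2 bytes = 16 bits per sample
    wav_out.setframerate(SAMPLE_RATE)
    wav_out.writeframes(pcm)

# Read the header back and derive the size of the audio payload.
buffer.seek(0)
with wave.open(buffer, "rb") as wav_in:
    channels = wav_in.getnchannels()
    sample_width = wav_in.getsampwidth()
    frame_rate = wav_in.getframerate()
    n_frames = wav_in.getnframes()

payload_bytes = n_frames * channels * sample_width
print(f"{n_frames / frame_rate:.1f} s of audio, {payload_bytes} bytes of samples")
```

The same arithmetic applies to real recordings: a stereo or higher-bit-depth recording multiplies the payload accordingly.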

Sensor and Event Data (CSV, EDF, HDF5): Structured Numerical Streams, Versatile for Event Logs and Annotations

Raw numerical data from biophysical sensors, eye trackers, and other digital sensors are typically stored in structured formats that facilitate programmatic access and analysis. Common formats include CSV, EDF, and HDF5.

  • CSV (Comma Separated Values): A simple, plain-text format in which data values are separated by commas (or by other delimiters such as tabs). CSV files are human-readable and can be read by almost all data analysis software and programming languages. Beyond raw sensor data, CSV and tab-delimited files are frequently used for storing event logs, behavioral annotations, and other structured metadata. Their simplicity and widespread compatibility make them a good choice for exporting coded behaviors from video analysis software, storing experimental parameters, or logging system events. Each row typically represents an event or a data point, with columns representing different attributes (e.g., timestamp, event type, duration, participant ID). CSV files are suitable for small to medium-sized datasets, but become inefficient for very large, complex, or hierarchical data.

  • TSV (Tab Separated Values): Although the term CSV has become established, CSV files are often problematic because it is unclear which separator is used.

    • In countries where the decimal separator is a period, a comma is usually used as the field separator. Decimal numbers can then be stored in CSV files without being confused with the field separation. Problems arise when thousands separators are also written to the file: in these countries the thousands separator is often a “,”, which corrupts the file structure.
    • In countries where the decimal separator is a comma, a semicolon is used as the field separator in CSV files. Unfortunately, such files are also called CSV, which regularly leads to confusion and technical problems.
    • It is therefore recommended to use “TAB” as the separator and to label the file with the extension “.TSV” (Tab Separated Values).

CSV: Comma, Semicolon or Tab?

The separator in CSV files is:
Comma “,”: In countries that separate decimal numbers with “.” (e.g., 123.456).
Semicolon “;”: In countries that separate decimal numbers with “,” (e.g., 123,456).
It therefore makes sense in an international research context to use the tab character (character code 9) as a separator.
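The ambiguity described above can be reproduced with Python's standard `csv` module. The sketch below uses made-up values and a hypothetical helper named `parse`. It shows how a semicolon-separated export is silently mis-split when read with a comma delimiter, and how the same values survive a round trip through a tab-separated file.

```python
import csv
import io

# The same kind of row exported under the semicolon convention (illustrative values).
semicolon_export = "timestamp;value\n12,5;3,14\n"


def parse(text: str, delimiter: str) -> list:
    """Split delimited text into rows of string fields."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# Reading the semicolon file with a comma delimiter raises no error,
# but the decimal commas are treated as field boundaries.
wrong = parse(semicolon_export, ",")
print(wrong)  # [['timestamp;value'], ['12', '5;3', '14']]

# Writing with a tab delimiter keeps decimal commas inside their fields.
out = io.StringIO()
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
writer.writerow(["timestamp", "value"])
writer.writerow(["12,5", "3,14"])
tsv_text = out.getvalue()

rows = parse(tsv_text, "\t")
print(rows)  # [['timestamp', 'value'], ['12,5', '3,14']]
```

Note that TSV only removes the delimiter ambiguity. The reader still has to know which decimal convention the numbers use before converting them to floats.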

  • EDF (European Data Format): A standard file format for physiological time series data, particularly common in EEG and polysomnography. EDF files are designed to store multi-channel physiological recordings with precise timing information and metadata. They are robust and widely supported by specialized physiological data analysis software.

  • HDF5 (Hierarchical Data Format 5): A powerful and flexible file format designed to store and organize large amounts of heterogeneous data. HDF5 can store numerical datasets, images, and other data types in a hierarchical structure, making it ideal for complex multimodal datasets. It supports compression, parallel I/O, and is widely used in scientific computing for its efficiency and scalability.
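To illustrate how EDF carries timing information and metadata, the sketch below parses the fixed 256-byte ASCII header that begins an EDF file. The field widths follow the published EDF layout as understood here. The field names, the helper function, and the synthetic header contents are assumptions for illustration, and real files should normally be read with a dedicated EDF library.

```python
# (name, width in bytes) for the fixed part of an EDF header; names are illustrative.
EDF_FIXED_HEADER_FIELDS = [
    ("version", 8),
    ("patient_id", 80),
    ("recording_id", 80),
    ("start_date", 8),         # dd.mm.yy
    ("start_time", 8),         # hh.mm.ss
    ("header_bytes", 8),       # total header size, including per-signal headers
    ("reserved", 44),
    ("n_data_records", 8),
    ("record_duration_s", 8),
    ("n_signals", 4),
]


def parse_edf_fixed_header(raw: bytes) -> dict:
    """Slice the first 256 bytes into space-padded ASCII fields."""
    if len(raw) < 256:
        raise ValueError("EDF fixed header must be at least 256 bytes")
    fields, offset = {}, 0
    for name, width in EDF_FIXED_HEADER_FIELDS:
        fields[name] = raw[offset:offset + width].decode("ascii").strip()
        offset += width
    return fields


# Build a synthetic header (made-up values) so the example runs without a real file.
values = ["0", "X X X X", "Startdate X X X X", "01.01.24", "10.00.00",
          str(256 + 2 * 256), "", "10", "1", "2"]
synthetic = "".join(
    value.ljust(width) for value, (_, width) in zip(values, EDF_FIXED_HEADER_FIELDS)
).encode("ascii")

header = parse_edf_fixed_header(synthetic)
total_seconds = int(header["n_data_records"]) * float(header["record_duration_s"])
print(header["n_signals"], "signals,", total_seconds, "seconds recorded")
```

In this layout, the recording length follows from the number of data records multiplied by the duration of each record.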

Proprietary Binary Formats: Device-Specific Data

Many specialized research instruments (e.g., some eye trackers, high-end EEG systems) store their raw data in Proprietary Binary Formats. These formats are optimized for the specific device and its software, often offering high efficiency and preserving unique data characteristics. However, they typically require the manufacturer’s software or specific SDKs (Software Development Kits) for reading and processing, which can limit interoperability and long-term accessibility. Researchers often need to convert these proprietary formats into more open, standardized formats for broader analysis and sharing.

Event Logs: The Timeline of Experimentation

Event Logs are critical for reconstructing the timeline of an experiment. These typically record discrete events, their timestamps, and associated metadata. Events can include stimulus presentations, participant responses, system errors, or manual annotations made by researchers. Well-structured event logs are essential for synchronizing different data streams, segmenting data into trials or conditions, and performing event-related analyses. They serve as the definitive record of what happened during an experimental session, providing the temporal anchors for all other data modalities. Ideally, event logs should not be handwritten, but available as digital files. These can then be further processed as text or CSV/TSV files.
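As a minimal sketch of how an event log can segment another data stream into trials, the example below reads a small tab-separated log and groups sensor samples by trial window. The column names, event labels, sample values, and function names are all hypothetical, and the example assumes that both streams share the same clock.

```python
import csv
import io

# Hypothetical event log in TSV form.
event_log_tsv = (
    "timestamp_s\tevent\n"
    "0.0\ttrial_start\n"
    "2.0\ttrial_end\n"
    "3.0\ttrial_start\n"
    "4.5\ttrial_end\n"
)

# Hypothetical sensor samples as (timestamp in seconds, value).
samples = [(0.5, 10), (1.5, 11), (2.5, 12), (3.5, 13), (4.0, 14), (5.0, 15)]


def trial_windows(tsv_text: str) -> list:
    """Pair each trial_start with the next trial_end as a (start, end) window."""
    windows, start = [], None
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        t = float(row["timestamp_s"])
        if row["event"] == "trial_start":
            start = t
        elif row["event"] == "trial_end" and start is not None:
            windows.append((start, t))
            start = None
    return windows


def segment(data: list, windows: list) -> list:
    """Collect the sample values that fall inside each [start, end) window."""
    return [[value for t, value in data if lo <= t < hi] for lo, hi in windows]


windows = trial_windows(event_log_tsv)
trials = segment(samples, windows)
print(windows)  # [(0.0, 2.0), (3.0, 4.5)]
print(trials)   # [[10, 11], [13, 14]]
```

Samples recorded between trials (here at 2.5 s and 5.0 s) are dropped, which is usually the desired behavior for event-related analyses.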

FAQ - Frequently Asked Questions

What video format is ideal for observational research?
MP4 (MPEG-4 Part 14) is the standard container format for video data in observation labs. Modern codecs such as H.264 and H.265 offer a good balance between file size and visual quality at a given resolution and frame rate. Video files typically use the .mp4 extension, with the video stream inside compressed using H.264 or H.265.
What is the difference between WAV and MP3 for research audio recordings?
WAV stores uncompressed audio with full fidelity, making it ideal for voice analysis and acoustic event detection. MP3 uses lossy compression to reduce file size but discards subtle acoustic information - acceptable for transcription-only use cases but not recommended for acoustic research.
Why should researchers use TSV instead of CSV for international data export?
CSV is ambiguous: countries using '.' as a decimal separator use ',' as a delimiter, while countries using ',' as a decimal separator use ';'. Both are called CSV and cause frequent import errors. TSV (tab-separated values) avoids this ambiguity entirely and is the recommended format for international research contexts.
What is EDF (European Data Format) and when is it used in research?
EDF (European Data Format) is a standardized format for physiological time series data, widely used for EEG and polysomnography recordings. It stores multi-channel data with precise timing and metadata, and is supported by most physiological data analysis software, including Mangold DataView.
What are event logs and why are they critical in observational research?
Event logs are timestamped records of discrete occurrences during an experiment - stimulus presentations, participant responses, system events, or researcher annotations. They provide the temporal anchors needed to synchronize all other data streams and to segment data into trials or conditions for analysis.
When should HDF5 be used instead of CSV for research sensor data?
HDF5 is suited for large, complex, or hierarchically structured multimodal datasets that would become unwieldy in flat CSV files. It supports compression, parallel I/O, and hierarchical data organization, making it the preferred choice in large-scale scientific computing contexts.

Mangold Observation Labs

Mangold Observation Labs are a comprehensive turnkey solution for conducting behavioral research and observation.
