Dataset Management

Good data management is not a goal in itself, but rather is the key conduit leading to knowledge discovery and innovation, and to subsequent data and knowledge integration and reuse by the community after the data publication process.

The FAIR Guiding Principles for scientific data management and stewardship [1]

Reusing data is a core need in life sciences. This includes not only reusing data in downstream research within a lab, but also accessing complementary, well-described data sources from public registries.

Over the past decades, efforts have been made to define concepts and priorities for good data management (e.g. Wilkinson, M., Dumontier, M., Aalbersberg, I. et al.). These works provide invaluable insights on how to achieve traceability and reusability of data, and it is clear that this can only be achieved with the help of cutting-edge data management software.

LabID is developed with these concerns and best practices at heart. We have modeled several helpful entities - samples, assays, datasets, etc. - based on experience, existing specifications and collaborative work with our community, in order to better address everyone's dataset management needs.

All entities can be managed independently or jointly. Relationships between these entities are also recorded, ensuring complete lineage traceability.

A Complete Lineage

Datasets (and their metadata) are never registered in isolation. The registration process requires complementary information to be provided, which makes it possible to record and trace the entire dataset lineage. We track how a dataset was obtained (exhaustively recording assay information) as well as which biological material (or sample) it was acquired from.

A dataset is generated by an assay, and derives from a sample, establishing the following chain:

(Specimen →) Sample(s) → Assay → Dataset(s)

All this information is then made easily accessible through the interface. This becomes particularly handy when submitting studies to journals or when attempting to reuse data for further analysis.
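The chain described in this section can be pictured with a minimal Python sketch. The class names, fields and the `lineage` helper are hypothetical and chosen for illustration only; they are not LabID's actual data model or API. The sketch assumes that a raw dataset points to its assay, a processed dataset points to its parent dataset, and a sample may derive from a specimen and/or other samples.

```python
from __future__ import annotations

from dataclasses import dataclass, field


# Hypothetical entities, for illustration only (not LabID's real models).
@dataclass
class Specimen:
    name: str


@dataclass
class Sample:
    name: str
    specimen: Specimen | None = None                      # optional origin
    parents: list[Sample] = field(default_factory=list)   # samples it derives from


@dataclass
class Assay:
    name: str
    samples: list[Sample]                                  # samples consumed by the assay


@dataclass
class Dataset:
    name: str
    assay: Assay | None = None       # set on raw datasets
    parent: Dataset | None = None    # set on processed datasets


def lineage(dataset: Dataset) -> list[str]:
    """Walk back from a dataset to the assay, samples and specimens behind it."""
    steps = []
    current = dataset
    # Follow processed datasets back to the raw dataset they derive from.
    while current.parent is not None:
        steps.append(f"Dataset: {current.name}")
        current = current.parent
    steps.append(f"Dataset: {current.name}")

    if current.assay is not None:
        steps.append(f"Assay: {current.assay.name}")
        pending = list(current.assay.samples)
        seen = set()
        while pending:
            sample = pending.pop(0)
            if id(sample) in seen:   # avoid listing a shared parent twice
                continue
            seen.add(id(sample))
            steps.append(f"Sample: {sample.name}")
            if sample.specimen is not None:
                steps.append(f"Specimen: {sample.specimen.name}")
            pending.extend(sample.parents)
    return steps


# Example with made-up names.
specimen = Specimen("mouse-1")
tissue = Sample("liver", specimen=specimen)
extract = Sample("rna-extract", parents=[tissue])
assay = Assay("rna-seq-run", samples=[extract])
raw = Dataset("reads.fastq", assay=assay)
processed = Dataset("counts.tsv", parent=raw)

for step in lineage(processed):
    print(step)
```

Starting from the processed dataset, the walk lists the raw dataset, the assay, both samples and finally the specimen, which is the kind of trace needed when preparing a submission.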

Global overview of the relationships between models

Data Management Overview. Assays (yellow diamond) consume samples (green circles) - that may derive from a Specimen and/or other samples - and generate raw datasets (red circles), which in turn can be further processed. Related raw and processed datasets are grouped into a Study (grey rectangle) that belongs to a Project (blue rectangle).

The assay consumes samples to produce datasets. The assay model ensures that important metadata from the scientific equipment and the data-producing run are adequately captured. This information is needed later, when submitting raw datasets to public repositories.

A raw dataset corresponds to the data produced by the instrument (sequencer, microscope), before any post-processing happens. For example, sequencers commonly produce FastQ files (Illumina, Nanopore) or Fast5 files (Nanopore); Zeiss microscopes produce .czi files.
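As a rough illustration of the examples above, the following Python sketch guesses the kind of instrument from a raw file name. The `guess_instrument` helper and the mapping are assumptions made for this example, not part of LabID; the mapping covers only the formats named above, and treating a trailing `.gz` as compression is a further assumption.

```python
from __future__ import annotations

from pathlib import Path

# Formats taken from the examples in the text; not an exhaustive list.
RAW_FORMATS = {
    ".fastq": "sequencer (e.g. Illumina, Nanopore)",
    ".fast5": "sequencer (Nanopore)",
    ".czi": "microscope (Zeiss)",
}

# Assumption: raw files may be gzip-compressed, e.g. reads.fastq.gz.
COMPRESSION_SUFFIXES = {".gz"}


def guess_instrument(filename: str) -> str | None:
    """Return a hint about the producing instrument, or None if unknown."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    # Ignore trailing compression suffixes before looking at the format.
    while suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes.pop()
    if not suffixes:
        return None
    return RAW_FORMATS.get(suffixes[-1])


print(guess_instrument("sample_R1.fastq.gz"))
print(guess_instrument("embryo_stack.czi"))
print(guess_instrument("counts.tsv"))  # not a known raw format
```

A file that does not match any known raw format, such as a count table, returns `None`, which fits the idea that processed datasets are tracked separately from raw ones.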

Video: Understanding LabID object relationships ~19min

This tutorial (outdated in terms of user interface) gives a complete overview of LabID, with a focus on the interconnection between the Collections & ELN, Biomaterial management and Assay & Dataset management modules. It describes how the Assay & Dataset section works together with the rest of LabID, using an Illumina NGS assay as an example.

Content: relationships between samples and datasets, samples and ELN, datasets and assays, studies and projects; difference between ELN and protocols.

References

  1. Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. «The FAIR Guiding Principles for scientific data management and stewardship». Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18