
Datasets

Datasets (concept)

«Dataset» is a broad concept referring to any file, set of files, folder, or set of folders that describes a meaningful unit of biological data about some biological material for a given condition and/or treatment. A file (or a directory, etc.) containing data is referred to, in LabID, as a datafile. In many cases a single datafile is enough to constitute the dataset, i.e. this single entity holds all parts of a meaningful unit of data about a sample. Conversely, in other cases the meaningful unit is a combination of datafiles.

Examples of datasets in genomics

  • In single-end sequencing, a dataset is composed of a single file containing all short DNA sequences.
  • Conversely, in paired-end sequencing, the meaningful unit of data is a combination of two files, each containing a part of the information.
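As a small sketch of the paired-end case, the snippet below groups FastQ files into per-sample dataset units. The Illumina-style `_R1`/`_R2` filename convention is an assumption made for illustration, not something the source document prescribes:

```python
import re
from collections import defaultdict

def group_paired_end(filenames):
    """Group FastQ files into per-sample dataset units by stripping the
    read-number suffix (_R1/_R2).  The filename convention is an
    assumption for illustration."""
    datasets = defaultdict(list)
    for name in filenames:
        # Remove the read-number token only when it sits right before
        # the .fastq / .fastq.gz extension.
        key = re.sub(r"_R[12](?=\.fastq(?:\.gz)?$)", "", name)
        datasets[key].append(name)
    return dict(datasets)

files = ["sampleA_R1.fastq.gz", "sampleA_R2.fastq.gz", "sampleB_R1.fastq.gz"]
for dataset, members in group_paired_end(files).items():
    print(dataset, "->", members)
```

Here two files collapse into one `sampleA` unit while `sampleB` remains a single-file dataset, mirroring the single-end vs paired-end distinction above.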

Summing up, what constitutes a dataset is largely assay- and context-dependent, and should be carefully considered before loading the data into LabID, based on the assay type, instrument settings, etc. Here are common examples:

Examples of datasets in LabID

  • Illumina sequencing


    • A single FastQ file (Single-end Illumina sequencing assay)
    • A pair of FastQ files (Paired-end Illumina sequencing assay)
  • Nanopore sequencing


    • A FastQ/Fast5 directory
    • A FastQ/Fast5 file, when the sequencer was set to send all reads for a given sample to a single file
  • Light and Electron Microscopy


    • A directory of images, containing all images for a given sample

Important metadata files (for example, those needed for raw dataset processing) can also be registered as datasets.

Datasets in LabID

More advanced lineage model for samples

In LabID, every dataset is produced by an assay and associated with its parent sample. Every dataset is composed of at least one datafile. An additional layer (the datafile copy) is introduced in our model to correctly trace multiple copies of a given datafile.

Datafile and datafile copy usage and disambiguation

The datafile is the entity that constitutes the dataset and is able to trace its copies down its lineage. The datafile copy is the direct representation of the data on disk.
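The dataset → datafile → datafile copy layering can be sketched with a few small classes. These are hypothetical, made-up names to illustrate the relationships described above, not LabID's actual schema or API:

```python
from dataclasses import dataclass, field

@dataclass
class DatafileCopy:
    path: str      # concrete location of the bytes (disk, cloud, tape)
    online: bool   # True if directly accessible, False if on long-term storage

@dataclass
class Datafile:
    name: str
    copies: list = field(default_factory=list)  # every copy traces back here

@dataclass
class Dataset:
    name: str
    assay: str           # the assay that produced the dataset
    parent_sample: str   # the sample the data describes
    datafiles: list = field(default_factory=list)  # at least one datafile

# One dataset, one datafile, one online copy on the group share.
ds = Dataset("run42", assay="RNA-seq", parent_sample="S1")
df = Datafile("run42_R1.fastq.gz")
df.copies.append(DatafileCopy("/groupshare/run42_R1.fastq.gz", online=True))
ds.datafiles.append(df)
print(ds.datafiles[0].copies[0].path)
```

The point of the extra `DatafileCopy` layer is that copies can come and go (archival, cloud mirrors, deletion) while the dataset and datafile identities stay stable.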

Multiple copies throughout the lifecycle of a dataset

Throughout its lifecycle, a dataset goes through multiple states. It is first created in the system (e.g. upon registering an assay), kept live for a while, and later archived before it is eventually deleted (and disk space effectively recovered). When a dataset is archived, an additional copy of its datafile copies is made and sent to long-term storage. At this point we have a single dataset and a single datafile, but two copies: one online and directly accessible, and one offline on long-term storage. The online copy remains accessible until it is deleted to recover disk space.
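The archive-then-delete sequence can be traced on a single datafile's copy list. Paths and the dict-based model below are made up for illustration:

```python
# A single datafile tracked through archival (hypothetical paths/model).
copies = [{"path": "/groupshare/run42.fastq.gz", "online": True}]

# Archiving adds an offline copy on long-term storage; the online copy stays.
copies.append({"path": "tape://archive/run42.fastq.gz", "online": False})
assert len(copies) == 2  # still one dataset and one datafile, now two copies

# Deleting the online copy recovers disk space; the datafile itself
# survives, now backed only by the long-term-storage copy.
copies = [c for c in copies if not c["online"]]
print([c["path"] for c in copies])
```

Note that deletion here removes a copy, not the datafile: the lineage record persists even when no online copy is left.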

Multiple live copies of the same dataset

It is not uncommon to possess multiple copies of the same data (for example, one on disk on the group share and one in the cloud, e.g. to be accessed by third-party tools). LabID makes sure the dataset and the datafiles are never duplicated, but instead provides a second live datafile copy. Every datafile copy can always be traced back to its parent datafile and dataset.

Raw datasets

A raw dataset refers to the primary dataset acquired during an assay by «reading through» some biological material using a specialised scientific instrument. Ideally, the raw data is presented in the exact format output by the instrument. Raw formats vary among instruments and disciplines (e.g. sequencers commonly output FastQ¹ files, the raw text format used to store short DNA sequences).

Non-raw datasets

Non-raw or derived datasets are either a post-processed version of the original raw data (e.g. normalized/filtered datasets produced by a processing pipeline) or data that record information about more than one dataset/sample, or about the assay itself (referred to as metadata). Typical examples of processed datasets include BAM or WIG files (common in sequencing), summary tables (e.g. a gene expression count table), etc.

  • Raw datasets must be linked to samples and to assays
  • Derived datasets may be linked to samples and to assays
  • Datasets may have parent dataset(s). This also holds for raw datasets, to make explicit a dependency between raw datasets, as in correlative microscopy.
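The linking rules above could be checked with a small validator. The field names (`raw`, `sample`, `assay`, `parents`) are hypothetical, chosen only to mirror the bullet points, and do not reflect LabID's actual schema:

```python
def validate_links(ds):
    """Check the linking rules: raw datasets must reference both a
    parent sample and an assay; derived datasets may omit them.
    Field names are hypothetical, not LabID's actual schema."""
    if ds.get("raw") and not (ds.get("sample") and ds.get("assay")):
        raise ValueError("a raw dataset must be linked to a sample and an assay")
    return True

validate_links({"raw": True, "sample": "S1", "assay": "RNA-seq"})  # passes
validate_links({"raw": False, "parents": ["run42"]})  # derived: links optional
```

A raw dataset missing either link would raise, while a derived dataset with only parent datasets passes, matching the must/may distinction in the list.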

Dataset Lineage

More advanced lineage model for datasets

Where to find my datasets

The dataset list page makes it easy to locate datasets by e.g. assay, study, parent sample, ownership, name, etc.

The Data Management menu gives access to Datasets and their composing Datafiles; Studies, which group related datasets together (e.g. all raw and derived datasets of an RNA-seq time course); and Dataset Collections, which allow grouping datasets in a custom way. Finally, datasets can be archived to tape to free expensive space on the primary volume.

Dataset Collections

Datasets are assembled into Dataset Collections

Collections allow grouping together datasets of a given type. In a sequencing study, for example, this makes it possible to group raw read files (FastQ) in one collection and aligned read files (BAM) in a separate one.

Study

Datasets belong to a study

Project

Datasets (and their parent samples) belong to a project

  1. «FASTQ format» article on Wikipedia: https://en.wikipedia.org/wiki/FASTQ_format