Use cases

Fetching Dataset Metadata

The get module can be used to fetch lists of objects. The general command to fetch any item from LabID is items (check labid get items --help for details). In this use case, we use the datasets command, which offers the usual filters to list datasets.

Tips

Note that datasets is actually a wrapper for the items command with the item model set to dataset.

Using labid get datasets, you can easily list the datasets that belong to a project, a study, an assay or a collection using its UUID. For example, to list all datasets generated by the Illumina assay id=90988e6e-b55f-470e-8f0a-e887e12131d2, run:

labid get datasets -a 90988e6e-b55f-470e-8f0a-e887e12131d2

The list of datasets is printed to stdout as a JSON array.

Using a tool like jq, you can further process the JSON response; for example, to extract the name and UUID of each dataset that belongs to a particular study and format the extracted values as TSV (tab-separated values):

> labid get datasets -s 568af9d5-6f36-432f-a1cb-7faeb7c45697 | jq -r '.[]|[.name, .id] | @tsv'
sciATAC_SS157_HTGWCBGXB_lane1   8182d59b-cb1b-49f3-894d-a12f7cf3a9a0
sciATAC_SS159_H2LCNBGXC_lane1   a66fe070-6853-4ed7-836b-9fed662e868d
sciATAC_SS158_H77KCBGXC_lane1   c410cb30-71c7-45e4-978c-0375868fca9e
sciATAC_SS148_H5Y3MBGXB_lane1   d3d4de1b-6dc8-499b-8186-60f879ba1af3

Learn about additional options with labid get datasets --help.

Linking Dataset Files in your project folder

In LabID's protected data library, the data files are organised by technology and assay. In a project context, it is common practice to store such data in a project folder. Since datasets can be used in multiple projects, the best way to avoid duplicating large data files across several project folders is to reference the data stored in the LabID protected data library from your project using symbolic links.

This can be achieved with the export module and the links command. For example, to link all the dataset files from a given study to your project folder, one can simply issue:

> labid export links -s study_id  --target-dir /path/to/project/src/

As a result, the symbolic links are named after the original dataset file names, e.g.

> ls -l /path/to/project/src/
original_fastq_file_forsample1_R1.txt.gz -> /path/to/labid/Data/Assay/sequencing/assayX/original_fastq_file_forsample1_R1.txt.gz
...
original_fastq_file_forsample3_R1.txt.gz -> /path/to/labid/Data/Assay/sequencing/assayX/original_fastq_file_forsample3_R1.txt.gz

Adding links to additional data

In the course of a project, it is quite usual to receive new datasets for a given study (e.g. new replicates or conditions). To generate links to these new data files, simply call the same command with the --resume option, i.e.

> labid export links --resume -s study_id  --target-dir /path/to/project/src/

It is sometimes useful to choose a different name for the symbolic link than the original file name, e.g. when the data files were not named in a consistent way, when the original names are very long, or when a tool expects a precise naming pattern for its input files. The latter is often the case when using workflow engines like snakemake or nextflow, where file names must include key factors that are parsed by the workflow to infer replicates, control vs treatment grouping, etc.

To solve this issue, the -f, --link-name-formulae option can be used to apply a formula (or renaming pattern) to generate the link names using any available metadata. The pattern is a mix of plain text and metadata placeholders of the form {metadata_name}. When creating the link names, the metadata placeholders are replaced by their values, e.g. {Sample} is replaced by the sample name (with spaces replaced by _).

For example, if the datasets are fastq files, one can use:

> labid export links -f '{Sample}_{Read Type}.fastq.gz' -s study_id  --target-dir /path/to/project/src/ 

to rename all the fastq file links:

> ls -l /path/to/project/src/
sample1_read1.fastq.gz -> /path/to/labid/Data/Assay/sequencing/assayX/original_fastq_file_forsample1_R1.txt.gz
...
sample3_read1.fastq.gz -> /path/to/labid/Data/Assay/sequencing/assayX/original_fastq_file_forsample3_R1.txt.gz

What are the available metadata placeholders?

One can list the metadata placeholders available in a given context (here, a particular study) to build the -f pattern. Note that some are always available (e.g. Sample) while others are context specific (e.g. annotations).

> labid export links --list-data-headers -s 568af9d5-6f36-432f-a1cb-7faeb7c45697  --target-dir /path/to/project/src/ 

Available headers in this data context: ['Dataset ID', 'Dataset', 'Type', 'Study ID', 'Study', 'File Name', 'FilePath', 
'Checksum', 'Read Type', 'Assay ID', 'Assay Name', 'Assay Producer', 'Instrument Name', 'Instrument Model', 'Sample ID', 
'Sample', 'Organism', 'Material Type', 'Sample QC', 'Library Strategy', 'Library Orientation', 'Library Source', 
'Library Selection', 'RT Primer Type', 'Library End Bias', 'Library Strand', 'Is Control Sample', 'Is Single Cell Sample', 
'Single Cell Number', 'Single Cell Library Construction', 'Single Cell Isolation', 'Single Cell Well Quality', 
'Has Screen Plate Info', 'Protocols', 'Sample[Age]', 'Sample[InitialTimePoint]', 
'Sample[Organism]', 'Sample[SampleType]', 'Sample[Sex]', 'Sample[StrainOrLine]', 'Dataset[emBASE ID]', 'Dataset[emBASE URL]']

When your samples, assays, and/or datasets have been annotated, these annotations are also available, e.g. the Age annotation on Sample is available as Sample[Age] and can then be used in a -f pattern like {Sample}_{Sample[Age]}_{Read Type}.fastq.gz.

Spaces in replacement values are substituted with _ (underscore), e.g. an Age annotation like 6 hrs will be injected as '6_hrs' in the link name.
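Putting this together, a command using the annotation placeholder (reusing the study UUID from the earlier example) could look like:

> labid export links -f '{Sample}_{Sample[Age]}_{Read Type}.fastq.gz' -s 568af9d5-6f36-432f-a1cb-7faeb7c45697 --target-dir /path/to/project/src/

and would produce link names such as sample1_6_hrs_read1.fastq.gz.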

Learn about additional options with labid export links --help.

Publishing your NGS Data (i.e. Study) to EBI's BioStudies & ENA Repositories

A MAGE-TAB document can be exported with

labid export study -s <study_uuid> -o /g/funcgen/tmp/ -f study_magetab.txt

The exported TSV document is not ready for submission and should be manually reviewed and edited. In particular:

  • the experimental factors should be edited to reflect the study's design terms
  • ontology term mapping may be missing.

Once edited and consolidated (you may have to go back to LabID and fill in missing information/annotations), the MAGE-TAB can be submitted to EBI's Annotare (NGS data) for submission to ENA/BioStudies.

Before sending the submission email to EBI, you should upload the fastq files. This can be done easily thanks to the labid export links command explained in the previous section. Practically speaking, we advise the following strategy (sketched below):

  • create a submission directory submission_data somewhere
  • link the study's dataset files with labid export links -s study_uuid -t /path/to/submission_data
  • upload via FTP or Aspera from within the submission_data directory
  • email the reviewed MAGE-TAB to EBI Annotare, indicating how you transferred the data files (i.e. FTP, Aspera...)
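As a concrete sketch of this strategy (the upload step is illustrative; use the host and credentials provided by Annotare):

> mkdir -p /path/to/submission_data
> labid export links -s study_uuid -t /path/to/submission_data
> cd /path/to/submission_data
> ftp <annotare-ftp-host>   # or an Aspera transfer, with the credentials from Annotare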

Registering Raw Data with an Assay (Aviti example)

This is achieved using the register module and the assays command, which lets you register one or more assays and associated raw data. This is the CLI equivalent of the UI data loader wizard.

The register assays command makes use of a sniffer. Sniffers are plugins whose role is to parse the content of a directory (the 'run directory') into an InstrumentRun object containing the assay, dataset and sample definitions. While LabID comes with a few sniffers ready to use, a common use case is to write your own sniffer (please refer to the sniffer documentation section, and make sure to check the available ones to get started).

Here we demonstrate the command by registering an AVITI sequencing assay using the Aviti sniffer named AvitiRunSniffer.

> labid register assays -n AvitiRunSniffer -t sequencing -i /path/to/aviti_run_dir/ -s study_id  

Please make sure to:

  • replace the study_id with the actual study UUID
  • adapt the path to the AVITI run folder with the actual path

Also note that:

  • labid register assays --help prints help and details about the options globally available with the labid register assays command.
  • the sniffer specifications and the sniffer specific options for the AvitiRunSniffer can be seen with labid get sniffers -n AvitiRunSniffer.

For example, the --sniffer-param (or -x) option allows you to forward sniffer-specific options to the sniffer:

> labid register assays -n AvitiRunSniffer -x 'sniffer_option1=XX' -x 'sniffer_option2=YY' -t sequencing -i /path/to/aviti_run_dir/ -s study_id

One can thus tune the AvitiRunSniffer behavior with respect to sample validation using the following sniffer options:

  • tell the sniffer that the sequencing libraries must already exist in LabID with sample_must_exist, i.e. add -x 'sample_must_exist=True' to the command line.
    • Using this option makes labid register assays fail if the sample is not found.
    • Using this option makes it possible to connect datasets from different runs to the same sequencing library (e.g. re-sequencing use case).
  • rewrite the sample names on the fly with sample_name_regex, i.e. add -x 'sample_name_regex=(.+)(PE20|iTRU)\w+' to validate the sample name format and replace the sample name with the first matching group of the regular expression. Here, a sample name like sample1PE2034 will be replaced by sample1 and the PE2034 part will be ignored.
    • Used in combination with sample_must_exist=True, the sample lookup in LabID will be done using sample1 (not sample1PE2034), which is very useful to deal with POOL libraries; a combined invocation is sketched below.
    • More complex sample name reformatting can be addressed using the sample_reformat sniffer option.
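For example, a combined invocation enforcing both options (values are illustrative) could be:

> labid register assays -n AvitiRunSniffer -t sequencing -i /path/to/aviti_run_dir/ \
  -s study_id -x 'sample_must_exist=True' -x 'sample_name_regex=(.+)(PE20|iTRU)\w+'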

Where should the data be located for registration?

To register data in LabID, the data can be located in one of two places:

  • the user's dropbox. In this case, the registration process includes copying the data from the user dropbox to the LabID protected data library of the user's group. The data can be deleted from the user dropbox only after you have received a "success" email.
  • the LabID protected data library of the user's group. This option is only available if your installation uses trusted transfer services able to copy data (and set correct ownership) directly into the LabID protected data library, or if the LabID protected data library is not protected (highly discouraged).

AvitiRunSniffer description and expectations

One can learn details about the AvitiRunSniffer's expectations with:

> labid get sniffers -n AvitiRunSniffer
Sniffer: AvitiRunSniffer [AssaySniffer]
Supported technology: sequencing
Supported platforms: ['Element Biosciences']
Supported sniffer parameters:
  - sample_name_regex : regex to validate the sample name (default: none i.e. no validation)
  - sample_must_exist : True or False. If True the sample must exist in LabID (default=False)
  - sample_type : The sample type for look up or create a new one (default=SEQUENCINGLIBRARY)
  - demultiplexed_folder_name : The name of the demultiplexed folder in the run folder (default=Samples)
  - project_folder_name : The name of the project folder in the demultiplexed folder (default=DefaultProject)
  - unassigned_reads_folder_name : The name of the unassigned reads folder in the demultiplexed folder 
  - ignored_phix_folders : ignore PhiX spike-in folders (True|False) i.e. folders starting with "PhiX_..."
  - test_mode : switch test mode on (True|False).

    This sniffer extracts raw file and metadata from a folder generated by an AVITI sequencer run. 
    The following metadata files are expected at the root of the run folder (in brackets their mandatory status):
    - RunManifest.json (mandatory)
    - RunParameters.json (mandatory)
    - IndexAssignment.csv (optional)
    - Metrics.csv (optional)
    - UnassignedSequences.csv (optional)
    - RunManifest.csv (optional)
    - RunStats.json (optional)
    - info (optional)

    The run folder must also contain a subfolder named 'Samples' (can be 
    adapted using the 'demultiplexed_folder_name' parameter) which must contain a project subfolder named 
    'DefaultProject' (can be adapted using the 'project_folder_name' parameter). 
    The project subfolder contains the demultiplexed sample fastq files (compressed) with each sample's files 
    stored in their own subfolder named after the sample name; and metadata files:
    - <project_name>_IndexAssignment.csv (optional)
    - <project_name>_Metrics.csv (optional)
    - <project_name>_RunStats.json (optional)
    - <project_name>_QC.html (optional) 

    Optionally, a subfolder named 'Unassigned' (can be adapted using the 
    'unassigned_reads_folder_name' containing the unassigned reads may be found in the demultiplexed folder.  

    If PhiX spike-in samples are present (folder named like 'PhiX_...'), they are ignored by default (this can 
    be changed using the ignored_phix_folders parameter). 

    Example of expected directory structure:

    <root_run_folder>
        |--- RunManifest.json
        |--- RunParameters.json
        |--- IndexAssignment.csv
        ...
        |--- Samples
            |--- DefaultProject
                |--- DefaultProject_IndexAssignment.csv
                |--- DefaultProject_Metrics.csv
                |--- DefaultProject_RunStats.json
                |--- <sample_name_1>
                    |--- <sample_name_1>_R1.fastq.gz
                    |--- <sample_name_1>_R2.fastq.gz
                |--- <sample_name_2>
                    |--- <sample_name_2>_R1.fastq.gz
                    |--- <sample_name_2>_R2.fastq.gz
                ...
                |--- PhiX_blah (ignored by default)
            |--- Unassigned (optional)
                |--- Unassigned_R1.fastq.gz
                |--- Unassigned_R2.fastq.gz

Batch registering Raw Data with an Assay

This is achieved using the register module, whose batch command lets you register assays from a root directory in which each sub-directory corresponds to one assay.

Each sub-dir is expected to have the same structure and obey the same naming convention.
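An illustrative root directory layout (folder names are hypothetical) would be:

/path/to/root_dir/
    |--- run_folder_001/
    |--- run_folder_002/
    |--- run_folder_003/

where each sub-directory is parsed by the sniffer into one assay.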

As with the assays command, batch relies on a sniffer to parse each sub-directory's content into an InstrumentRun object containing the assay, dataset and sample definitions. While LabID comes with a few sniffers ready to use, a common use case is to write your own sniffer (please refer to the sniffer documentation section and check the available ones to get started).

The batch command is particularly suited to automating primary data registration. In such a scenario, labid register batch is called automatically and periodically, e.g. by a cron job, to inspect a root directory. Here, the --daemon and --db-name options should be provided to enable job tracking. We also recommend redirecting the logging messages to a log file using the global --log-file option.

A simple command for automated data registration would look like:

> labid --log-level info --log-file job.log register batch --daemon --db-name db.sqlite -r /path/to/root_dir/ -s study_id -n CopasVisionAssayLoader 

At each cron call, the command inspects the root dir (e.g. /path/to/root_dir/) for sub-folders. Each sub-folder found is then assessed against the following questions:

  • Should the folder be skipped over? i.e. was the folder already processed (completed job status) or flagged in an error state?
  • Is the folder ready to process? If ready, the data is parsed using the sniffer and the extracted run is submitted to LabID. Submitted folders are flagged with the processing status.
  • Is the folder now loaded in LabID? i.e. when submitted data (processing status) has been fully loaded in LabID, the folder job status is set to completed.
  • Should the folder be deleted? i.e. folders with completed status will be deleted after a security period of 48 hrs (default) when --delete is true.

When --delete is used, make sure to set permissions correctly so that the user running the cron job is allowed to delete the data folders (else an error occurs).

How is a folder deemed ready to process?

To avoid processing a folder too early (for example while data is still being copied), an overall folder MD5 hash is computed using all the relative file paths and sizes in bytes: a list of <file_relpath> <byte size> entries is built and sorted, and the resulting multiline string is used to compute the MD5 hash.

The directory is deemed ready when this MD5 hash does not change between two consecutive calls (i.e. cron jobs).
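For intuition, a rough shell equivalent of this fingerprint (illustrative only and relying on GNU find; the actual computation is internal to LabID) is:

> find /path/to/root_dir/some_subfolder -type f -printf '%P %s\n' | sort | md5sum

If two consecutive cron runs yield the same digest, the folder content is considered stable.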

As a result, you should always copy a complete directory (i.e. one containing all the data) into the monitored root dir. Alternatively, for example when the data is acquired continuously, e.g. by a microscope, you should leave enough time between two consecutive cron runs to account for the instrument's behavior.

You should never create a new folder under the root dir and only partially copy data, pause, and copy more data later, as you may end up either with an error or with an assay holding partial data that will need to be cleaned up manually. If automated input data deletion is activated, data loss may even occur.

When using periodic tasks to automatically load data, we strongly advise turning email reporting on. We also suggest creating a technical user for the automation. The command to register as a cron job could look like:

> labid --log-level info --log-file job.log register batch \
  --config-path ~/.labid/technicaluser.yml \
  --daemon --db-name db.sqlite \
  -r /path/to/root_dir/ \
  -s study_id -n CopasVisionAssayLoader \
  --delete --error-email admin@fake.com --success-email user@fake.com
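A matching crontab entry (the schedule and absolute paths are illustrative; leave enough time between runs, as discussed in the next section) could look like:

*/30 * * * * /usr/local/bin/labid --log-level info --log-file /path/to/job.log register batch --config-path ~/.labid/technicaluser.yml --daemon --db-name /path/to/db.sqlite -r /path/to/root_dir/ -s study_id -n CopasVisionAssayLoader --delete --error-email admin@fake.com --success-email user@fake.com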

Please use labid register batch --help for help and details about options.

One can learn which sniffers are available in a given installation using labid get sniffers:

> labid get sniffers 
Sniffer: NanoporeAssaySniffer [AssaySniffer]
Sniffer: GeneCoreAssaySniffer [AssaySniffer]
Sniffer: SeqStrandAssayValidator [AssaySniffer]
Sniffer: TRECSequencingPlateAssayValidator [AssaySniffer]
Sniffer: CopasVisionAssayLoader [AssaySniffer]

and get details about a particular sniffer with labid get sniffers -n NanoporeAssaySniffer

Sniffer Detail Example: NanoporeAssaySniffer

> labid get sniffers -n NanoporeAssaySniffer
Sniffer: NanoporeAssaySniffer [AssaySniffer]
Supported technology: sequencing
Supported platforms: ['NANOPORE']

This sniffer looks into a directory expecting the native project_id structure created by nanopore sequencer:
    - PROJECT_FOLDER => an option project_id folder regrouping multiple samples & runs
        - LIBRARY/SAMPLE_FOLDER(s) => an optional library (potentially multiplexed) folder containing 
                                    results from 1 or more runs (tech replicates)
            - RUN_FOLDER(s)
                - barcode_alignment_* file: contains a single line when the sample is not a multiplexed library.
                                            MANDATORY
                - final_summary_* file: key=value summary of main parameters. 
                                        MANDATORY
                - report_*.md file: a multi-section file with an initial JSON session holding all but more 
                                    params than final_summary_* file
                                    MANDATORY
                - duty_time_* file: we will assume this optional (as we dont use it)
                - throughput_* file: we will assume this optional (as we dont use it)
                - other_reports/*csv: additional files generate during run. We wont assume anything here   

                If base calling was OFF:
                - fast5 | pod5 : a directory containing the raw fast5/pod5 signal. MANDATORY

                If base calling was ON:
                - sequencing_summary_* file: a big file listing all the reads sequenced and in which fastQ/fast5
                                            file(s) they are in, this file is multi Gb big.
                                            MANDATORY when base calling is true

                - (fast|pod)5_fail & (fast|pod)5_pass: directories containing the fast5/pod5 files for failed
                                            and passed reads.
                                            A sub-directory layer 'barcode01-12' and 'unclassified' is present 
                                            if the library is multiplexed. 
                                            one of the two is MANDATORY
                - fastq_fail & fastq_pass: optional directories (present if base calling was done) containing 
                                            the fastq files for failed and passed reads. 

The run folder detection occurs by looking for the presence of files matching the pattern
'final_summary_*.txt' and/or 'report_*.md' (both must be found). 
Each directory containing such a file will be parsed into a nanopore assay. 

When the fast5/pod5 (and fastq) directories contain more than one file, the directories are registered as 
'multi-fast5'/'multi-pod5'/'multi-fastq' dataset directories. When unique files are found per dir_path, 
there are registered as pod5/fast5/fastq dataset files. 
 Each '(fast|pod)5', '(fast|pod)5_pass'/'(fast|pod)5_fail', 'fastq_pass' and 'fastq_fail'
 will end up as a different dataset collection when found. 

The 'sequencing_summary_* file' and verbose other metadata files are also registered as a dataset

An Assay is created for each *run*.

The instrument serial number (matching the instrument's code in LabID) is taken from the metadata 
'sequencer_serial_number' when present. When absent, the serial number will be created as 
'<device_type>-<host_product_serial_number>' to differenciate between gridion and p2 models (as both would 
indicate the same 'host_product_serial_number') 

Registering Computed Datasets

This is achieved using the collection command from the register module. The difference from the batch and assays commands is that collection registers derived (i.e. processed) data as opposed to primary data. While primary data registration must include the description of the assay performed to generate the primary data, the registration of derived data is simpler.

The command will collect all the datasets, optionally matching some pattern (e.g. *.bam), and register them in LabID as a single dataset collection (which also makes them easy to find back later). As always when registering datasets, an existing study must also be given, and the datasets will be added to it in addition to the dataset collection.
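A minimal sketch of such an invocation (only the -s option is confirmed above; the input directory flag is an assumption borrowed from register assays, so check the help output for the exact names):

> labid register collection -s study_id -i /path/to/processed_data/   # '-i' is an assumed flag name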

Please see labid register collection --help to learn all available options.

Automating NGS Data Validation

The LabID API allows assays to be initialised on the server. Upon such assay initialisation, the assay state is set to initialised and the assay owner receives an email with a link to the assay detail page, where a Register button is available. The assay owner should use this Register button to finalise the assay registration and eventually create the datasets, input samples, etc., after which the assay state is changed to registered.

The labid validate rundir -r /path/to/rundir command lets you validate a single assay, where /path/to/rundir points to the assay run directory within the LabID protected data library (e.g. DATA_ROOT_DIR/Data/Assays/sequencing/2024/a_rundir/).
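For example:

> labid validate rundir -r DATA_ROOT_DIR/Data/Assays/sequencing/2024/a_rundir/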

Similar to the batch registration of assays (previous section), pre-registered initialised assays can be automatically validated in batch. This is particularly useful when similar assays are produced daily and their run dirs obey strict naming and structure conventions. Batch mode is available using labid validate batch.

With both the rundir and batch commands, the assay validation is again performed using a Sniffer. For example, batch validation of all qualifying initialised assays can be performed using:

labid validate batch -n SeqStrandAssayValidator

The command looks for any initialised assay in LabID and tries to register/validate it according to the sniffer rules. We highly recommend implementing the sniffer's dir_qualifies() function, which allows assays that do not qualify for automated registration to be skipped over.

Here again, the command offers options (like email reporting) to support automation, e.g. via a cron job; we also suggest creating a technical user for this kind of automation.

Automation: Permanent Login with an Unlimited API Key

If you automate tasks using the CLI, you should replace the JWT token in ~/.labid/labidapi.yml with an API key. At the time of writing, only your admin can generate this key for you, using the LabID Admin UI. Instructions to generate an API key are available in the admin documentation.
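For illustration, the file content could then look as follows (the key names and URL are assumptions; keep the structure of your existing file and only swap the token value for the API key):

> cat ~/.labid/labidapi.yml
api_url: https://labid.example.org/api   # assumption: your LabID server URL
token: <your_api_key>                    # the generated API key replaces the JWT token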

We also suggest creating a technical user, rather than a real user, to operate these tasks. This will avoid several issues, e.g. when the user who set up the automation moves to another job.