Automating Assay Data Registration in the Vincent Lab
Scientists in a lab work on a wide variety of subjects and techniques, generating very heterogeneous data that requires custom registration with the import wizard. Projects involving the systematic processing of many samples with a stable workflow (fixed protocols, equipment and assay setup) lend themselves to automating the data and assay registration.
Here we demonstrate how data registration can be automated with a concrete project conducted with the Vincent Lab at EMBL Heidelberg.
Access the complete information and detailed setup
The complete how-to is available to Vincent lab users in this git repo [Authorized users only]
Scientific background
Diatoms are microscopic marine organisms that play a crucial role in the Earth's ecosystems, conducting about 20% of the planet's annual photosynthesis and supporting marine food chains. They maintain various tight relationships with other organisms, some mutually beneficial, others harmful; these relationships shape diatom growth and have an overall impact on the global environment. Surprisingly, even though around 20,000 diatom species are known, fewer than 200 of these relationships have been described, indicating that many more are waiting to be found. Recent research has pointed out that our methods for uncovering diatom interactions have limitations. We therefore believe that diatom symbioses are more common than previously thought and are significant for diatom ecology, and thus for our ecosystem's functioning, especially in changing environments.
The Vincent Lab uses cutting-edge techniques to systematically unravel these tiny partnerships in natural marine samples, starting with diatoms. They use a treasure trove of samples collected during scientific expeditions on the Tara schooner, preserved in a way that allows studying the organisms and their partnerships without altering them, at the single-cell level. The Vincent lab combines high-throughput feedback microscopy, image-enabled cell sorting, and downstream single-cell omics to resolve organismal taxonomy and thereby reveal the breadth of microbial symbioses in the open ocean.
To achieve their goals, the Vincent lab screens many TARA samples using the same workflow and protocols. Such a setup constitutes the perfect use case for automated data registration in LabID: it enforces that the data structure matches the data management plan, guarantees data traceability and reduces the time devoted to data management operations, thereby relieving the team of the burden of manually registering data.
Overview of the Tara sample processing
Data storage strategy
We opted for the following strategy:
- The original Tara samples are registered as Sample and the sorted plates as Sequencing Library, with each Sequencing Library plate being a child of a Tara Sample
- The image-enabled sorting assays and their digital outputs (experimental metadata, images...) should be loaded as Light Microscopy Assay. Each assay takes a single plate as input and defines the different digital outputs (experimental metadata, images...) as output datasets.
- The single cell sequencing should be loaded as Illumina Sequencing Assay with one or more plates as input, and the demultiplexed FastQ files as output datasets.
Mapping the sample processing workflow to LabID objects
The tiff files, the channel profile txt files and the experimental metadata files (lmd, dcimg, ...) produced by the image-enabled sorting assays are registered as individual datasets.
Similarly, the FastQ files produced by the Illumina sequencing assay are registered as individual datasets.
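As an illustrative sketch, the mapping above can be expressed with plain Python data classes. The class and field names here are simplified assumptions for illustration, not the actual LabID object model:

```python
from dataclasses import dataclass, field

# Illustrative sketch of the storage strategy; the class names are
# simplified assumptions, not the actual LabID object model.

@dataclass
class Sample:
    name: str

@dataclass
class SequencingLibrary:
    name: str
    parent: Sample  # each sorted plate is a child of a Tara sample

@dataclass
class Dataset:
    name: str

@dataclass
class Assay:
    kind: str                                    # e.g. "Light Microscopy Assay"
    inputs: list = field(default_factory=list)   # one plate (sorting) or several (sequencing)
    outputs: list = field(default_factory=list)  # digital outputs as datasets

tara = Sample("Tara_042")
plate = SequencingLibrary("Tara_042_20230415_SC1", parent=tara)
sorting = Assay("Light Microscopy Assay",
                inputs=[plate],
                outputs=[Dataset("A1.tiff"), Dataset("ch1.txt")])
```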
Sample and plate creation
All the Tara samples have been preloaded in LabID using Excel-based batch import (also see this training). Each time a sample is sorted into a plate, a new sequencing library reflecting this plate should be manually created. This operation is easily done using the Create Child button available on the Tara sample page (training here (step 2)).
Automated registration of the sorting assays
To automatically register the image-enabled sorting assays and their digital outputs, we use the CLI register batch command with a custom sniffer plug-in whose role is to parse the folder containing all the files created by the particle sorter. This implies that the content of this folder is always organised in the same way, and that files and folders follow a predefined naming convention. Such requirements allow the data to be automatically validated (all expected data found and named according to expectations) before its registration.
The implementation of the copas_vision_loader sniffer is visible in the LabID CLI project (plugins module)
The copas_vision_loader sniffer specifications (assay run folder organisation)
This sniffer expects to run on a folder containing the COPAS Vision results for one plate with the following structure:
- A 'COPAS_Exp' folder containing the original COPAS Vision files (.bxr4, .lmd, .txt, .dcimg)
- A 'Dispensed_TIFFS' folder containing 'expected_image_number' images. All images must be of the same format (e.g. tiff) as indicated by their file extension.
- An 'All_profiles' folder containing the exported txt channel profiles
Naming convention:
- The run folder name (given as input) must be the plate name and is validated using the 'sample_name_regex' (plate_name = <sample_name>_<sorting-date>_SC<replicate-number>), where sample_name is the plate's parent sample name, sorting-date is the date of the plate sorting (YYYYMMDD) and replicate-number is an incremental integer. The corresponding SequencingLibrary (with the same name) must exist in LabID.
- The COPAS Vision file names (.bxr4, .lmd, .txt, .dcimg) are expected to start with the plate_name (<plate_name>.*.bxr4)
- The image file names must end with the well position (e.g. F4), i.e. ".+<well_position>.tiff"
- The channel profile file names have the .txt extension, i.e. "ch<number>.txt"
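To make the validation concrete, here is a standalone sketch of the checks such a sniffer can perform. The plate-name regex below is an assumption derived from the naming convention above, not the plug-in's actual implementation:

```python
import re
from pathlib import Path

# Assumed plate-name pattern: <sample_name>_<YYYYMMDD>_SC<replicate>
# (illustrative only; the real sniffer takes its regex as a parameter).
PLATE_NAME_RE = re.compile(r"^[\w-]+_20\d{6}_SC\d+$")
REQUIRED_DIRS = ("COPAS_Exp", "Dispensed_TIFFS", "All_profiles")

def validate_run_folder(run_dir: Path) -> list[str]:
    """Return a list of validation errors (empty means the folder qualifies)."""
    errors = []
    plate_name = run_dir.name
    if not PLATE_NAME_RE.match(plate_name):
        errors.append(f"run folder name {plate_name!r} violates the naming convention")
    for sub in REQUIRED_DIRS:
        if not (run_dir / sub).is_dir():
            errors.append(f"missing required folder {sub!r}")
    # COPAS Vision file names must start with the plate name
    copas = run_dir / "COPAS_Exp"
    if copas.is_dir():
        for f in copas.iterdir():
            if not f.name.startswith(plate_name):
                errors.append(f"unexpected COPAS file name {f.name!r}")
    return errors
```

Only when the returned error list is empty would the assay and its datasets be registered; otherwise the errors are reported to the user.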
The automation aspect is achieved by automatically running the labid register batch command periodically (e.g. every hour) using a job scheduler like cron. A few aspects must be considered when setting up the periodic task:
- Which user should be used to connect to LabID?
This user is used to submit the data and therefore ends up as the data owner. Here we recommend creating a technical user with a permanent API key. This avoids potential permission issues when the user owning the data leaves the lab.
In our setup, we created a new member of the Vincent Group named vincentagent.
- Where should the monitored directory be created?
As explained in the CLI register batch command use case, the assay run folders (see box above) must be placed in a monitored directory that is given to the labid register batch command using the --root-dir option. LabID requires this monitored directory to be in the dropbox of the user submitting the data.
In our setup, we created a copasvision directory in the vincentagent dropbox.
- Which user should be used to execute the periodic task?
The user running the periodic task must be able to read the files dropped in the copasvision directory in the vincentagent dropbox. This user should therefore be a lab member. It makes sense to use a technical user to avoid issues when lab members leave the lab.
In our setup, we have a unique labidloader (fake name for security reasons) user to execute all the periodic tasks for all the groups.
Finally, the labid register batch periodic task should be installed on a stable server (you may ask your IT department for this) and run with the vincentagent user credentials using the --config-path option.
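Putting the pieces together, the periodic task can be a simple crontab entry on that server. This is a sketch with hypothetical paths; --root-dir and --config-path are the options discussed above:

```cron
# Hypothetical crontab entry for the labidloader user: run every hour,
# monitor the copasvision folder in the vincentagent dropbox, and connect
# to LabID with the vincentagent credentials (paths are made up).
0 * * * * labid register batch --root-dir /dropbox/vincentagent/copasvision --config-path /home/labidloader/vincentagent-config.yml
```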
Automated registration of the sequencing assays
A similar approach (using the CLI register batch command) may be used to register the sequencing assays and data with a custom sniffer. However, at EMBL we already integrated the Genomics Core Facility with LabID such that sequencing assays are automatically pre-registered in LabID.
Overview of the sequencing data transfer set up at EMBL
Briefly, the Genomics Core Facility transfers the demultiplexed sequencing data together with assay metadata (as a JSON document) to the LabID data repository, and an Illumina Sequencing Assay in initialized state is automatically created in LabID on behalf of the data owner. The data owner receives an email containing a link to the initialized assay page, where the assay registration can be finalized using the interactive data registration wizard (pre-filled with the information received from the Genomics Core Facility). In particular, the wizard allows linking datasets to samples and defining the dataset-study relationships.
This means that, at EMBL, the plate sequencing results and details are automatically registered in LabID; still, the registration of these initialized assays must be completed, i.e. validated. This validation can also be automated using the CLI validate batch command and a custom sniffer. The CLI code base contains the PooledIlluminaInitializedAssayValidator sniffer that can be extended to achieve the needed behavior:
- define the regular expression to validate the sample name
- enforce that the SequencingLibrary exists in LabID
- provide a custom implementation of the def dir_qualifies(self, dir_path: Path) method (see sniffer documentation), used to first assess whether the initialized assay should be processed by this sniffer. We indeed need to execute the sniffer only on sequencing assays that qualify as a single-cell sequencing plate, and ignore other assays without raising an error email.
The custom TRECSequencingPlateAssayValidator sniffer class is available as a plugin (trec_scplate_validator) in the LabID CLI project (plugins module)
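In isolation, the qualifying logic can be sketched as follows. The base class here is only a stub standing in for the LabID CLI class, and the plate-name pattern is an assumption for illustration:

```python
import re
from pathlib import Path

# Assumed single-cell plate pattern: <name>202<5 digits>SC<digit>...
# (illustrative; the real sniffer takes its regex as a parameter)
SC_PLATE_RE = re.compile(r"^\w+202\d{5}SC\d.+$")

class PooledIlluminaInitializedAssayValidator:
    """Stub standing in for the LabID CLI base class."""
    def dir_qualifies(self, dir_path: Path) -> bool:
        return True

class TRECSequencingPlateAssayValidator(PooledIlluminaInitializedAssayValidator):
    def dir_qualifies(self, dir_path: Path) -> bool:
        # Process only folders that look like a single-cell sequencing
        # plate; quietly skip everything else (no error email raised).
        return SC_PLATE_RE.match(dir_path.name) is not None
```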
The TRECSequencingPlateAssayValidator sniffer
Using the following command line, one can learn details about the TRECSequencingPlateAssayValidator:
labid get sniffers -n TRECSequencingPlateAssayValidator
Sniffer: TRECSequencingPlateAssayValidator [AssaySniffer]
Supported technology: sequencing
Supported platforms: ['ILLUMINA']
Supported sniffer parameters:
- sample_name_regex : a regex with one group to extract the pooled table name (default: ^([\w]+202\d{5,5}SC\d).+$)
- sample_must_exist : True or False. If True the sample must exist in LabID (default: True)
- sample_format_regex : optional pattern to reformat the extracted pooled sample name. The new name is obtained by concatenating all matched groups with the sample_format_spacer. Default: ^([\w]+)(202\d{5,5})(SC\d).+$
- sample_format_spacer : spacer to use to concatenate all matched groups from sample_format_regex. Default: _
- dataset_per_sample : expected dataset number per plate (default: 96)
This sniffer validates a multiplexed, plate-based, single-cell assay pre-registered in the initialized state in LabID. A unique sequencing library representing all wells of the input plate is created and all datasets originating from this plate are linked to this 'POOL' library. Optionally (sample_must_exist), the plate (i.e. the sequencing library) must already exist in LabID. This sniffer also expects the plate (i.e. the sequencing library) to be named according to a predefined pattern.
Note: the sample name validation regex can be passed using the CLI -p options: -p "sample_name_regex=^([\w]+202\d{5,5}SC\d).+$"
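These parameters can be checked in isolation; the following sketch applies the default regexes listed above to a made-up dataset name:

```python
import re

# Defaults taken from the sniffer parameters above; the dataset name is made up.
sample_name_regex = r"^([\w]+202\d{5,5}SC\d).+$"
sample_format_regex = r"^([\w]+)(202\d{5,5})(SC\d).+$"
sample_format_spacer = "_"

dataset_name = "TaraAA20230415SC1_S1_L001_R1_001.fastq.gz"

# 1. extract the pooled sample (plate) name
pooled_name = re.match(sample_name_regex, dataset_name).group(1)
# → "TaraAA20230415SC1"

# 2. reformat it: concatenate all matched groups with the spacer
groups = re.match(sample_format_regex, dataset_name).groups()
formatted = sample_format_spacer.join(groups)
# → "TaraAA_20230415_SC1"
```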
Here again, the automation aspect is achieved by automatically running the labid validate batch command periodically, as the labidloader user and using the vincentagent user credentials to connect to LabID.
Enforcing naming convention and other pre-requisites
A sniffer extracts metadata about the assay, the sample names and their relationships with the datasets and represents this information using the LabID object model. Many options are possible when a sniffer inspects the data. For example, a sniffer may check that:
- the data is organized in the expected folder structure,
- the folders contain the expected number of files,
- the files are named according to the expected convention,
- items such as samples or instruments already exist in LabID, and reuse them instead of creating new ones
In this project, the assay data is not registered when the expected plate is not found as a registered sequencing library in LabID. Instead, an error is reported to the user (using the --error-email option of the CLI commands). This ensures that users follow the data management plan and naming conventions.
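This gatekeeping behaviour boils down to a check-then-report pattern, sketched below; find_sequencing_library stands in for an actual LabID lookup, and the returned error string stands in for the --error-email notification:

```python
# Sketch of the gatekeeping logic: register only when the plate already
# exists as a SequencingLibrary; otherwise report an error instead.
# find_sequencing_library stands in for an actual LabID lookup.
KNOWN_LIBRARIES = {"TaraAA_20230415_SC1"}

def find_sequencing_library(name: str) -> bool:
    return name in KNOWN_LIBRARIES

def process_run_folder(plate_name: str) -> str:
    if not find_sequencing_library(plate_name):
        # in the real setup this message is emailed via --error-email
        return f"error: plate {plate_name!r} is not a registered SequencingLibrary"
    return f"registered assay for plate {plate_name!r}"
```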