Writing Custom Sniffers
A Sniffer is a custom data parser in charge of validating and extracting the metadata that describes either a list of
datasets (DatasetSniffer) or a list of datasets together with their associated assay and samples (AssaySniffer).
The Sniffer represents this information as objects, using the object model
available in models.py from the stocks module of the LabID CLI project.
The different Sniffer types
The Sniffer Abstract Class Hierarchy
A DatasetSniffer is responsible for extracting Dataset descriptions from a target folder, optionally organizing them into a DatasetCollection.
This is achieved by implementing the abstract sniff_datasets() method. The optional dir_qualifies() method may be overridden to first decide whether
the folder is suitable for the sniffer, by returning a boolean.
When the DatasetSniffer's sniff_datasets() method is executed on a folder that does not match the sniffer's expectations (e.g. folder structure,
file/folder naming, missing expected information), an AssayStructureError should be raised.
When sniff_datasets() is executed on a folder that does match the sniffer's expectations but no Dataset can be
extracted (e.g. because of filtering conditions such as ownership), an empty list should be returned.
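The contract described above can be sketched as follows. This is a minimal illustration, not the actual LabID code: AssayStructureError, Dataset and DatasetSniffer are local stand-ins for the real classes of the stocks module (whose exact signatures may differ), and CsvDatasetSniffer with its 'data' sub-folder convention is purely hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from typing import List


class AssayStructureError(Exception):
    """Stand-in for the error raised when a folder does not match the sniffer."""


@dataclass
class Dataset:
    """Stand-in for the Dataset model; the real class carries more fields."""
    name: str
    path: Path


class DatasetSniffer(ABC):
    """Stand-in base class mirroring the contract described in the text."""

    def dir_qualifies(self, dir_path: Path) -> bool:
        # Optional hook: every folder qualifies unless a subclass overrides this.
        return True

    @abstractmethod
    def sniff_datasets(self, dir_path: Path) -> List[Dataset]:
        ...


class CsvDatasetSniffer(DatasetSniffer):
    """Hypothetical sniffer expecting a 'data' sub-folder holding CSV files."""

    def dir_qualifies(self, dir_path: Path) -> bool:
        return (dir_path / "data").is_dir()

    def sniff_datasets(self, dir_path: Path) -> List[Dataset]:
        if not self.dir_qualifies(dir_path):
            # The folder structure does not match: signal it with an error.
            raise AssayStructureError(f"{dir_path} has no 'data' sub-folder")
        # A matching folder without any CSV file yields an empty list, not an error.
        csv_files = sorted((dir_path / "data").glob("*.csv"))
        return [Dataset(name=p.stem, path=p) for p in csv_files]
```

The two failure modes are kept distinct on purpose: a caller can treat AssayStructureError as "wrong sniffer for this folder" and an empty list as "right sniffer, nothing to register".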
The AssaySniffer abstract class extends the DatasetSniffer abstract class.
An AssaySniffer is responsible for parsing run folders and extracting the Assay and Dataset descriptions corresponding to
one InstrumentRun (although the method signature suggests multiple runs could be returned, this is not yet supported).
The key method to implement is sniff_instrument_run_assays(), while the sniff_datasets() method inherited from
DatasetSniffer is implemented to return an empty list by default.
When the AssaySniffer's sniff_instrument_run_assays() method is executed on a folder that does not match the sniffer's
expectations (e.g. folder structure, file/folder naming, missing expected information), an AssayStructureError should be raised.
When sniff_instrument_run_assays() is executed on a folder that does match the sniffer's expectations but neither an Assay nor an InstrumentRun can be
extracted (e.g. because of filtering conditions such as ownership), an empty list should be returned.
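As a rough sketch of the AssaySniffer behaviour described above, the example below uses local stand-in classes rather than the real ones from the stocks module. The method signature, the run_info.txt file, its owner=<name> line and the RunInfoAssaySniffer class are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


class AssayStructureError(Exception):
    """Stand-in for the structure error described in the text."""


@dataclass
class Assay:
    name: str


@dataclass
class InstrumentRun:
    name: str
    assays: List[Assay] = field(default_factory=list)


class AssaySniffer(ABC):
    """Stand-in base class; the real one extends DatasetSniffer."""

    def sniff_datasets(self, dir_path: Path) -> list:
        # Default behaviour described in the text: no standalone datasets.
        return []

    @abstractmethod
    def sniff_instrument_run_assays(self, dir_path: Path) -> List[InstrumentRun]:
        ...


class RunInfoAssaySniffer(AssaySniffer):
    """Hypothetical sniffer reading 'key=value' lines from run_info.txt."""

    def __init__(self, owner: str):
        self.owner = owner

    def sniff_instrument_run_assays(self, dir_path: Path) -> List[InstrumentRun]:
        info = dir_path / "run_info.txt"
        if not info.is_file():
            # Expected file is missing: the folder does not match this sniffer.
            raise AssayStructureError(f"{info} is missing")
        params = dict(
            line.split("=", 1) for line in info.read_text().splitlines() if "=" in line
        )
        if params.get("owner") != self.owner:
            # Valid run folder, but filtered out on ownership: empty list.
            return []
        # A single run is returned, since multiple runs are not yet supported.
        return [InstrumentRun(name=dir_path.name, assays=[Assay(name=dir_path.name)])]
```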
Overview of the object model
Overview of the LabID Object Model
An Assay belongs to a single InstrumentRun or has a unique InstrumentModel. The Assay contains one or more
Dataset objects, each linked to one or more Sample objects. Each Dataset is made of one or more DatasetFile objects and may belong to a
single DatasetCollection. A DatasetFile contains one or more DatasetFileCopy objects (i.e. actual URIs of the DatasetFile).
The Assay class should never be used directly; the proper Assay subclass must be instantiated instead. When no information about the actual
instrument used to create the assay is available, an instrument model can be set instead of an instrument run.
Although the Sample class can be instantiated directly, you must still use the sub-type that matches your assay type.
For example, SequencingAssay types require SequencingLibrary as input and the *EMAssay types require EMSample as input.
In the context of a sniffer, exactly one DatasetFileCopy is expected for each DatasetFile. In addition, the overall set of Sample objects
should be given to the Assay as a Dict[str, Sample] where the key is the sample name, which implies that each sample name must be
unique in the context of an assay.
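The relationships above can be pictured with a few simplified dataclasses. These are stand-ins, not the real classes from models.py: the field names (copies, files, samples, datasets) and the index_samples helper are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sample:
    name: str


@dataclass
class DatasetFileCopy:
    uri: str  # actual location of the file


@dataclass
class DatasetFile:
    name: str
    copies: List[DatasetFileCopy]


@dataclass
class Dataset:
    name: str
    files: List[DatasetFile]
    samples: List[Sample]


@dataclass
class SequencingAssay:
    """Stand-in for a concrete Assay subclass."""
    name: str
    datasets: List[Dataset] = field(default_factory=list)
    samples: Dict[str, Sample] = field(default_factory=dict)


def index_samples(samples: List[Sample]) -> Dict[str, Sample]:
    """Build the name -> Sample mapping, rejecting duplicate sample names."""
    indexed: Dict[str, Sample] = {}
    for sample in samples:
        if sample.name in indexed:
            raise ValueError(f"duplicate sample name: {sample.name}")
        indexed[sample.name] = sample
    return indexed


# One DatasetFileCopy per DatasetFile, as expected in a sniffer context.
lib = Sample(name="lib01")
fastq = DatasetFile(name="lib01.fastq.gz",
                    copies=[DatasetFileCopy(uri="file:///runs/r1/lib01.fastq.gz")])
dataset = Dataset(name="lib01_reads", files=[fastq], samples=[lib])
assay = SequencingAssay(name="run_r1", datasets=[dataset],
                        samples=index_samples([lib]))
```

Keying the samples by name is what makes the uniqueness requirement explicit: two samples with the same name cannot both be stored in the mapping.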
Available sniffers
You can list the sniffers available in your installation with labid get sniffers:
> labid get sniffers
Sniffer: NanoporeAssaySniffer [AssaySniffer]
Sniffer: GeneCoreAssaySniffer [AssaySniffer]
Sniffer: SeqStrandAssayValidator [AssaySniffer]
Sniffer: TRECSequencingPlateAssayValidator [AssaySniffer]
Sniffer: CopasVisionAssayLoader [AssaySniffer]
To get details about a particular sniffer, use labid get sniffers -n NanoporeAssaySniffer:
Sniffer Detail Example: NanoporeAssaySniffer
> labid get sniffers -n NanoporeAssaySniffer
Sniffer: NanoporeAssaySniffer [AssaySniffer]
Supported technology: sequencing
Supported platforms: ['NANOPORE']
This sniffer looks into a directory expecting the native project_id structure created by nanopore sequencer:
- PROJECT_FOLDER => an option project_id folder regrouping multiple samples & runs
- LIBRARY/SAMPLE_FOLDER(s) => an optional library (potentially multiplexed) folder containing
results from 1 or more runs (tech replicates)
- RUN_FOLDER(s)
- barcode_alignment_* file: contains a single line when the sample is not a multiplexed library.
MANDATORY
- final_summary_* file: key=value summary of main parameters.
MANDATORY
- report_*.md file: a multi-section file with an initial JSON session holding all but more
params than final_summary_* file
MANDATORY
- duty_time_* file: we will assume this optional (as we dont use it)
- throughput_* file: we will assume this optional (as we dont use it)
- other_reports/*csv: additional files generate during run. We wont assume anything here
If base calling was OFF:
- fast5 | pod5 : a directory containing the raw fast5/pod5 signal. MANDATORY
If base calling was ON:
- sequencing_summary_* file: a big file listing all the reads sequenced and in which fastQ/fast5
file(s) they are in, this file is multi Gb big.
MANDATORY when base calling is true
- (fast|pod)5_fail & (fast|pod)5_pass: directories containing the fast5/pod5 files for failed
and passed reads.
A sub-directory layer 'barcode01-12' and 'unclassified' is present
if the library is multiplexed.
one of the two is MANDATORY
- fastq_fail & fastq_pass: optional directories (present if base calling was done) containing
the fastq files for failed and passed reads.
The run folder detection occurs by looking for the presence of files matching the pattern
'final_summary_*.txt' and/or 'report_*.md' (both must be found).
Each directory containing such a file will be parsed into a nanopore assay.
When the fast5/pod5 (and fastq) directories contain more than one file, the directories are registered as
'multi-fast5'/'multi-pod5'/'multi-fastq' dataset directories. When unique files are found per dir_path,
there are registered as pod5/fast5/fastq dataset files.
Each '(fast|pod)5', '(fast|pod)5_pass'/'(fast|pod)5_fail', 'fastq_pass' and 'fastq_fail'
will end up as a different dataset collection when found.
The 'sequencing_summary_* file' and verbose other metadata files are also registered as a dataset
An Assay is created for each *run*.
The instrument serial number (matching the instrument's code in LabID) is taken from the metadata
'sequencer_serial_number' when present. When absent, the serial number will be created as
'<device_type>-<host_product_serial_number>' to differenciate between gridion and p2 models (as both would
indicate the same 'host_product_serial_number')
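The run folder detection and the serial number fallback described in this output could look roughly like the sketch below. It is not the NanoporeAssaySniffer implementation: the function names find_run_folders and instrument_code are hypothetical, and the sketch assumes that both marker files must be present and that the metadata is available as a plain dict.

```python
from pathlib import Path
from typing import List


def find_run_folders(root: Path) -> List[Path]:
    """Return the directories (root included) that hold both marker files."""
    candidates = [root] + [p for p in root.rglob("*") if p.is_dir()]
    return sorted(
        d for d in candidates
        if any(d.glob("final_summary_*.txt")) and any(d.glob("report_*.md"))
    )


def instrument_code(metadata: dict) -> str:
    """Pick the serial number, falling back to '<device_type>-<host serial>'."""
    serial = metadata.get("sequencer_serial_number")
    if serial:
        return serial
    # Fallback described in the output: prefix the host serial with the device type.
    return f"{metadata['device_type']}-{metadata['host_product_serial_number']}"
```

Each folder returned by find_run_folders would then be parsed into one assay, since an Assay is created for each run.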
Using Sniffers
Sniffers are used with different commands of the CLI; for example