Data Preparation
Common data preparation steps to perform in LabID.
Step 1: Prepare the Study's dataset content
The data to submit comprises sample descriptions, protocols and files (fastq or image files) associated with one Study (in LabID, a Study is very similar to a study in ISA standards terms). One study should contain data of one experimental type only. For example, in the NGS case, we don't mix data from ChIP-seq with RNA-seq, HiC or DNA-seq, so if you have to submit different experimental types, you will need to organize them into different studies. One exception to this rule is single-cell multiome studies, where e.g. the RNA-seq and ATAC-seq data are linked to the same study.
1. Make sure to have a study with only the datasets to be submitted
Your study should only contain the datasets to be submitted. If your study contains more datasets than the ones to be submitted, the easiest approach is to (1) create a new study and (2) associate all relevant datasets with it using the batch edit option.
From the study page containing more datasets than the ones to be submitted, select the datasets to be submitted and use the Batch Edit option to associate them with a new study.
Once in the batch edit component, select the new study to associate the selected datasets to.
We also encourage you to edit the QC flag on all datasets, in particular on unsubmitted datasets that might be "QC failed". For all failed cases, we encourage you to indicate the failure reason in the dataset description. You can also use the Batch Edit option to set the QC flag of multiple datasets.
2. Consolidate the study name and description
Make sure to have a good name and description for your study as these will be the public ones. Please note that the study description is different from your manuscript summary. The study description should explain what the study is about and its technical aspects (design, experimental factors and key processing aspects). It should not present scientific conclusions or be your paper abstract.
It is always a good idea to check the study description of other studies published in public repositories (or in LabID by filtering studies on e.g. the Array Express ID or ENA Study Acc annotations for NGS; or BioImage Archive Acc for images) to get an idea of the level of detail and the type of information that is expected.
A few examples:
- Comparison of ATAC seq between control wing imaginal discs and salm/salr knockdown wing imaginal discs
- Studies submitted by MODIS across the years, i.e. look for Charles Girardot in submitters
3. Position the study design terms, experimental factors and suggested annotations
Select all the relevant design terms. Design terms reflect how the study was designed and are therefore directly related to the experimental factors, where an experimental factor is a variable that is deliberately changed or measured in an experiment to study its effect on the outcome.
After selecting the design terms, add Experimental Factors and Suggested Annotations. Experimental factors are biological or technical variables that differentiate the samples or conditions in the study. Consequently, each selected design term should be reflected in the samples' annotations i.e., the samples (and their connected datasets) could be clustered into logical groups using the sample's annotations flagged as experimental factors.
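As an illustration of this grouping, here is a minimal Python sketch that clusters samples by the values of their experimental-factor annotations. The input structure (a plain dictionary mapping a sample name to its annotations), the annotation values and the function name `group_by_factors` are assumptions made for the example; they are not part of LabID.

```python
from collections import defaultdict


def group_by_factors(samples, factors):
    """Group sample names by the values of their experimental-factor annotations.

    samples: dict mapping a sample name to a dict of {annotation name: value}
             (hypothetical structure, for illustration only)
    factors: list of annotation names flagged as experimental factors
    """
    groups = defaultdict(list)
    for name, annotations in samples.items():
        # A missing annotation yields None in the key, which makes gaps visible.
        key = tuple(annotations.get(factor) for factor in factors)
        groups[key].append(name)
    return dict(groups)


# Made-up example data: two factors, three sequencing libraries.
samples = {
    "lib1": {"StrainOrLine": "wild type", "TreatmentTime": "0h"},
    "lib2": {"StrainOrLine": "wild type", "TreatmentTime": "0h"},
    "lib3": {"StrainOrLine": "mutant", "TreatmentTime": "0h"},
}
for key, names in group_by_factors(samples, ["StrainOrLine", "TreatmentTime"]).items():
    print(key, names)
```

If a group key contains `None`, at least one sample lacks a value for an experimental factor and should be annotated before submission.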
In addition to the experimental factors, you should add all relevant annotation types as suggested annotations and set, for each of them:
- The Object, which indicates which object should be annotated:
  - For NGS, we expect the sample annotations to be placed on SequencingLibrary items as they represent the actual input of the sequencing assay.
  - For imaging data, we expect the sample annotations to be placed on Sample or EMSample items as they represent the actual input of the light microscopy and electron microscopy assays, respectively.
  - It might be relevant to request annotations on other item types, for example, on Dataset to reflect key data analysis procedures that are under evaluation in the study.
- The Mandatoriness of the annotation, which will guide users (you!) to fill in the relevant information. Annotations representing the experimental factors must also be marked as mandatory.
Independently of the design terms/experimental factors, you should always annotate your samples with the following annotations (unless irrelevant) to comply with the FAIR principles:
- Sex (non-human samples) or Gender (human samples)
- Cell Line,
- Cell Type,
- StrainOrLine,
- Organism Part,
- Individual Genetic Characteristics (== Genotype),
- Genetic Modification (when samples are genetically modified),
- Developmental Stage,
- Age (with InitialTimePoint),
- Disease State,
- Individual.
Important: this information will be used to validate the study before submission, e.g. when exporting a MAGE-TAB document, a report is generated; we hope to move this directly into the UI soon.
Notable relationships (not exhaustive!) between experimental design and annotation types:
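Before exporting, the checklist above can be applied to your own sample sheet. The sketch below is only an illustration under stated assumptions: the annotation names are copied from the list, while the dictionary input, the `is_human` flag and the function name `missing_recommended` are hypothetical and do not reflect how LabID performs its validation.

```python
# Annotation names taken from the list above; Sex/Gender is handled separately.
RECOMMENDED_ANNOTATIONS = [
    "Cell Line",
    "Cell Type",
    "StrainOrLine",
    "Organism Part",
    "Individual Genetic Characteristics",
    "Genetic Modification",
    "Developmental Stage",
    "Age",
    "Disease State",
    "Individual",
]


def missing_recommended(annotations, is_human):
    """Return the recommended annotation names that have no value.

    annotations: dict of {annotation name: value} for one sample (hypothetical structure)
    is_human: True for human samples (Gender expected), False otherwise (Sex expected)
    """
    expected = ["Gender" if is_human else "Sex"] + RECOMMENDED_ANNOTATIONS
    return [name for name in expected if not annotations.get(name)]


# Made-up example: a non-human sample with only two annotations filled in.
print(missing_recommended({"Sex": "female", "Cell Type": "neuron"}, is_human=False))
```

The returned names are candidates only: some of them may be irrelevant for a given sample (for example Cell Line for a tissue sample), and deciding that remains up to you.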
| Design term | Suggested annotations |
|---|---|
| Binding site identification design | Antibody, AntibodyType, AntibodySource & Epitope |
| Compound treatment design | TreatmentType, TreatmentTime, TreatmentConcentration, Compound & Compound Dose |
| Genetic modification design / Genotype design | IndividualGeneticCharacteristics (== Genotype) & GeneticModification |
| Time series design | Time or Age with InitialTimePoint or DevelopmentalStage |
| Strain or line design | StrainOrLine |
| Organism part comparison design | OrganismPart |
| Development or Differentiation design | DevelopmentalStage |
| ... | ... |
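The table can also be read as a lookup from design terms to annotation types. The following Python sketch encodes the rows shown above and collects the suggested annotations for a set of selected design terms. The dictionary and the function name `suggested_annotations` are assumptions made for the example; like the table, the mapping is not exhaustive.

```python
# Rows copied from the table above (not exhaustive).
SUGGESTED_ANNOTATIONS = {
    "Binding site identification design": [
        "Antibody", "AntibodyType", "AntibodySource", "Epitope",
    ],
    "Compound treatment design": [
        "TreatmentType", "TreatmentTime", "TreatmentConcentration",
        "Compound", "Compound Dose",
    ],
    "Genetic modification design": [
        "IndividualGeneticCharacteristics", "GeneticModification",
    ],
    "Genotype design": [
        "IndividualGeneticCharacteristics", "GeneticModification",
    ],
    # For time series, these are alternatives: Time, or Age with
    # InitialTimePoint, or DevelopmentalStage.
    "Time series design": [
        "Time", "Age", "InitialTimePoint", "DevelopmentalStage",
    ],
    "Strain or line design": ["StrainOrLine"],
    "Organism part comparison design": ["OrganismPart"],
    "Development or Differentiation design": ["DevelopmentalStage"],
}


def suggested_annotations(design_terms):
    """Collect suggested annotation types for the given design terms, without duplicates."""
    result = []
    for term in design_terms:
        # Unknown design terms contribute nothing, since the table is not exhaustive.
        for name in SUGGESTED_ANNOTATIONS.get(term, []):
            if name not in result:
                result.append(name)
    return result


print(suggested_annotations(["Strain or line design", "Compound treatment design"]))
```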
4. Study contributors
Set the contributors with their role(s) (ORCID can be omitted). Not all authors of the paper need to be listed; focus on the persons who prepared the submitted data. We advise adding the following contributors:
- Submitter: the person(s) who will be in charge of the submission process (usually a member of MODIS and the person(s) who generated the data)
- Co-investigator: the person who generated the data and helped prepare the data for submission
- Principal Investigator: the group leader. This person is usually the last author of the paper.
These persons should be able to answer questions about the data.
Step 2: Connect protocols to samples
Please make sure that all the samples directly linked to the study's datasets are connected to the relevant protocols. This includes:
- the samples directly connected to the assays (primary data); for example sequencing libraries for NGS data, EMSamples for electron microscopy or samples for light microscopy
- the samples linked to processed datasets (when relevant).
For each protocol, it is important to fill in the Protocol’s summary field as this is what we export by default. This should contain a summary of the procedure, as opposed to the protocol’s description that may contain formatted text, tables, pictures… When you followed a commercial protocol or a published protocol, the summary may be as simple as “The RNA-seq library was prepared according to manufacturer’s indications from kit reference” or “The ChIP-seq library was prepared according to Bob et al. 2020, doi:10.1000/xyz123”.
Once all protocols are in LabID, you can easily link them to the samples, EMSample or sequencing libraries using the Batch Edit option (see below).
Expected protocols for NGS data
For NGS data, each sequencing library should be connected to the following protocols:
- one protocol of type “Culture & Growth”, describing how you collected the samples; mandatory
- one protocol of type “Extraction”, describing how you extracted the nucleic acid (DNA, RNA, …); if not provided, this information must be reflected in the “Library Preparation” protocol (see below)
- one protocol of type “Library Preparation”, describing how the sequencing library was prepared; mandatory
More than one protocol of each type can be linked to the sequencing library, and protocols of other types (fixation, treatment, etc.) can be linked as well.
Expected protocols for image data
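The NGS protocol expectations above can be expressed as a small check. This is a hedged sketch, not LabID's validation logic: it assumes you have, for each sequencing library, the list of protocol types linked to it, and the function name `check_ngs_protocols` is hypothetical.

```python
# Protocol types marked as mandatory in the list above.
MANDATORY_NGS_PROTOCOL_TYPES = ("Culture & Growth", "Library Preparation")


def check_ngs_protocols(protocol_types):
    """Check the protocol types linked to one sequencing library.

    Returns a tuple (errors, warnings), each a list of messages.
    """
    linked = set(protocol_types)
    errors = [
        f"missing mandatory protocol of type '{ptype}'"
        for ptype in MANDATORY_NGS_PROTOCOL_TYPES
        if ptype not in linked
    ]
    warnings = []
    if "Extraction" not in linked:
        # Extraction is optional only if the Library Preparation protocol covers it.
        warnings.append(
            "no 'Extraction' protocol: the extraction must be described "
            "in the 'Library Preparation' protocol"
        )
    return errors, warnings


errors, warnings = check_ngs_protocols(["Culture & Growth", "Fixation"])
print(errors)
print(warnings)
```

Additional protocol types, such as the fixation protocol in the example call, are accepted silently because other types may be linked as well.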
Please note that we are still working on the protocol model for imaging data and no clear guidelines are available. For imaging data, each sample or EMSample might be connected to the following protocols:
- one protocol of type “Culture & Growth”, describing how you collected the samples; mandatory
- one protocol of type “Labeling”, describing how you labeled the samples with different fluorophores or dyes
- one protocol of type “Fixation”, describing how you fixed the samples
and any other protocols that you find relevant that reflect how the imaged samples were prepared.
Step 3: Annotate the Samples
Annotate your samples, EMSample or sequencing libraries with all relevant information, e.g. as defined by the experimental factors and suggested annotations (defined earlier in the study). This is a crucial step as the samples are the actual input of the sequencing or imaging assays, and therefore the information on them will be used to generate the metadata for the submission.
Additional expectations for sequencing libraries (NGS)
In addition, the Source, Strategy, Selection and Orientation (paired-end libraries) properties of the Sequencing Library Preparation section must be filled in. The other properties found in the Sequencing Library Preparation section must be set when relevant (e.g. for RNA-seq).
The easiest way to annotate the sequencing libraries is to:
- Go to the sequencing library list page (under the biomaterials menu) and set the context filter to filter on the current study: only the sequencing libraries linked to the study to submit will then be listed.
- Use the online batch edit to set common properties (e.g. sequencing library preparation properties) and protocols; see this tutorial at step 6.
- Use the Excel-based batch edit to annotate the sequencing libraries with all the relevant information; see this tutorial at step 3.
Additional information for NGS single-cell studies
For single cell studies, the sequencing libraries should be:
- Flagged as Is Single Cell in the Sequencing Library item and the relevant properties positioned
- Annotated with all relevant Read Layout xxx Slot annotations to expose the location in the reads of the UMI, Cell Barcode, Sample Barcode and cDNA.
The Read Layout xxx Slot annotations are used to describe the layout of the sequencing reads. The following slots are available:
- Read Layout Cell Barcode Slot: Position of the Cell Barcode
- Read Layout cDNA Slot: Position of the cDNA
- Read Layout cDNA 2 Slot: Second Position of the cDNA
- Read Layout UMI Slot: Position of the UMI, if any
- Read Layout Barcode Slot: Position of the Barcode, if any. Usually in index1 or index2 reads
- Read Layout Barcode 2 Slot: Second Position of the Barcode, if any. Usually in index1 or index2 reads
All these annotations accept values in the form "read/offset/length" where:
- read indicates the read name and must be one of: read1, read2, index1, index2.
- offset is an integer representing the offset from read start to the start of the slot (use 0 for no offset); and
- length is an integer describing the slot length ('*' can be used to represent variable length or 'till the end' of the read).
Example: "Read Layout cDNA Slot:read1/0/18" indicates the presence of an 18-base-long cDNA at the start of read1.
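To check slot values before entering them, the "read/offset/length" format described above can be parsed with a few lines of Python. This is a minimal sketch based only on the rules stated here; the names `ReadSlot` and `parse_read_slot` are hypothetical, and LabID may apply additional checks.

```python
import re
from typing import NamedTuple, Optional

VALID_READS = {"read1", "read2", "index1", "index2"}

# read / offset (integer) / length (integer or '*')
_SLOT_RE = re.compile(r"(?P<read>[^/]+)/(?P<offset>\d+)/(?P<length>\d+|\*)")


class ReadSlot(NamedTuple):
    read: str
    offset: int
    length: Optional[int]  # None stands for '*': variable length or until the end of the read


def parse_read_slot(value: str) -> ReadSlot:
    """Parse a slot value such as 'read1/0/18' and validate its three parts."""
    match = _SLOT_RE.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"expected 'read/offset/length', got {value!r}")
    read = match["read"]
    if read not in VALID_READS:
        raise ValueError(f"read must be one of {sorted(VALID_READS)}, got {read!r}")
    length = None if match["length"] == "*" else int(match["length"])
    return ReadSlot(read, int(match["offset"]), length)


print(parse_read_slot("read1/0/18"))  # ReadSlot(read='read1', offset=0, length=18)
print(parse_read_slot("read2/0/*"))   # ReadSlot(read='read2', offset=0, length=None)
```

Representing '*' as `None` keeps the length field numeric when it is known, so downstream code can distinguish a fixed-length slot from one that runs to the end of the read.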
Additional expectations for EM Samples (Electron Microscopy)
The EMSample model is rich, and many aspects of the sample preparation can be provided by activating the relevant sections.
PS: The electron microscopy submission guidelines are still under development. More information will be provided in the future.
Help! My samples are duplicated
In certain situations, samples may be wrongly duplicated. This situation needs to be addressed, and the samples need to be deduplicated (merged). A typical example is when an existing library is sequenced in two different runs and a second library was created for the second run instead of linking that run to the existing library.
Merging samples is demonstrated in the following tutorial