Registration of Raw Datasets 101 - Imaging data
- 30 min
- Moderate
Overview
This hands-on accompanies users through the different steps necessary to load raw imaging datasets. This includes:
- Creating a new imaging assay using assay templates
- Retrieving data from the user dropbox
- Navigating through the dataset loader wizard
- Linking datasets to Study(ies) and Samples
Reviewing the different registered (and affected) entities (datasets, datafile copies, studies, projects, samples, ...) is covered in the follow-up training, Post registration Operations.
Concepts
- Dataset
- «Dataset» is generally a broad concept meant to refer to any file, set of files, folder, or set of folders, describing a meaningful unit of biological data about some biological material for a given condition and/or treatment. A file (or a directory, etc.) containing data is referred to as a datafile. In many cases, a single datafile is enough to constitute the dataset, i.e. this single entity contains all parts of a meaningful unit of data about a sample. Conversely, in other cases, the meaningful unit is a combination of datafiles.
- For example: In single-end sequencing, a dataset is composed of a single file containing all short DNA sequences. Conversely, in paired-end sequencing, the meaningful unit of data is a combination of two files, each containing a part of the information.
- Raw dataset
- A raw dataset refers to the primary dataset acquired - during an assay - by «reading-through» some biological material, using a specialised scientific instrument. Ideally, the raw data is presented in the exact same format as outputted by the instrument. Raw formats vary among instruments and disciplines (e.g. sequencers commonly output FastQ [1] files, the raw text format used to store short DNA sequences).
- Assay
- An assay is a recording session during which some biological material (or sample) is «read-through» using a scientific instrument to obtain the raw data.
The datasets are linked to the data-generating assay and to their origin samples.
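The relationships between these concepts can be pictured with a small data model. This is only an illustrative sketch: the class and field names (`Datafile`, `Dataset`, `is_multipart`, and the example file names) are hypothetical and do not reflect how LabID stores these entities internally.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Datafile:
    """A file or a directory containing data (hypothetical model)."""
    path: str


@dataclass
class Dataset:
    """A meaningful unit of data made of one or more datafiles."""
    name: str
    datafiles: List[Datafile] = field(default_factory=list)
    assay: Optional[str] = None                        # the data-generating assay
    samples: List[str] = field(default_factory=list)   # the origin samples

    @property
    def is_multipart(self) -> bool:
        # More than one datafile is needed to form the meaningful unit.
        return len(self.datafiles) > 1


# Single-end sequencing: one file holds the whole unit of data.
single_end = Dataset("run1", [Datafile("run1.fastq")])

# Paired-end sequencing: two files together form the unit of data.
paired_end = Dataset(
    "run2",
    [Datafile("run2_R1.fastq"), Datafile("run2_R2.fastq")],
)
```

In this picture, a dataset always points back to its assay and samples, whatever the number of datafiles it contains.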
Walkthrough
In this walkthrough, we will be registering raw imaging datasets. These have been acquired from biomaterials using a microscope.
Step 1. Enter the dataset loader
From the left menu of the application: Click Import datasets then Import datasets
Start the dataset loader
You now have been redirected to the Dataset Loader Wizard.
The wizard assists users through the different steps needed to register datasets. All pages of the wizard display a progress map at the top. There are at most six steps: Select Assay Type, Select Data, Create Assay, Build Datasets, Verify, and Assign Samples.
The first page of the wizard is the Select Assay Type page.
Step 2. Select the assay type (and template) - Wizard (1/6)
From the Assay Type Selection, you need to select the type of assay corresponding to the type of data you are registering. You can also select an assay template to speed up the assay creation process.
Using templates
New assay creation from within the registration wizard does not have to be done from scratch; instead, one can choose to generate the new assay from:
- a template assay - the sole purpose of the template assay is to be re-used as a base for new assays,
- an existing assay, i.e. a previously registered assay.
In such cases, the new assay is initialised with values from the template. Unique values (e.g. flowcell information) cannot be transferred and have to be provided by the user.
- From the Light Microscopy Assays panel, select the Light Microscopy assay template named «Standard Nuclei Imaging (Light Microscopy Assay)»
- This opens up a preview of the template where the assay information can be reviewed.
Light Microscopy Assays panel on the Assay Type Selection Page
- Click Continue with template at the bottom right of the page.
You now have been redirected to the Select data page.
Step 3. Select data - Wizard (2/6)
On the Select data wizard page, you select the files and/or folders (datasets) that have to be loaded for the assay.
Using the Dropbox at EMBL
- The Dropbox approach is used to streamline data acquisition, mainly by avoiding heavy uploads and long data transfer times. In this hands-on, we use a serverless deployment. In such a setup, users cannot upload data to dropboxes, so the data has already been placed there. However, our production instance has access to group shares, where managed folders were deployed on setup. This means users can directly browse to the Dropbox and drop their data there. This data will then be accessible from within the dataset loader.
Choosing the right dropbox: A user can belong to more than one group. Dropboxes are located on group shares. When a user belongs to more than one group, they will own more than one dropbox, one on each group share. It is very important to upload the data to the dropbox of the group that will own the data.
- (1) Make sure the dropbox folder of your group is selected. Be sure to select the group that will be the data owner (see «Using the Dropbox at EMBL» above).
- This loads the file explorer with file information from your dropbox.
- (2) Select the folder containing raw datasets (i.e. the «run» folder of the assay). In our example, the folder is named raw-imaging
- This populates the right panel of the file explorer with file information from the run folder
Wizard - Select Data
- (3) Select the 3 folders as datasets (plate2well1, plate2well2, and plate2well3). To do this, either click the checkbox before each folder once, or click only the parent checkbox twice.
Selecting individual files vs folders
Selecting individual files is recommended when each CZI file should become an individual dataset. Selecting folders is recommended when each folder should become a dataset, i.e. when each folder contains multiple files that together constitute a dataset.
Creating a separate dataset per CZI file is recommended when each CZI file may be managed individually; for example when the individual files can be deleted or shared individually, or when each file corresponds to a different sample. Conversely, creating a dataset per folder is recommended when the files in each folder are tightly linked and should always be managed together; for example when the files in each folder correspond to different channels of the same acquisition, or when the files in each folder correspond to different z-slices of the same acquisition.
In this hands-on, we assume each folder contains multiple files that together constitute a dataset.
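The difference between the two selection strategies can be sketched as two ways of grouping the same file paths. This is an illustration only, not the wizard's implementation; the function names and the CZI file names (`ch1.czi`, `ch2.czi`) are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def datasets_per_file(paths):
    """One dataset per file: each file can be managed on its own."""
    return {p: [p] for p in paths}


def datasets_per_folder(paths):
    """One dataset per folder: files in a folder are managed together."""
    groups = defaultdict(list)
    for p in paths:
        # The name of the parent folder becomes the dataset key.
        groups[PurePosixPath(p).parent.name].append(p)
    return dict(groups)


# Hypothetical content of the run folder used in this hands-on.
files = [
    "raw-imaging/plate2well1/ch1.czi",
    "raw-imaging/plate2well1/ch2.czi",
    "raw-imaging/plate2well2/ch1.czi",
]

per_file = datasets_per_file(files)      # three datasets, one file each
per_folder = datasets_per_folder(files)  # two datasets, grouped by well
```

With the folder strategy, the two channel files of `plate2well1` end up in the same dataset, which matches the assumption made in this hands-on.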
- When done with selecting datasets, click Continue at the bottom right of the page.
You now have been redirected to the New Assay page. This has been initialised with the information extracted from the template.
Step 4. Create the new assay - Wizard (3/6)
On the New Assay page:
- Update the name of the assay to make it different and unique. The key idea is to be able to find it easily later.
New Assay Page
- Click Continue
You now have been redirected to the Build Datasets page.
Step 5. Stage datasets - Wizard (4/6)
The Dataset Builder panel on the Build datasets page presents a default collection of raw datasets and lists all datasets that have been selected on the previous page. This builder is used to assemble datasets into multipart datasets.
Why is this step needed?
The dataset builder is a powerful tool to build multipart (or multi-dimensional) datasets (i.e. datasets composed of two or more datafiles) and organise datasets into different collections.
It can also be used to adjust the dataset names with regex. Advanced dataset builder features are addressed in other parts of the training material.
In this walkthrough, the data is simple: datasets have one dimension and all datasets are gathered into a single collection. These features can therefore be skipped here.
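As a rough picture of what a regex-based name adjustment does, the sketch below strips a plate prefix from the folder names selected in Step 3. It uses Python's standard `re` module; the function name `strip_prefix` and the pattern are assumptions made for illustration, and the regex syntax accepted by the dataset builder itself may differ.

```python
import re


def strip_prefix(name, pattern=r"^plate\d+"):
    """Remove a leading 'plate<number>' prefix from a dataset name."""
    return re.sub(pattern, "", name)


folder_names = ["plate2well1", "plate2well2", "plate2well3"]
dataset_names = [strip_prefix(n) for n in folder_names]
print(dataset_names)  # ['well1', 'well2', 'well3']
```

Only the dataset names change in such an operation; the folders on disk keep their original names.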
- Name the collection, e.g. Light Microscopy Images - Training
Wizard - Build Datasets - «Dataset Builder» Panel
- As raw data is being registered to a new assay, it is mandatory to have at least one raw data collection.
- Click Stage All
- Once staged, datasets appear in the Staged datasets bottom panel, where their names can be adjusted manually.
- Improve the default dataset names by removing plate2 from the names
- Use descriptive names; they will later be used to identify the datasets in the system (e.g. for searching). Note that these are the dataset names, not the names of the folders or files on disk. File or folder names on disk cannot be changed here (nor after the registration), and should be modified prior to registration if needed.
Wizard - Build Datasets - «Staged Datasets» Panel
- Register individual CZI files into the datasets. Here, each dataset is a folder containing multiple CZI files. We could register the datasets without further modifications (each dataset would then contain a single datafile of type folder). Alternatively, we could register each CZI file as a separate datafile, such that each dataset contains multiple datafiles of type CZI file. This latter operation is called expanding a dataset and is done using the Expand button. When clicking this button, all files contained in the folder that match the pattern given in the attached text box (*.czi in the picture below, orange box) are registered as individual datafiles within the dataset.
Wizard - Build Datasets - «Expanded Datasets»
The expand pattern is *.czi (orange box). In the picture, the top dataset was already expanded using a *.czi pattern and the Datafiles column lists the identified files instead of the parent folder name.
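The effect of expanding a dataset can be approximated with shell-style pattern matching. The sketch below assumes the pattern behaves like a glob, which the `*.czi` form suggests but the wizard does not state explicitly; the function name `expand` and the file names are hypothetical.

```python
from fnmatch import fnmatchcase


def expand(folder_listing, pattern="*.czi"):
    """Return the files of a folder listing that match the pattern."""
    return sorted(f for f in folder_listing if fnmatchcase(f, pattern))


# Hypothetical content of one dataset folder.
listing = ["tile_01.czi", "tile_02.czi", "notes.txt"]

# Before expanding: the dataset holds one datafile (the folder itself).
# After expanding: each matching file becomes its own datafile.
datafiles = expand(listing)
```

Files that do not match the pattern, such as `notes.txt` here, are left out of the resulting list of datafiles in this sketch.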
- After you have expanded all three datasets, you should see something like this:
Wizard - Build Datasets - Resulting view
- When ready, click Continue
You now have been redirected to the fifth page of the wizard, the Verify page.
Step 6. Verify - Wizard (5/6)
The Verify page mainly serves to double-check the information that has been assembled so far. Importantly, it is also the place to indicate:
- The Study for which the datasets have been generated,
- Whether samples should be assigned to the datasets,
- A sample type.
Here we are loading raw data associated with an assay, therefore samples have to be assigned. The sample type cannot be modified as it is constrained by the Light Microscopy assay type.
- Create a study:
- Click Create
- This opens a pop-up with a «New Study» form.
- Fill in the study form
- Name: Study of TraineeXX
- Project: Tea/Coffee Project
- (Optionally) Fill in any other field.
- Click Create Study to confirm and close the pop-up
- The new study now appears selected in the Study dropdown
Wizard - Verify
«Light microscopy study - Training». Since we are registering raw data, assigning samples is mandatory. The sample type cannot be modified as it is constrained by the Light Microscopy assay type.
- When ready, click Continue
You now have been redirected to the final page of the wizard, the Assign Samples page.
Step 7. Assign samples - Wizard (6/6)
The Assign Samples page is used to assign a sample to every dataset.
Create new samples or use existing ones
There are two different situations: either the samples already exist (e.g. they have been created previously within an experiment, or they were created in the past within a different project but are used again to acquire new information), or the samples do not exist. If the samples do not exist, they can be created automatically. By default, the dataset name is proposed as the sample name, but it can be adapted as demonstrated below.
Also on this page:
- The owner of each dataset can be modified, for instance when the data is loaded on behalf of different users.
- The Study can also be adjusted per dataset, in case more than one study was selected on the verify page.
Wizard - Assign Samples
Here we will let new samples be created automatically. Click the Submit button.
You now have been redirected to a confirmation page stating the data is being loaded: Success
The data is now registered and available in LabID. To learn how to find and modify this data, follow the post-registration operations hands-on.
Continue to the dataset registration hands-on (Register Raw Datasets 102) to learn how to create samples prior to registration and assemble datafiles into multidimensional datasets with the dataset builder.
[1] «FASTQ format» article on Wikipedia: https://en.wikipedia.org/wiki/FASTQ_format