
Registration of Raw Datasets 101 - Imaging data

  • 30 min
  • Moderate
Overview

This hands-on accompanies users through the different steps necessary to load raw imaging datasets. This includes:

  • Creating a new imaging assay using assay templates
  • Retrieving data from the user dropbox
  • Navigating through the dataset loader wizard
  • Linking datasets to Study(ies) and Samples

Reviewing the different registered - and affected - entities (datasets, datafile copies, studies, projects, samples, ...) is covered in the follow-up training, Post registration Operations.

Concepts
Dataset
«Dataset» is a broad concept that refers to any file, set of files, folder, or set of folders describing a meaningful unit of biological data about some biological material for a given condition and/or treatment. A file (or a directory) containing data is referred to as a datafile. In many cases, a single datafile is enough to constitute the dataset, i.e. this single entity holds all parts of a meaningful unit of data about a sample. In other cases, the meaningful unit is a combination of datafiles.
For example: In single-end sequencing, a dataset is composed of a single file containing all short DNA sequences. Conversely, in paired-end sequencing, the meaningful unit of data is a combination of two files, each containing a part of the information.
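The relationship between datasets and datafiles can be sketched as a small data structure. This is only an illustration of the concept: the class names, attribute names, and file names below are assumptions made for the example and do not reflect how the application stores its data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Datafile:
    """A file or directory that contains data."""
    path: str


@dataclass
class Dataset:
    """A meaningful unit of data, made of one or more datafiles."""
    name: str
    datafiles: List[Datafile] = field(default_factory=list)

    @property
    def is_multipart(self) -> bool:
        # A dataset is multipart when it needs more than one datafile.
        return len(self.datafiles) > 1


# Single-end sequencing: one file holds the whole unit of data.
single_end = Dataset("sampleA", [Datafile("sampleA.fastq")])

# Paired-end sequencing: two files together form the unit of data.
paired_end = Dataset(
    "sampleB",
    [Datafile("sampleB_R1.fastq"), Datafile("sampleB_R2.fastq")],
)
```

Here `single_end.is_multipart` is false and `paired_end.is_multipart` is true, which mirrors the single-end versus paired-end distinction above.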
Raw dataset
A raw dataset refers to the primary dataset acquired during an assay by «reading-through» some biological material with a specialised scientific instrument. Ideally, the raw data is kept in exactly the format output by the instrument. Raw formats vary among instruments and disciplines (e.g. sequencers commonly output FASTQ files [1], the raw text format used to store short DNA sequences).
Assay
An assay is a recording session during which some biological material (or sample) is «read-through» using a scientific instrument to obtain the raw data.

The datasets are linked to the data-generating assay and to their origin samples.

Walkthrough

In this walkthrough, we will be registering raw imaging datasets. These have been acquired from biomaterials using a microscope.

Step 1. Enter the dataset loader

From the left menu of the application, click Import datasets, then Import raw datasets.

Start the dataset loader

Import raw datasets: Open the raw dataset loader by clicking «Import raw datasets» in the left menu

You have now been redirected to the Dataset Loader Wizard.

The wizard assists users through the different steps needed to register datasets. All pages of the wizard display a progress map at the top. There are at most six steps: Select Assay Type, Select Data, Create Assay, Build Datasets, Verify and Assign Samples.

The first page of the wizard is the Select Assay Type page.

Step 2. Select the assay type (and template) - Wizard (1/6)

Wizard map
The wizard progress map at step 1

From the Assay Type Selection, you need to select the type of assay corresponding to the type of data you are registering. You can also select an assay template to speed up the assay creation process.

Using templates

A new assay does not have to be created from scratch within the registration wizard. Instead, one can choose to generate the new assay from:

  • a template assay, whose sole purpose is to be re-used as a base for new assays,
  • an existing assay, i.e. a previously registered assay.

In either case, the new assay is initialised with values from the template. Unique values (e.g. flowcell information) cannot be transferred and have to be provided by the user.
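The idea of initialising an assay from a template can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function `new_assay_from`, the dictionary layout, and the field names are hypothetical, and «flowcell» simply stands in for any unique value, as in the example above. The application's actual logic may differ.

```python
import copy

# Assumption: "flowcell" stands in for any unique field; real field names may differ.
UNIQUE_FIELDS = frozenset({"flowcell"})


def new_assay_from(template, unique_values, unique_fields=UNIQUE_FIELDS):
    """Return a new assay initialised from `template`.

    Values of `unique_fields` are never copied from the template; the
    user must supply them in `unique_values`.
    """
    missing = set(unique_fields) - set(unique_values)
    if missing:
        raise ValueError(f"Missing unique values: {sorted(missing)}")
    # Deep-copy so that editing the new assay never alters the template.
    assay = {
        key: copy.deepcopy(value)
        for key, value in template.items()
        if key not in unique_fields
    }
    assay.update({key: unique_values[key] for key in unique_fields})
    return assay


# Hypothetical template and user input.
template = {"assay_type": "Sequencing", "read_length": 100, "flowcell": "FC-OLD"}
assay = new_assay_from(template, {"flowcell": "FC-NEW"})
```

The reusable values (`assay_type`, `read_length`) are carried over, while the unique value comes from the user and the template itself is left unchanged.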

  • From the Light Microscopy Assays panel, select the Light Microscopy assay template named «Standard Nuclei Imaging (Light Microscopy Assay)»
This opens up a preview of the template where the assay information can be reviewed.
Light Microscopy Assays panel on the Assay Type Selection Page

Light Microscopy Assays panel with a template selected. The template preview is displayed below the template selector (top orange arrow). The create button appears below (bottom orange arrow)

You have now been redirected to the Select data page.

Step 3. Select data - Wizard (2/6)

On the Select data wizard page, you select the files and/or folders (datasets) that have to be loaded for the assay.

Using the Dropbox at EMBL
The Dropbox approach is used to streamline data acquisition, mainly by avoiding heavy uploads and long data transfer times. In this hands-on, we use a serverless deployment. In such a setup, users cannot upload data to dropboxes, so the data has already been placed there. Our production instance, however, has access to group shares, where managed folders were deployed on setup. This means users can browse directly to the Dropbox and drop their data there. This data is then accessible from within the dataset loader.

Choosing the right dropbox: A user can belong to more than one group. Dropboxes are located on group shares. When a user belongs to more than one group, they will own more than one dropbox, one on each group share. It is very important to upload the data to the dropbox of the group that will own the data.

  • (1) Make sure the dropbox folder of your group is selected. Be sure to select the group that will be the data owner (see «Using the Dropbox» above).
    This loads the file explorer with file information from your dropbox.
  • (2) Select the folder containing raw datasets (i.e. the «run» folder of the assay). In our example, the folder is named raw-imaging.
    This populates the right panel of the file explorer with file information from the run folder.
Wizard - Select Data

The Wizard «Select Data» view with a dropbox selected (orange arrow 1), run folder selected (orange arrow 2), 3 folders selected (orange arrows 3).

  • (3) Select the 3 folders as datasets (plate2well1, plate2well2, and plate2well3). To do this, either click the checkbox before each folder once, or click only the parent checkbox twice.
Selecting individual files vs folders

Selecting individual files is recommended when each CZI file should become an individual dataset. Selecting folders is recommended when each folder should become a dataset, i.e. when each folder contains multiple files that together constitute a dataset.

Creating a separate dataset per CZI file is recommended when each CZI file may be managed individually; for example when the individual files can be deleted or shared individually, or when each file corresponds to a different sample. Conversely, creating a dataset per folder is recommended when the files in each folder are tightly linked and should always be managed together; for example when the files in each folder correspond to different channels of the same acquisition, or when the files in each folder correspond to different z-slices of the same acquisition.

In this hands-on, we assume each folder contains multiple files that together constitute a dataset.
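The two selection strategies can be compared with a short sketch. It only illustrates the grouping logic; the function names and the file names inside the folders are assumptions made for the example, not the contents of the training dropbox.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def datasets_per_file(paths):
    """One dataset per file: each file is managed on its own."""
    datasets = {}
    for p in paths:
        path = PurePosixPath(p)
        # Prefix with the folder name so equal file names do not collide.
        datasets[f"{path.parent.name}/{path.stem}"] = [p]
    return datasets


def datasets_per_folder(paths):
    """One dataset per folder: files in a folder are managed together."""
    grouped = defaultdict(list)
    for p in paths:
        grouped[PurePosixPath(p).parent.name].append(p)
    return dict(grouped)


# Hypothetical file names inside two of the folders used in this hands-on.
paths = [
    "raw-imaging/plate2well1/ch1.czi",
    "raw-imaging/plate2well1/ch2.czi",
    "raw-imaging/plate2well2/ch1.czi",
]

by_file = datasets_per_file(paths)
by_folder = datasets_per_folder(paths)
```

With these three files, `by_file` holds three datasets of one file each, while `by_folder` holds two datasets, one per well folder, which is the layout assumed in this hands-on.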

  • When done with selecting datasets, click Continue at the bottom right of the page.

You have now been redirected to the New Assay page, which has been initialised with the information extracted from the template.

Step 4. Create the new assay - Wizard (3/6)

On the New Assay page:

  • Update the name of the assay to make it different and unique. The key idea is to be able to find it easily later.
New Assay Page

New Assay page: Modify the name to make it findable (lower orange arrow). Save by clicking the Save Item button (upper orange arrow)

You have now been redirected to the Build Datasets page.

Step 5. Stage datasets - Wizard (4/6)

The Dataset Builder panel on the Build datasets page presents a default collection of raw datasets and lists all datasets that have been selected on the previous page. This builder is used to assemble datasets into multipart datasets.

Why is this step needed?

The dataset builder is a powerful tool to build multipart (or multi-dimensional) datasets (i.e. datasets composed of two or more datafiles) and to organise datasets into different collections.

It can also be used to adjust the dataset names with regex. Advanced dataset builder features are addressed in other parts of the training material.

Dataset builder - Advanced controls

In this walkthrough, the data is simple: datasets have one dimension and are all gathered into a single collection. These features can therefore be ignored here.

  • Name the collection: e.g. Light Microscopy Images - Training
Wizard - Build Datasets - «Dataset Builder» Panel

The dataset builder panel displays collection information. The default name was updated to something more readable. Raw data is ticked. The page displays errors until some datasets are staged.

  • As raw data is being registered to a new assay, it is mandatory to have at least one raw data collection.
  • Click Stage All
    Once staged, datasets appear in the Staged datasets bottom panel, where their names can manually be adjusted.
  • Improve the default dataset name by removing plate2 from the names
    Use meaningful names: they will later be used to identify the datasets in the system (e.g. for searching). Note that these are the dataset names, not the names of the folders or files on disk. File or folder names on disk cannot be changed here (nor after the registration) and should be modified prior to registration if needed.
Wizard - Build Datasets - «Staged Datasets» Panel

The Staged Datasets panel displays the datasets per collection, with their datafiles. Here, 3 datasets are staged, with one datafile (folder) each. The names were manually updated to remove the «plate2» prefix.
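When many datasets share the same prefix, the renaming can be reasoned about as a regular-expression substitution. The sketch below uses Python's `re` module to show the effect of removing the «plate2» prefix. The helper `strip_prefix` is hypothetical, and we do not assume the dataset builder uses the same regex syntax as Python.

```python
import re


def strip_prefix(name, prefix="plate2"):
    """Remove `prefix` from the start of `name`, if present."""
    # `^` anchors the match at the start; re.escape treats the prefix literally.
    return re.sub(rf"^{re.escape(prefix)}", "", name)


names = ["plate2well1", "plate2well2", "plate2well3"]
renamed = [strip_prefix(name) for name in names]
# renamed == ["well1", "well2", "well3"]
```

Only the dataset names change; the folders on disk keep their original names.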

  • Register individual CZI files into the datasets. Here, each dataset is a folder containing multiple CZI files. We could register the datasets without further modification (each dataset would then contain a single datafile of type folder). Alternatively, we could register each CZI file as a separate datafile, so that each dataset contains multiple datafiles of type CZI file. This latter operation is called expanding a dataset and is done using the Expand button. When this button is clicked, all files in the folder that match the pattern given in the attached text box (*.czi in the picture below, orange box) are registered as individual datafiles within the dataset.
Wizard - Build Datasets - «Expanded Datasets»

Expanding dataset folders into individual datafiles. Datasets can be individually expanded using the Expand button and a custom pattern such as *.czi (orange box). In the picture, the top dataset was already expanded using a *.czi pattern and the Datafiles column lists the identified files instead of the parent folder name.

  • After you expanded all three datasets, you should see something like this:
Wizard - Build Datasets - Resulting view

Result after expanding each dataset into multiple CZI files.
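To get a feel for what a pattern such as *.czi selects, the expansion can be imitated locally with Python's `pathlib`. This is a sketch under assumptions: the helper `expand` is hypothetical, the file names are invented, and we assume only files directly inside the folder are matched (whether the wizard also looks into subfolders is not stated here).

```python
import tempfile
from pathlib import Path


def expand(folder, pattern="*.czi"):
    """Return the sorted names of files directly inside `folder` matching `pattern`."""
    return sorted(p.name for p in Path(folder).glob(pattern) if p.is_file())


# Build a throwaway folder with hypothetical file names to try the pattern on.
with tempfile.TemporaryDirectory() as tmp:
    well = Path(tmp) / "plate2well1"
    well.mkdir()
    for name in ("a.czi", "b.czi", "notes.txt"):
        (well / name).touch()
    datafiles = expand(well)
    # datafiles == ["a.czi", "b.czi"]; notes.txt does not match the pattern
```

Trying a pattern this way before registration helps confirm that it picks up every intended file and nothing else.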

You have now been redirected to the fifth page of the wizard, the Verify page.

Step 6. Verify - Wizard (5/6)

The Verify page is mainly there to double-check the information that has been assembled so far. It is also the place to indicate:

  • The Study for which the datasets have been generated,
  • Whether samples should be assigned to the datasets,
  • A sample type.

Here we are loading raw data associated with an assay, therefore samples have to be assigned. The sample type cannot be modified as it is constrained by the Light Microscopy assay type.

  • Create a study:
    • Click Create
      This opens a pop-up with a «New Study» form.
      • Fill in the study form

        • Name: Study of TraineeXX
        • Project: Tea/Coffee Project
        • (Optionally) Fill in any other field.
      • Click Create Study to confirm and close the pop-up
        The new study now appears selected in the Study dropdown
Wizard - Verify

The Verify page displays a summary of the information assembled so far about dataset(s), collection(s), study(ies) and sample type. Here, we have selected the study named «Light microscopy study - Training». Since we are registering raw data, assigning samples is mandatory. The sample type cannot be modified as it is constrained by the Light Microscopy assay type.

You have now been redirected to the final page of the wizard, the Assign Samples page.

Step 7. Assign samples - Wizard (6/6)

The Assign Samples page is used to assign a sample to every dataset.

Create new samples or use existing ones

There are two different situations: either the samples already exist (e.g. they were created previously within an experiment, or within a different project, and are used again to acquire new information), or the samples do not exist. If the samples do not exist, they can be created automatically. By default, the dataset name is proposed as the sample name, but it can be adapted as demonstrated below.
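The default naming rule can be summarised in a short sketch: each dataset gets a sample named after it unless the user adapts the name. The function `propose_sample_names` and the override value are hypothetical and only illustrate the rule; they do not describe the application's internals.

```python
def propose_sample_names(dataset_names, overrides=None):
    """Map each dataset name to a sample name.

    By default the dataset name is proposed as the sample name;
    `overrides` holds the names adapted by the user.
    """
    overrides = overrides or {}
    return {name: overrides.get(name, name) for name in dataset_names}


proposed = propose_sample_names(
    ["well1", "well2", "well3"],
    overrides={"well3": "well3-control"},  # hypothetical user edit
)
# proposed == {"well1": "well1", "well2": "well2", "well3": "well3-control"}
```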

Also on this page:

  • The owner of each dataset can be modified to indicate whether the data is loaded for different users.
  • The Study can also be adjusted per dataset, in case more than one study was selected on the verify page.
Wizard - Assign Samples

The Assign Samples page displays a table where each row displays one dataset. New samples will be created for all datasets; by default, the dataset name is proposed as the sample name, but it can be adapted (orange arrow). The same study and owner were associated with all datasets.

Here we will let new samples be created automatically. Click the Submit button.

You have now been redirected to a confirmation page stating that the data is being loaded: Success

Wizard - Registration Success

Successful registration summary message

The data is now registered and available in LabID. To learn how to find and modify this data, follow the post-registration operations hands-on.

Continue to the dataset registration hands-on (Register Raw Datasets 102) to learn how to create samples prior to registration and how to assemble datafiles into multidimensional datasets with the dataset builder.


  1. «FASTQ format» article on Wikipedia: https://en.wikipedia.org/wiki/FASTQ_format