
Hands-on - Registration of Raw Datasets 102 - Sequencing Data

  • 40 min
  • Moderate

Overview

This hands-on is a follow-up to Hands-on - Register Raw Datasets 101. We strongly advise starting with Hands-on 101 if this is your first exposure to data registration.

Hands-on 102 accompanies users through the different steps necessary to load raw sequencing datasets. This includes:

  • Creating samples from an experiment
  • Creating a new sequencing assay from scratch
  • Retrieving data from the user dropbox
  • Navigating through the dataset loader wizard
  • Registering raw and derived datasets
  • Assembling multidimensional datasets with the dataset builder
  • Linking datasets to Study(ies) and existing Samples
Concepts
Dataset
«Dataset» is a broad concept that refers to any file, set of files, folder, or set of folders describing a meaningful unit of biological data about some biological material for a given condition and/or treatment. A file - or a directory, etc. - containing data is referred to as a datafile. In many cases, a single datafile is enough to constitute the dataset, i.e. this single entity contains all parts of a meaningful unit of data about a sample. Conversely, in other cases, the meaningful unit is a combination of datafiles.
For example: In single-end sequencing, a dataset is composed of a single file containing all short DNA sequences. Conversely, in paired-end sequencing, the meaningful unit of data is a combination of two files, each containing a part of the information.
Raw dataset
A raw dataset refers to the primary dataset acquired - during an assay - by «reading-through» some biological material, using a specialised scientific instrument. Ideally, the raw data is presented in the same exact format as outputted by the instrument. Raw formats vary among instruments and disciplines (e.g. sequencers commonly output FastQ 1 files, the raw text format used to store short DNA sequences).
Derived dataset
A derived dataset refers to a dataset that has been obtained by processing raw data.
Assay
An assay is a recording session during which some biological material (or sample) is «read-through» using a scientific instrument to obtain the raw data.
Dataset Collection
Datasets originating from the same assay can describe different aspects of the analysed samples, as well as have different properties, and/or formats. For example, a sequencer can yield both raw data (in the BCL format, and/or FastQ format) and processed data (aligned reads in the BAM or SAM format). For convenience, datasets can be separated into different collections to be handled independently (raw data is e.g. archived and processed data is used for downstream analysis).

The datasets are linked to the data-generating assay and to their origin samples. Both the assay and samples can either be generated beforehand or while importing the datasets.
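The concepts above can be summarised as a small data model. This is only an illustrative sketch: the class names, field names and file names are assumptions made for this example and do not reflect how the application stores its data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Dataset:
    """A meaningful unit of data made of one or more datafiles."""
    name: str
    datafiles: List[str]

    @property
    def dimension(self) -> int:
        # 1 for single-end data, 2 for paired-end data, etc.
        return len(self.datafiles)


@dataclass
class Collection:
    """Datasets from the same assay that are handled together."""
    name: str
    is_raw: bool
    datasets: List[Dataset] = field(default_factory=list)


# Hypothetical file names, for illustration only.
single_end = Dataset("sample_A", ["sample_A.fastq.gz"])
paired_end = Dataset("sample_B", ["sample_B_1.fastq.gz", "sample_B_2.fastq.gz"])

raw = Collection("Raw reads", is_raw=True, datasets=[single_end, paired_end])
derived = Collection("Aligned reads", is_raw=False,
                     datasets=[Dataset("sample_B", ["sample_B.bam"])])

for collection in (raw, derived):
    for dataset in collection.datasets:
        print(collection.name, dataset.name, dataset.dimension)
```

A single-end dataset has one datafile and a paired-end dataset has two, which is what the dataset builder later calls the number of datafiles per dataset.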

Walkthrough

In this walkthrough, we will be registering raw sequencing datasets. These have been acquired from biomaterials using a sequencer. We will also add a collection of derived datasets (obtained by processing the raw data).

Unlike in the first registration hands-on - where we let samples be generated automatically - we here want to use existing samples (of type Sequencing Library). This demonstrates the situation where one needs to describe how the libraries were obtained (in an ELN note). Libraries are created from the ELN note; when the datasets are registered, they are linked to these libraries.

In real life, days or weeks can pass between the moment you prepare the sequencing library and the moment you receive the sequenced data. To better track sample-preparation information, we advise recording it immediately, i.e. do not wait until you receive the sequencing data.

Step 1. Create samples from an Experiment

  • Create a new experiment named Sample Generation for 102 by traineeXX (replace XX with your trainee number)
    Creating experiments has already been covered in another section of this training. Please follow the instructions available in the hands-on «Electronic Lab Notebook - Explore lab notes».
  • Scroll to the bottom of the Experiment Detail page to reach the «Sample Editor» panel
  • Create 4 output samples of type Sequencing Library as shown in the picture below.
    Creating samples from experiments using the Sample Editor has already been covered in another section of this training. Please follow the instructions available in the hands-on «Sample Management - Sample Editor».
    • Name the samples xyz1, xyz2, xyz3 and xyz4
    • Assign the value NA to all barcodes. NA is accepted when you do not have the barcode information.
      Make sure you selected the right sample type at the top: Sequencing Library, or the barcode column will be missing.
    • Set the organism to your preferred one, e.g. E. coli
    • Pick your group project (Tea or Coffee Project)
    • Click Add
    • Stage All samples
    • Save and close the Sample Editor pop-up
Sample editor with 4 output Sequencing Libraries

4 output samples of type Sequencing Library in the sample editor

Step 2. Enter the dataset loader

From the left menu of the application: Click Import datasets then Import raw datasets

Start the dataset loader

Import raw datasets: Open the raw dataset loader by clicking «Import raw datasets» in the left menu

You now have been redirected to the Dataset Loader Wizard.

The wizard assists users through the different steps needed to register datasets. All pages of the wizard display a progress map at the top. There are at most six steps: Select Assay Type, Select Data, Create Assay, Build Datasets, Verify and Assign Samples.

The first page of the wizard is the Select Assay Type page.

Step 3. Select the Illumina Sequencing Assay Type - Wizard (1/6)

Wizard map
The wizard progress map at step 1

From the Assay Type Selection, you need to select the type of assay corresponding to the type of data you are registering.

  • Locate the Illumina Assay row within the Sequencing Assays panel
    Do not select a template or an existing assay as a template. Here we will create the assay from scratch.
  • Click Continue without template

Sequencing Assays panel on the Assay Type Selection Page

Sequencing Assays panel with the left orange arrow pointing to the Illumina Sequencing line.

You now have been redirected to the Select Data page.

Step 4. Select data - Wizard (2/6)

On the Select data wizard page, you select the files and/or folders (datasets) that have to be loaded for the assay, and will be used to create multi-dimensional datasets (paired-end). Here we will select both the raw data and some derived data. The derived data is a set of BAM 2 files that have been obtained by aligning the raw data to a reference genome.

Using the Dropbox at EMBL
The Dropbox approach is used to streamline data acquisition, mainly by avoiding heavy uploads and long data-moving times. In this hands-on, we use a serverless deployment. In such a setup, users cannot upload data to dropboxes, so the data has already been placed there for you. Our production instance, however, has access to group shares, where managed folders were deployed on setup. This means users can browse directly to the Dropbox and drop their data there. This data is then accessible from within the dataset loader.

Choosing the right dropbox: A user can belong to more than one group. Dropboxes are located on group shares. When a user belongs to more than one group, they will own more than one dropbox, one on each group share. It is very important to upload the data to the dropbox of the group that will own the data.

  • (1) Select the dropbox folder of your group in the selector. Be sure to select the group that will be the data owner (see «Using the Dropbox» above).
    This loads the file explorer with files information from your dropbox.
  • Select the datasets:

    • (2) Select the folder containing the data: sequencing
      Selecting the folder populates the right panel of the file explorer with files information from the run folder
    • (3a) Select the 4 aligned-data datasets (.bam)

    • (3b) Select the 8 raw-sequencing datasets (.txt.gz)
      The blue tag at the top right shows 12 datasets are selected
Wizard - Select Data

The Wizard «Select Data» view with a dropbox selected (orange arrow 1), run folder selected (orange arrow 2), 12 datasets selected.
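The selection made above (4 .bam files plus 8 .txt.gz files) can be pictured as a simple classification by file extension. The sketch below is not part of the application; the file names are hypothetical and only follow the naming pattern suggested by the examples in this walkthrough.

```python
from collections import Counter

# Hypothetical content of the "sequencing" run folder (names are assumptions).
run_folder = [
    f"XYZ_PE-DEMU_{sample:02d}_{read}_sequence.txt.gz"
    for sample in range(1, 5)
    for read in (1, 2)
]
run_folder += [f"XYZ_{sample:02d}.bam" for sample in range(1, 5)]
run_folder.append("README.md")  # a file we do not want to register


def kind(filename: str) -> str:
    """Classify a file as raw reads, aligned reads, or something else."""
    if filename.endswith(".txt.gz"):
        return "raw"
    if filename.endswith(".bam"):
        return "aligned"
    return "other"


selected = [name for name in run_folder if kind(name) != "other"]
counts = Counter(kind(name) for name in selected)

print(f"{len(selected)} datasets selected: {dict(counts)}")
```

With these assumed names, the total of 12 matches the number shown in the blue tag of the file explorer.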

You now have been redirected to the New Assay page.

Step 5. Create the new assay - Wizard (3/6)

On the New Assay page, fill the form up with all relevant information about the assay, namely:

  • Name the assay: Training paired-end illumina seq assay of traineeXX (replace XX with your trainee number)
  • Pick Multiplexed Yes and set 4 for the sample number
  • Select the Instrument Model: Illumina NovaSeq 6000
  • Write down the Flowcell ID: e.g. FLOWCELLXX (where XX is your trainee number)
  • Select the run type «paired-end»
New Assay Page

New Illumina Sequencing page. The blue arrow indicates the run type selector set to «paired-end». A flowcell also has to be indicated, as this is a mandatory field

You now have been redirected to the fourth page of the wizard, the Build datasets page.

Step 5. Build & Stage datasets - Wizard (4/6)

The Dataset Builder panel on the Build datasets page presents a default collection of raw datasets and lists all datasets that have been selected on the previous page.

Why is the dataset builder needed?

The dataset builder is a powerful tool to build multi-dimensional datasets (i.e. datasets composed of two or more datafiles) and to organise datasets into different collections. It can also be used to adjust the dataset names with regex. Advanced dataset builder features are addressed in more detail in other parts of the training material. In this walkthrough, the raw collection holds two-dimensional (paired-end) datasets, while the derived collection holds one-dimensional datasets.

The Dataset Builder is used here to:

  • Assemble two datafiles (read_1 and read_2) into a two-dimensional dataset (the paired-end dataset).
  • Use a regular expression (regex) to extract the relevant parts of filenames and automatically assign a better name to each dataset (Advanced controls - Dataset Name Extractor).

Step 5 - 1. Collection information

Dataset builder - Collection information

By default, the dataset builder is initialised with one collection. This collection is set to have 2 datafiles per dataset (a paired-end dataset is composed of both read 1 and read 2). Raw data is set to true. The assay type is set to NGSILLUMINAASSAY.

A mandatory collection of raw data is expected for new assays

Data is being loaded to a new assay, i.e. an assay that does not already have any dataset registered to it. It is therefore mandatory to submit at least one collection of raw data. Additional collections can be added.

The assay type cannot be changed as we have started the loader with an assay (the one we manually created).

Step 5 - 2. Assemble multiple datafiles into a single dataset

In line with the Number of datafile per dataset setting (2), the dataset builder table has two columns: one for read_1 and one for read_2.

Dataset builder - The 12 files are injected in all columns upon initialisation (only 9 are visible on this screenshot)

All selected datafiles are initially present in both columns. Each column can independently set a filter to exclude certain files. Combined with coherent file names and the alphabetical order used to sort the list of files in each column, this naturally aligns datafiles together, so that each row can be staged as a single dataset.

  • In the column Datafile 1

    • Set the Read Type to Read 1 (default)
    • Set the Filter Preset to Standard Genecore (or manually set the Filter to _1)
  • In the column Datafile 2

    • Set the Read Type to Read 2 (default)
    • Set the Filter Preset to Standard Genecore (or manually set the Filter to _2)
GeneCore@EMBL

GeneCore@EMBL uses _1 and _2 to differentiate read_1 and read_2. Using the Filter Preset just sets the related Filter in the input box. In case _1 and _2 are not specific enough to filter the read files, the filters can be adapted manually, e.g. to _1_sequence and _2_sequence respectively.

As seen on the picture below, the table now only lists the files that are passing the filter in each column (in our case only the .txt.gz files contain _1 or _2 in their name). Datafiles are now naturally aligned together into datasets and can be staged.

Using column filters to assemble paired-end data

Dataset builder - Files in columns are filtered using the column filter. The «Standard Genecore» filter preset was used because it corresponds to our data, i.e. GeneCore uses the pattern _1 and _2 to differentiate reads.
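The pairing mechanism described above (one filter per column, alphabetical sorting, then row-wise alignment) can be sketched in a few lines. This is an illustration of the idea under stated assumptions, not the application's code: the file names are hypothetical and the filter is assumed to be a plain substring match on the file name.

```python
from typing import List

# Hypothetical selection of 12 files (8 raw read files and 4 BAM files).
selected = [
    f"XYZ_PE-DEMU_{sample:02d}_{read}_sequence.txt.gz"
    for sample in range(1, 5)
    for read in (1, 2)
]
selected += [f"XYZ_{sample:02d}.bam" for sample in range(1, 5)]


def column(files: List[str], column_filter: str) -> List[str]:
    """Keep the files whose name contains the filter, sorted alphabetically."""
    return sorted(name for name in files if column_filter in name)


read_1 = column(selected, "_1")  # column "Datafile 1"
read_2 = column(selected, "_2")  # column "Datafile 2"

# Each row of the builder table becomes one paired-end dataset.
paired = list(zip(read_1, read_2))

# Files that pass neither filter stay unassigned (here, the BAM files).
unassigned = sorted(set(selected) - set(read_1) - set(read_2))

for first, second in paired:
    print(first, "<->", second)
print(len(unassigned), "files not assigned to any dataset")
```

With these assumed names, 8 files end up in 4 paired-end datasets and the 4 BAM files remain unassigned. A short filter such as _1 would also match a sample numbered _10 or _12, which is one reason to switch to a more specific filter such as _1_sequence when file names require it.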

Before staging datasets, we will use the builder advanced controls to automatically extract meaningful dataset names.

Step 5 - 3. Rename multiple datasets with regex

Here we want to demonstrate the option to automatically extract meaningful dataset names from datafile names or paths, using the advanced controls.

  • Open collection's Advanced Controls
Dataset builder - Button to open advanced controls
  • Use the Regular Expression (RE, regex) ^([A-z]+)_.*(_[0-9]+).*$ in the Dataset name extractor to extract relevant part within the dataset name
Using Regular Expression to give better dataset names

Advanced collection controls with a Regular Expression set.

Considering the pattern of our file names, using the Regular Expression ^([A-z]+)_.*(_[0-9]+).*$ renames our datasets from e.g. XYZ_PE-DEMU_01_sequence.txt to XYZ_01.

_1 and _2 are already removed from the auto-generated dataset name, before applying the dataset name extractor regex. This is because the default name is formed by concatenating only the common parts of the two datafile names.

Parts of the string are captured with parentheses (). These captured groups are concatenated to create the new dataset name. For a more detailed explanation of this regex, please visit regex101.com/xyz_01

  • Click Stage All
    Once staged, datasets appear in the Staged datasets bottom panel.
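The effect of the dataset name extractor can be reproduced outside the application with Python's re module. This is a minimal sketch: the function name is hypothetical, and falling back to the default name when the regex does not match is an assumption rather than documented behaviour of the wizard.

```python
import re

# The regex used in the Dataset name extractor above.
EXTRACTOR = re.compile(r"^([A-z]+)_.*(_[0-9]+).*$")


def extract_name(default_name: str) -> str:
    """Concatenate the captured groups to build the new dataset name."""
    match = EXTRACTOR.match(default_name)
    if match is None:
        # Assumption: keep the default name when the regex does not match.
        return default_name
    return "".join(match.groups())


print(extract_name("XYZ_PE-DEMU_01_sequence.txt"))  # XYZ_01
```

The first group captures the leading letters (XYZ), the second captures the last underscore followed by digits (_01), and everything else is discarded. Note that the range A-z also covers the six ASCII characters located between Z and a, underscore included; A-Za-z is the stricter way to write «letters only», and it gives the same result on this file name.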

Step 5 - 4. Stage the derived data

The state of the builder after staging some files

After staging the raw data, nothing appears to be left to stage in the Paired-end raw reads collection. This is because the filters _1 and _2 are still set and none of the remaining BAM files pass those filters (orange rectangle (a) on the picture above). This is okay because there is no more raw data to be staged in this collection.

State of the builder after staging some files

The Dataset Builder panel now displays a sum-up of what is staged and what is left to be staged at the top right (orange rectangle (b)). Here we see that, currently, 8 - out of the 12 originally selected - files are part of a dataset (the datasets that have just been staged in the previous step).

The sum-up also displays a warning stating that 4 files are not assigned to any dataset. These are the 4 BAM files that are going to be staged in a collection of derived data.

  • Click Create collection to create a second collection
    The new collection is initialised with the Number of datafile per dataset set to 1 single datafile per dataset and Raw data set to false. This is exactly what we want.
    The new collection is initialised without any column filter, so all files are listed in the table.
  • Name the new collection Aligned reads
  • Set the Dataset name extractor to ^([A-z]+)_*(_[0-9]+).*$
A second collection of derived data

State of the builder after staging some files
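The extractor set on the Aligned reads collection differs slightly from the one used for the paired-end collection (_* instead of _.*). The sketch below compares both on a BAM file name. The name XYZ_01.bam is purely hypothetical, since the actual BAM file names are not listed in this walkthrough, so take it only as an illustration of how the two patterns behave on a shorter name.

```python
import re
from typing import Optional

PAIRED_END = re.compile(r"^([A-z]+)_.*(_[0-9]+).*$")  # Paired-end raw reads
ALIGNED = re.compile(r"^([A-z]+)_*(_[0-9]+).*$")      # Aligned reads


def extract(pattern: "re.Pattern[str]", name: str) -> Optional[str]:
    """Return the concatenated captured groups, or None when nothing matches."""
    match = pattern.match(name)
    return "".join(match.groups()) if match else None


bam_name = "XYZ_01.bam"  # hypothetical file name

print(extract(PAIRED_END, bam_name))  # None
print(extract(ALIGNED, bam_name))     # XYZ_01
```

The paired-end pattern needs two underscores (one after the letters, one before the digits), so it does not match a name that has only one. The second pattern makes the first underscore optional, which yields the same XYZ_01 style of name as for the raw datasets.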

The Staged Datasets panel

The Staged datasets panel displays a global sum-up at the top-right. Collections are organised into tabs

Staged datasets panel after staging all files

All names have been extracted with regex and are meaningful. Datasets are ready!

You now have been redirected to the fifth page of the wizard, the Verify page.

Step 6. Verify - Wizard (5/6)

The Verify page is the place to indicate (1) the Study for which the datasets have been generated, (2) whether samples should be assigned to the datasets and (3) in such case, indicate a sample type. It is also the place and time to review the information that has been assembled so far.

In this example, we are loading raw datasets and it is mandatory to assign samples to them. The sample type cannot be modified as only one sample type (Sequencing Library) is available for Sequencing data. For the derived data, you can choose whether or not to assign samples; here we will do it.

  • For our Paired-end raw reads collection:

    • Assign an existing study:
      • Click Load
        This opens a modal in which the list of known studies is displayed.
        • Locate the study (using the personal filter and name column filters if needed)
          • For the «Tea Lovers»: «Paired-end WGS Darjeeling Tea genotypes for traineeXX»
          • For the «Coffee Lovers»: «Paired-end WGS Coffee arabica genotypes for traineeXX»
        • Select it by clicking the checkbox
        • Click Confirm to close the modal
          The study now appears in the Study dropdown
  • For our Aligned reads collection:

    • The study should have been set automatically to the same value as for the other collection
    • Toggle Assign Sample to true
  • Double-check the information

Wizard - Verify

The verify page displays a sum-up of the information assembled so far about dataset(s), collection(s), study(ies) and sample type. Here, we have selected the study named «paired-end-sequencing for training 102». Since we are registering raw data, assigning samples is mandatory. The sample type cannot be modified as there is only one accepted for Illumina sequencing data. Note that our read 1 and read 2 are assembled into paired-end datasets.

You now have been redirected to the final page of the wizard, the Assign Sample page.

Step 7. Assign samples - Wizard (6/6)

The Assign Sample page is used to assign a sample to every dataset.

Create new samples or reuse existing ones

There are two different situations: either the samples already exist (e.g. they have been previously created within an experiment, or created in the past for a different project but are used again to acquire new information), or they do not. In case they do not exist, they can be created automatically. By default, the dataset name is proposed as the sample name, but it can be adapted.

Also on this page:

  • The owner of each dataset can be modified, in case the data is being loaded for different users.
  • The Study can also be adjusted per dataset, in case more than one study was selected on the verify page.

Here we will use the samples created in Step 1, at the beginning of this walkthrough.

  • Click Load Samples at the top left of the Assign Sample page
  • In the modal, select the Personal list filter
  • Select the 4 samples created at step 1, then click Confirm
Load samples modal

These samples are now available to be selected in the sample column

  • Click the sorting icon in the collection column header twice to sort rows by collection
  • Select each relevant sample on each row
Select loaded samples
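In the wizard, the sample is picked manually on each row. To double-check which sample belongs to which dataset, the correspondence between dataset names (e.g. XYZ_01) and the sample names from Step 1 (e.g. xyz1) can be sketched as below. The matching rule (same letters ignoring case, same number ignoring leading zeros) is an assumption made for this example, as are the dataset names of the Aligned reads collection.

```python
import re
from typing import Tuple

samples = ["xyz1", "xyz2", "xyz3", "xyz4"]  # created in Step 1

# One row per staged dataset: (collection, dataset name).
rows = [("Paired-end raw reads", f"XYZ_{n:02d}") for n in range(1, 5)]
rows += [("Aligned reads", f"XYZ_{n:02d}") for n in range(1, 5)]


def key(name: str) -> Tuple[str, int]:
    """Normalise a name to (lowercase letters, number without leading zeros)."""
    match = re.fullmatch(r"([A-Za-z]+)_?0*([0-9]+)", name)
    if match is None:
        raise ValueError(f"Unexpected name: {name}")
    return match.group(1).lower(), int(match.group(2))


sample_by_key = {key(sample): sample for sample in samples}
assignment = {row: sample_by_key[key(row[1])] for row in rows}

for (collection, dataset), sample in assignment.items():
    print(f"{collection:22} {dataset} -> {sample}")
```

Each of the 4 samples ends up linked to two datasets: its paired-end raw reads and its aligned reads.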

You now have been redirected to a confirmation page stating the data is being loaded: Success

Registration success message
I mistakenly let new samples be created when I should have linked my existing samples. What do I do?

When this happens, one can merge both samples into one. Please refer to the hands-on «Sample Management - Sample Merging».


  1. «FASTQ format» article on Wikipedia: https://en.wikipedia.org/wiki/FASTQ_format 

  2. «Binary Alignment Map» article on Wikipedia: https://en.wikipedia.org/wiki/Binary_Alignment_Map