S3 Dataset Synchronisation

Understanding S3 synchronisation

Pre-requisites to synchronise datasets to S3

  1. An online S3 bucket needs to be available; creating it is up to the user. The bucket can be hosted on any S3-compatible cloud provider (e.g. AWS, Minio, etc.).
  2. An S3 access key and secret key need to be configured on the bucket for LabID access (both read and write); again, this is up to the bucket's owner to perform.
  3. The S3 storage volume needs to be configured in LabID by an administrator (see Adding a cloud storage volume). Please get in touch with your LabID administrator for this and provide them with the S3 credentials from step 2.
  4. Datasets need to have at least one local and online copy.

Datasets need to have the status ONLINE to be sent to S3. Datasets that have the status NEW (or, more generally, that do not have any online copy) cannot be sent to S3.

When sending a dataset to S3, LabID will create a new copy of the datafile(s), and register them with the dataset.

Creating an S3 bucket and S3 credentials

  • Bucket: The process of creating an S3 bucket is not covered in this documentation, as it is heavily dependent on the cloud provider you are using.
  • Credentials: Credentials for an S3 bucket are a pair of an access key and a secret key. Access keys can generally be created for a bucket and associated with specific data access rules.

Data organisation on the S3 bucket receiving the data

When sending a dataset to S3, the user can choose to "flatten" the destination directory structure, which creates datafile copy paths built uniquely from the dataset and datafile identifiers (<dataset_id>/<datafile_id>). Flattening is useful when you want, e.g., to further use this data programmatically together with LabID metadata. Alternatively, the original LabID group data directory structure can be mimicked to obtain a more "human-readable" file organisation in the destination directory.
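As an illustration, the two path-building strategies can be sketched as follows. This is a minimal sketch: the function names and inputs are illustrative, and the actual paths are computed by LabID itself.

```python
# Minimal sketch of the two destination-path strategies.
# Function names and inputs are illustrative, not LabID's API.

def flattened_path(dataset_id: int, datafile_id: int) -> str:
    """'Flatten & generate unique names': <dataset_id>/<datafile_id>."""
    return f"{dataset_id}/{datafile_id}"

def mirrored_path(group_relative_path: str) -> str:
    """'Keep source directory structure': reuse the LabID group data path."""
    return group_relative_path

print(flattened_path(42, 1337))                        # -> 42/1337
print(mirrored_path("projectA/study1/raw/sample.fastq"))
```

The flattened form trades readability for stable, identifier-based keys that are easy to resolve back to LabID records programmatically.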


Sending datasets to S3

You can send an arbitrary list of datasets to S3 from the dataset list page. You can also leverage the same feature from the assay, study, or project detail pages to send to S3 all datasets associated with an assay, a study, or a project.

Select datasets

  1. Select datasets on the dataset list page.
  2. Open the Copy datasets to S3 interface by clicking the Send to S3 button in the table toolbar (located at the top right of the dataset list table).
Example of the "Copy datasets to S3" interface, which lists all the selected datasets and reports which can *a priori* be sent to S3. ONLINE datasets you have access to are usually eligible for synchronisation. Non-eligible datasets are listed in the *Non-Transferable Datasets* box; in this example the selected dataset cannot be transferred because no ONLINE copy is available.

Fill in the copy options form

  1. The destination S3 storage volume: only S3 volumes configured in LabID to which you have been granted access are available in the list.
  2. The destination directory on the S3 bucket. All selected datasets will be copied to this directory. It can be a new or an existing directory; if it does not exist, it will be created.
  3. The desired copy strategy, regarding file and folder organisation on the destination volume:
    • Keep source directory structure: The original LabID group data directory structure is mimicked on the destination directory.
    • Flatten & generate unique names: The destination directory structure will be flattened, and datafile copy paths will be built uniquely using the dataset and the datafile identifiers (<dataset_id>/<datafile_id>).
  4. The strategy to adopt when some data already exists at the destination, e.g. when performing a second synchronisation after adding datasets to a study or a project. Available strategies are the following:
    • Skip: Existing files at destination will not be overwritten, and the new datafile copy will not be sent to S3. This is useful when you want to avoid sending data that you know already exists on S3.
    • Fail if exists: Existing files at destination will not be overwritten, and the synchronisation will fail. This is useful when you want to ensure that no data is overwritten on S3 (the data is not supposed to be present on the S3 bucket in the specified destination directory).
    • Overwrite: Existing files at destination will be overwritten with the new datafile copy.
    • Overwrite if newer: Existing files at destination will be overwritten only if the new datafile copy is newer than the existing one.
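The four strategies amount to a small decision rule. The sketch below is an assumption-laden illustration of that rule (the names are hypothetical, and "newer" is assumed to be decided by modification time):

```python
# Sketch of the overwrite-strategy decision rule described above.
# Names are illustrative; "newer" is assumed to mean a more recent
# modification time, which may differ from LabID's actual comparison.
from enum import Enum

class OverwriteStrategy(Enum):
    SKIP = "skip"
    FAIL_IF_EXISTS = "fail_if_exists"
    OVERWRITE = "overwrite"
    OVERWRITE_IF_NEWER = "overwrite_if_newer"

def should_copy(strategy, exists_at_destination,
                source_mtime=None, dest_mtime=None):
    """Decide whether a datafile copy is uploaded to S3."""
    if not exists_at_destination:
        return True  # nothing to overwrite: always copy
    if strategy is OverwriteStrategy.SKIP:
        return False
    if strategy is OverwriteStrategy.FAIL_IF_EXISTS:
        raise FileExistsError("destination object already exists")
    if strategy is OverwriteStrategy.OVERWRITE:
        return True
    # OVERWRITE_IF_NEWER: replace only when the source is more recent
    return source_mtime > dest_mtime
```

For example, `should_copy(OverwriteStrategy.SKIP, True)` returns `False`, while `Fail if exists` raises instead of silently skipping.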

Data file copies are identified by their full S3 path

The same datafile can exist at different locations (paths and/or names) on a bucket. The S3 synchronisation only considers the final S3 path to check whether a data file already exists on the S3 bucket. In other words, both the destination directory and the copy strategy must be the same for an existing copy to be detected on the destination S3 bucket; otherwise, a new data file copy is always created on the S3 bucket.
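In other words, a copy's identity is simply its full S3 object key: the destination directory joined with the strategy-derived relative path. A small sketch of that identity (assuming POSIX-style key joining; not LabID's actual code):

```python
# The full S3 key is the copy's identity: change either the destination
# directory or the strategy-derived relative path and you get a new copy.
import posixpath

def destination_key(destination_dir: str, relative_path: str) -> str:
    """Join the destination directory and relative path into one S3 key."""
    return posixpath.join(destination_dir.strip("/"), relative_path)

# Same datafile, different destination directories: two distinct copies.
print(destination_key("backups/2024", "42/1337"))  # backups/2024/42/1337
print(destination_key("exports", "42/1337"))       # exports/42/1337
```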

Once the form is completed, you are presented with the expected destination directory tree and can confirm the transfer by clicking the Copy to S3 button at the bottom of the pop-up.

S3 character compatibility

Certain characters that can be found in file paths are not supported by S3 storage systems. LabID automatically transforms file paths that contain such characters to ensure compatibility with S3 object naming requirements. This transformation occurs before the data is copied and is reported in the destination directory tree preview, so you can see how your file names will appear on S3 before launching the copy.

Character transformations

  • Brackets [ ] are replaced by parentheses ( )
  • Braces { } are replaced by parentheses ( )
  • Backslash \ is replaced by hyphen -
  • Special characters like ^, %, ", >, <, ~, #, | are replaced by hyphens or underscores

For example, a file named sample[1]_data{final}.fastq would become sample(1)_data(final).fastq on S3.
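A sketch of such a transformation is shown below. Note the assumptions: the documentation says characters like `^`, `%`, `"`, `>`, `<`, `~`, `#`, `|` become "hyphens or underscores" without specifying which, so a hyphen is assumed here, and this is not LabID's actual implementation.

```python
# Illustrative sketch of the character transformations described above.
# LabID's exact replacement for ^ % " > < ~ # | may be a hyphen or an
# underscore depending on the character; a hyphen is assumed here.
S3_CHAR_MAP = str.maketrans({
    "[": "(", "]": ")",          # brackets -> parentheses
    "{": "(", "}": ")",          # braces -> parentheses
    "\\": "-",                   # backslash -> hyphen
    "^": "-", "%": "-", '"': "-",
    ">": "-", "<": "-", "~": "-",
    "#": "-", "|": "-",
})

def s3_safe_name(name: str) -> str:
    """Return a file name compatible with S3 object naming."""
    return name.translate(S3_CHAR_MAP)

print(s3_safe_name("sample[1]_data{final}.fastq"))
# sample(1)_data(final).fastq
```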

Programmatic API access

For developers or automated workflows, S3 dataset copying is available through the LabID REST API. The functionality follows a two-step process:

  1. Resolve destination paths - Validate requests and resolve file paths based on your chosen strategy
  2. Execute copy operation - Perform the actual data transfer to S3
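The two steps above might be driven by payloads along these lines. Everything here is a placeholder — field names, values, and structure are assumptions, not LabID's actual API schema; refer to the API documentation for the real endpoints and request bodies.

```python
# Hypothetical payload builders for the two-step API process.
# All field names and values are assumptions; consult the LabID API docs.

def resolve_paths_payload(dataset_ids, s3_volume, destination_dir,
                          copy_strategy, overwrite_strategy):
    """Step 1 - validate the request and resolve destination file paths."""
    return {
        "dataset_ids": dataset_ids,
        "s3_volume": s3_volume,
        "destination_directory": destination_dir,
        "copy_strategy": copy_strategy,            # e.g. "flatten" or "keep_structure"
        "overwrite_strategy": overwrite_strategy,  # e.g. "skip", "overwrite_if_newer"
    }

def execute_copy_payload(resolved_paths):
    """Step 2 - perform the actual transfer using the resolved paths."""
    return {"resolved_paths": resolved_paths}

payload = resolve_paths_payload([42], "my-s3-volume", "exports/2024",
                                "flatten", "skip")
print(sorted(payload))
```

Separating resolution from execution lets a client preview and validate the destination paths (as the interactive directory-tree preview does) before committing to the transfer.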

API endpoints

The S3 dataset copying functionality is documented in detail in the API documentation.

About the displayed directory tree

The destination tree is a representation of how the selected data will be organised on the destination volume, based on the copy and overwrite strategies. It does not show the data already located on the destination volume (which would potentially be overwritten).