S3 Dataset Synchronisation¶
Datasets registered in LabID can be synchronised to S3 buckets. When sending a dataset to S3, LabID creates a new copy of its datafile(s) and registers them as additional datafile copies of the dataset. The dataset remains the single source of truth, and LabID then lists at least two copies for it: the original copy and the copy on S3.
One-way on-demand synchronisation¶
We support only one-way, on-demand synchronisation of datasets to S3. A dataset sent to S3 is known to LabID (date of creation, ownership information, destination path, etc.) but remains otherwise disconnected: no automated synchronisation is performed between LabID and S3, so changes applied on either side are not automatically reflected on the other.
- If the original data changes, the copy on S3 will not be updated unless the user explicitly triggers a new synchronisation.
- If the data is deleted on S3, LabID will not be aware of it and will still list the S3 copy as it was last synchronised.
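The one-way semantics above can be sketched as a plain file copy: nothing watches either side afterwards, and only an explicit re-trigger propagates changes. This is a minimal illustration of the behaviour, not LabID's actual implementation; the function name and local paths standing in for an S3 destination are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

def sync_to_destination(source: Path, destination: Path) -> None:
    """Copy the source datafile to the destination (hypothetical sketch).

    Nothing keeps the two sides in sync afterwards; the caller must
    invoke this function again to propagate later changes.
    """
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, destination)

# Demonstration: changes to the original are not reflected on the
# destination until a new synchronisation is explicitly triggered.
workdir = Path(tempfile.mkdtemp())
original = workdir / "dataset.csv"
s3_copy = workdir / "bucket" / "dataset.csv"

original.write_text("v1")
sync_to_destination(original, s3_copy)

original.write_text("v2")           # the original changes...
assert s3_copy.read_text() == "v1"  # ...but the copy is now stale

sync_to_destination(original, s3_copy)  # explicit re-sync
assert s3_copy.read_text() == "v2"
```

The same reasoning applies in reverse: deleting `s3_copy` here would leave `original` untouched, just as a deletion on S3 goes unnoticed by LabID.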
In a nutshell
- Bucket creation and cost are the user's responsibility; buckets are not managed by LabID.
- Synchronisation duplicates the data. Logic is in place to handle the case where a datafile copy already exists at the destination: existing copies can be either overwritten or skipped.
- Synchronisation can take a long time, depending on the volume of data to copy.
- Programmatic access is available via REST API for automated workflows (see API documentation).
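The overwrite-or-skip conflict handling mentioned above can be sketched as a small helper. This is a hypothetical illustration under assumed names (`copy_datafile`, the `on_conflict` parameter); LabID's actual conflict handling and API may differ.

```python
import shutil
import tempfile
from pathlib import Path

def copy_datafile(source: Path, destination: Path, on_conflict: str = "skip") -> str:
    """Copy one datafile, resolving an existing destination copy.

    on_conflict: "overwrite" replaces the existing copy,
                 "skip" leaves it untouched.
    Returns the action taken: "copied", "overwritten", or "skipped".
    (Hypothetical helper; not LabID's actual implementation.)
    """
    if destination.exists():
        if on_conflict == "skip":
            return "skipped"
        shutil.copy2(source, destination)
        return "overwritten"
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, destination)
    return "copied"

workdir = Path(tempfile.mkdtemp())
src = workdir / "data.bin"
dst = workdir / "bucket" / "data.bin"
src.write_text("payload")

assert copy_datafile(src, dst) == "copied"                       # first sync
assert copy_datafile(src, dst) == "skipped"                      # default: skip
assert copy_datafile(src, dst, on_conflict="overwrite") == "overwritten"
```

Defaulting to "skip" is the conservative choice for a re-run: it avoids rewriting large files that are already at the destination, which also helps with the long sync times noted above.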