Data Storage Organization, Access & Usage¶
The group data repository¶
Once a dataset is registered, it is moved to the group data repository (or, more generally, to a LabID Data Library, as it is now possible to define multiple data storage locations per group). This repository is managed by LabID and may be located within the group share on the file system. This is the case at EMBL, for example, where it lets each group pay for its own storage usage.
Data ownership¶
All files and folders within this repository belong to LabID. This is a purely technical (UNIX-related) measure: it ensures the integrity of the data by protecting it against unwanted modifications (i.e. renaming, moving, deletion).
Despite this, the data remains directly accessible on the filesystem for all members of the group. This means the repository can be browsed with a standard file explorer, and the files can be read directly from analysis pipelines or UNIX commands.
As the file repository is on your group share, the group is effectively paying IT for the storage. LabID does not bill anything for this service.
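As a minimal sketch of what this direct, read-only access can look like from a pipeline: the helpers below check what the current user may do with a file and then read it in place. The function names and the example path are hypothetical and not part of LabID. The actual permissions depend on how your group share is configured.

```python
import os
from pathlib import Path


def describe_access(path):
    """Report whether the current user can read and write the given path."""
    p = Path(path)
    return {
        "exists": p.exists(),
        "readable": os.access(p, os.R_OK),
        "writable": os.access(p, os.W_OK),
    }


def count_lines(path):
    """Read a dataset file in place, without copying it out of the repository."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return sum(1 for _ in handle)


# Hypothetical usage (the path below is an assumption, not a real LabID layout):
#   info = describe_access("/path/to/group_share/repository/some_run/reads.txt")
#   if info["readable"]:
#       print(count_lines("/path/to/group_share/repository/some_run/reads.txt"))
```

For a group member, a file inside the repository would typically be expected to report as readable but not writable, since the files belong to LabID.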
Data repository structure¶
The repository has been designed to be logically organized, to allow seamless manual navigation:
- Raw data is stored in a run folder, which contains all the data generated by an assay.
User Dropbox¶
The dropbox is a special folder within this repository that is open for write access. It lets users deposit data to be registered within LabID, as part of the manual data import feature.
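Placing data in the dropbox is an ordinary file copy. The sketch below shows one cautious way to do it from a script; `stage_for_import` and the paths are hypothetical names chosen for illustration, and the registration step itself still happens in LabID through the manual data import feature.

```python
import shutil
from pathlib import Path


def stage_for_import(source, dropbox_dir):
    """Copy a file into the dropbox folder, refusing to overwrite an existing file."""
    source = Path(source)
    target = Path(dropbox_dir) / source.name
    if target.exists():
        raise FileExistsError(f"{target} already exists in the dropbox")
    # copy2 also preserves timestamps, which can help when tracing file origin
    shutil.copy2(source, target)
    return target


# Hypothetical usage (both paths are assumptions):
#   stage_for_import("results/sample_01.fastq.gz", "/path/to/repository/dropbox")
```

Refusing to overwrite is a design choice made here so that a file already waiting for import is never silently replaced.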
Datasets can be annotated¶
Just like samples, datasets can be annotated. This lets you and others easily locate them by e.g. organism, genotype, developmental stage, disease, etc.
LabID can thus effectively be used as a data warehouse indexing all raw and analysed datasets for future re-use.
Datasets can be sync'ed to Galaxy¶
All datasets of a given study can be made available on the EMBL Galaxy instance in just a few clicks. The datasets will be listed under your Galaxy group library, in the folder of your choice (named after the study by default).
- This does not duplicate the data: the datasets are read by Galaxy directly from the data repository.
- Syncing can take a few minutes and is executed as a background task. An email will be sent when the data is ready.
Datasets can be shared on S3¶
- All datasets of a given study, project or assay can be made available on an S3 bucket of your choice in just a few clicks, making it easy to share data with collaborators or external partners.