Dataset Model

Data Model Overview:

LabID operates on a hierarchical structure consisting of three layers: datasets, datafiles, and datafile copies.

  • Datasets: At the top layer, datasets serve as the primary logical unit of data. They can comprise one or more datafiles.
  • Datafiles: Datafiles represent individual pieces of data within a dataset. They contain the actual data, for example files stored on disk.

  • Datafile Copies: The system tracks the different copies of the same data. Each copy has a status that describes the availability of the data. A datafile copy can have the following statuses:

    Online: The copy is readily available on disk

    Offline: The copy is not readily available on disk (but we know where it last was)

    New: The copy has just been created. This is usually a transitional status: information about the data has been recorded, but the data itself is still being transferred and is therefore not yet "online".

    Error: An error is associated with the copy itself. This should be addressed and solved. This could for example reflect a failure in the data transfer.

Datafiles and datasets also have statuses. The datafile status is derived from the statuses of the different copies of that datafile. Similarly, the dataset status aggregates the statuses of its datafiles.
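The three layers can be pictured as nested records. The following is only a minimal sketch of that hierarchy, not LabID's actual schema: the class names, the `location` and `name` fields, and the example paths are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Status(Enum):
    """The four statuses described above."""
    ONLINE = "online"
    OFFLINE = "offline"
    NEW = "new"
    ERROR = "error"


@dataclass
class DatafileCopy:
    # Hypothetical field: where this copy lives (or last lived).
    location: str
    # A freshly recorded copy starts as "new".
    status: Status = Status.NEW


@dataclass
class Datafile:
    name: str
    copies: List[DatafileCopy] = field(default_factory=list)


@dataclass
class Dataset:
    name: str
    datafiles: List[Datafile] = field(default_factory=list)


# A dataset with one datafile that has a copy on disk and a copy on tape.
# Both locations are made-up placeholders.
datafile = Datafile(
    name="measurement_001",
    copies=[
        DatafileCopy("/group/share/measurement_001", Status.ONLINE),
        DatafileCopy("tape-archive/measurement_001", Status.OFFLINE),
    ],
)
dataset = Dataset(name="experiment_A", datafiles=[datafile])
```

Keeping the status on the copy, rather than storing it separately on the datafile and dataset, mirrors the description above: the higher-level statuses are derived values.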

Examples of datafile copy tracking

  • Example 1: A dataset has a single datafile. This datafile has a single copy that can be directly accessed: the dataset, the datafile and the copy are all online.

  • Example 2: A dataset has a single datafile that has been archived. There are therefore two copies: the original copy that is still online and the archived copy on tape that is offline (because tape is not directly accessible). Overall, the dataset and the datafile are still online because one copy is still readily accessible.

  • Example 3: Following on from example 2, the original online copy has now been deleted from disk to free up space, so it is also offline. Since both copies are now offline, the dataset and the datafile also become offline.

Lifecycle of a Datafile copy:

New: Upon import, each datafile is initially assigned the status "New." This indicates that the file has been registered in the system but has not yet been copied to its final protected location on the group share.

Transfer: The datafile undergoes a transfer process from its original location to the protected disk group share. This process occurs asynchronously and may take some time, particularly for large datasets. The status remains "New" until the data is fully transferred.

Online: Once the copying process is complete and the datafile is accessible in its final location, its status is updated to "Online." This signifies that the datafile is ready for use and can be accessed by the user.

Offline: This indicates that the datafile copy is unavailable, usually because it was deleted from this location on disk. This generally means another copy has been made (an archived copy).

Error: If errors occur during the transfer process, the copy's status is set to "Error". This indicates that there are problems with the datafile copy and that further investigation or action may be required.
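The lifecycle above can be read as a small state machine. The sketch below encodes only the transitions this section describes (New to Online, New to Error, Online to Offline). The names `ALLOWED_TRANSITIONS` and `advance` are hypothetical, and LabID may well permit other transitions, such as restoring an offline copy or resolving an error, that are not modelled here.

```python
# Transitions described in this section only; anything else is rejected.
# This table is an illustration, not LabID's actual rule set.
ALLOWED_TRANSITIONS = {
    "new": {"online", "error"},  # transfer finished, or transfer failed
    "online": {"offline"},       # copy deleted from this location on disk
    "offline": set(),            # no outgoing transition described here
    "error": set(),              # resolution steps are not described here
}


def advance(current: str, target: str) -> str:
    """Return the new status if the transition is one described above."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(
            f"transition {current!r} -> {target!r} is not described in this section"
        )
    return target


# Typical path of a copy: imported, transferred, later removed from disk.
status = "new"
status = advance(status, "online")
status = advance(status, "offline")
```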

Status of Datafiles and Datasets:

  • Datafile Status: Each datafile has its own status, reflecting the statuses of its copies. If at least one copy has an error, the datafile also has the "Error" status.

  • Dataset Status: The dataset, composed of one or more datafiles, also has its own status, based on the statuses of its datafiles. If at least one datafile has an error, the dataset status is also set to "Error".
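The aggregation rules can be sketched as a single function applied at both levels. This is an illustration under stated assumptions, not LabID's implementation: the text confirms that any error propagates upward, that one online copy keeps a datafile online, and that all copies being offline makes it offline. The placement of "new" in the precedence order, the reuse of the same rule for datasets with several datafiles, and the function names are assumptions.

```python
from typing import Iterable, List


def aggregate(statuses: Iterable[str]) -> str:
    """Combine child statuses into one parent status."""
    statuses = list(statuses)
    if not statuses:
        # Assumption: a parent without children has no meaningful status.
        raise ValueError("cannot aggregate an empty list of statuses")
    if "error" in statuses:
        return "error"    # from the text: one error is enough
    if "online" in statuses:
        return "online"   # from the text: one accessible copy is enough
    if "new" in statuses:
        return "new"      # assumption: not specified in the text
    return "offline"      # everything is offline


def dataset_status(copies_per_datafile: List[List[str]]) -> str:
    """Aggregate copies into datafile statuses, then into a dataset status."""
    return aggregate(aggregate(copies) for copies in copies_per_datafile)


# The three tracking examples from earlier in this page, plus an error case.
example_1 = dataset_status([["online"]])
example_2 = dataset_status([["online", "offline"]])
example_3 = dataset_status([["offline", "offline"]])
with_error = dataset_status([["online", "error"]])
```

Under these assumptions, the three examples evaluate to "online", "online", and "offline", and the last case evaluates to "error" even though one copy is still online.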

Color code of statuses

The following screenshot displays the different dataset statuses and their colors.

Color code of dataset statuses