Publishing Sequencing Data to EGA

The procedure to publish data to EGA described below is performed using the command line client and requires advanced Unix skills.

The procedure to submit to EGA explained here follows the programmatic submission using XML documents. It is very similar to the ENA submission, with a few differences: EGA requires different metadata fields and additional information related to the data access committee (DAC), the data access policy, and the connection between the dataset and that policy.

Warning

The European Genome-phenome Archive (EGA) only accepts human data. Human data obtained on commercially available cell lines may still be submitted to the European Nucleotide Archive (ENA) instead.

Important

Submitting data to EGA requires providing EGA with a signed Data Processing Agreement, as explained here. Make sure to have this in place before you start the submission process.

You will need a Legacy Submission Account (ega-box-XXXX); these accounts are required for XML-based submissions. Files are transferred to legacy submission accounts by FTP via ftp.ega.ebi.ac.uk. To request an account, please fill out the following form. Once submitted, your request needs to be processed and validated by the EGA staff before you can use the account (wait for their feedback).

Importantly, XML-based metadata registration is not compatible with DACs and Policies registered through the DAC Portal, and your DAC and Policy must be submitted using the WEBIN Portal or via XML submission.

Once you have your legacy submission account, you can proceed with the submission process. We advise creating a file ega.yml containing the account information as shown here, and storing it in a private location:

username: ega-box-XXXX
password: 123456
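
Because ega.yml holds a password in plain text, one way to keep it private is to create it with owner-only permissions. The following is a minimal sketch, assuming a POSIX file system; the helper name write_credentials is hypothetical and not part of LabID. Values containing YAML-special characters (such as ":" or "#") would additionally need quoting.

```python
import os
from pathlib import Path


def write_credentials(path, username, password):
    """Write an ega.yml-style credentials file readable only by its owner."""
    path = Path(path)
    # Create (or truncate) the file with mode 0600 from the start.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(f"username: {username}\n")
        handle.write(f"password: {password}\n")
    # Enforce the mode even if the file already existed with looser permissions.
    os.chmod(path, 0o600)
    return path
```

For example, write_credentials("/home/username/private/ega.yml", "ega-box-XXXX", "123456") would produce the two-line file shown above (the directory is only an illustration).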

Step 1: Review the study by exporting a MAGE-TAB document

After you have checked the information in LabID and made sure that everything looks correct according to the Data Preparation instructions, we advise exporting the study as a MAGE-TAB document. It is sometimes challenging to get a complete overview of the study information in LabID. The MAGE-TAB document contains all the information about the study, including sample descriptions, protocols, and fastq files, and is therefore a good way to review the study information and spot errors or missing information. It also gives a good idea of what the study will look like to other people.

The command line to export a MAGE-TAB document for a study is as follows:

$ labid export study -s <study_id> --format magetab -o <output_dir> -f magetab.txt
where:

  • <study_id> is the UUID of the study you want to export,
  • <output_dir> is the directory where you want to save the MAGE-TAB document.

Tip

We advise creating a folder for your submission, e.g. ega_submission, and using this folder as the parent of the <output_dir>. This way, you will have all the files related to your submission in one place. Assuming you created a folder called /home/username/ega_submission, the <output_dir> could be /home/username/ega_submission/magetab/.

After successful export, the exported MAGE-TAB document (named magetab.txt in the example) will be found in the <output_dir> directory. It is a plain tab-delimited text file and can be opened in a spreadsheet editor (e.g., Excel) for further inspection.

Additionally, a report_<study_UUID>.txt file is generated in the <output_dir> directory. This file contains information about the export process, including any errors or warnings that were identified during the export. Please take a look at this report file when inspecting the MAGE-TAB document to ensure that everything is correct before proceeding with the data upload. If you spot mistakes, you can correct them in LabID and re-run the export command until you are satisfied with the result.
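
When the report is long, a quick scan can help locate the lines that need attention. The sketch below assumes that problem lines in the report mention the words "error" or "warning"; the actual report layout is not documented here, so treat the keywords and the helper name find_report_issues as hypothetical and adjust them to what your report contains.

```python
from pathlib import Path


def find_report_issues(report_path, keywords=("error", "warning")):
    """Return (line_number, text) pairs for report lines mentioning a keyword."""
    issues = []
    with Path(report_path).open(encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            lowered = line.lower()
            # Case-insensitive substring match; the keywords are an assumption.
            if any(keyword in lowered for keyword in keywords):
                issues.append((number, line.rstrip("\n")))
    return issues
```

An empty result only means that none of the keywords were found; it does not replace reading the report.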

Important

Submission to EGA requires providing the sex, phenotype and subject_id information for each sample. These are collected from the LabID annotations named Sex (or Gender), Phenotype and Individual, respectively. When your samples are not annotated with these annotations in LabID, the EGA export procedure adds them with default values (unknown sex, not reported, and not reported, respectively), but these defaults will not be visible in the exported MAGE-TAB. If you have sensible values for these annotations, please make sure to annotate your samples with Sex (or Gender), Phenotype and Individual in LabID.
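
To make the fallback rule concrete, here is a small illustration of the mapping described above. It is not LabID's implementation: the function name ega_sample_fields and the exact default strings (in particular "unknown" for sex) are our reading of the text and should be treated as assumptions.

```python
# Default values as we read them from the description above (assumption).
EGA_DEFAULTS = {
    "sex": "unknown",
    "phenotype": "not reported",
    "subject_id": "not reported",
}


def ega_sample_fields(annotations):
    """Map LabID-style annotations to EGA sample fields, noting applied defaults."""
    found = {
        # Sex takes precedence; Gender is accepted as an alternative name.
        "sex": annotations.get("Sex") or annotations.get("Gender"),
        "phenotype": annotations.get("Phenotype"),
        "subject_id": annotations.get("Individual"),
    }
    fields = {key: value or EGA_DEFAULTS[key] for key, value in found.items()}
    defaulted = [key for key, value in found.items() if not value]
    return fields, defaulted
```

The second return value lists the fields that would silently receive a default, which is exactly the information that is not visible in the exported MAGE-TAB.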

Step 2: Submit the DAC and Policy to EGA

As mentioned above, EGA submission requires submitting a DAC and Policy before one can declare the datasets to be shared. When using the programmatic submission, this must be done using the WEBIN Portal or via XML submission. Here we describe the second option (we did not test the first option as of writing).

First, manually assemble the dac.xml using the example provided in the EGA documentation, and save it in the same directory as the other files, e.g. /home/username/ega_submission/dac.xml.

Tip

The DAC can be reused for several studies/datasets, for example if your project is large and contains several studies. For more information on DACs, please consult the EGA documentation.

The XML file should look like this (make sure to replace the relevant values with your own):

<?xml version="1.0" encoding="UTF-8"?>
<DAC_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="EGA.dac.xsd">
    <DAC alias="DAC_123456" center_name="EMBL" broker_name="EGA">
        <TITLE>DAC for my-wonderful-project-name</TITLE>
        <CONTACTS>      
            <CONTACT name="Lastname, Firstname" email="me@somewhere.com" telephone_number="XXX" organisation="EMBL" main_contact="true"/>
        </CONTACTS>
    </DAC>
</DAC_SET>
where

  • DAC alias is the DAC name you want to use (this should be unique),
  • center_name is the name of your center as associated with your Webin account,
  • TITLE should contain a name for this DAC,
  • CONTACTS lists all the members of this DAC (each as a CONTACT exposing the name, email, telephone number and organisation). The main_contact attribute should be set to true for the main contact person. At least one contact must be provided; if only one is provided, it must be the main contact.
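
If you prefer to generate dac.xml rather than edit it by hand, the structure above can be produced with Python's standard library. This is a minimal sketch that mirrors the example document; the function name build_dac_xml and the dictionary layout for contacts are hypothetical, and requiring exactly one main contact is our interpretation of the rule above.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"


def build_dac_xml(alias, center_name, title, contacts):
    """Return the text of a dac.xml document shaped like the example above."""
    main = [c for c in contacts if c.get("main_contact")]
    if len(main) != 1:
        # Our reading of the rule: one contact must be flagged as main.
        raise ValueError("exactly one contact must be the main contact")
    ET.register_namespace("xsi", XSI)
    root = ET.Element(
        "DAC_SET", {f"{{{XSI}}}noNamespaceSchemaLocation": "EGA.dac.xsd"}
    )
    dac = ET.SubElement(
        root, "DAC", alias=alias, center_name=center_name, broker_name="EGA"
    )
    ET.SubElement(dac, "TITLE").text = title
    contacts_el = ET.SubElement(dac, "CONTACTS")
    for contact in contacts:
        attrs = {
            "name": contact["name"],
            "email": contact["email"],
            "telephone_number": contact["telephone_number"],
            "organisation": contact["organisation"],
        }
        if contact.get("main_contact"):
            attrs["main_contact"] = "true"
        ET.SubElement(contacts_el, "CONTACT", attrs)
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body + "\n"
```

Using an XML library instead of string concatenation means that characters such as "&" in names or titles are escaped automatically.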

Then submit this dac.xml file to EGA using the following command:

# submit the dac.xml, giving the login credentials directly in the curl command
# we assume your username is ega-box-XXXX and your password is 123456 (same as in the ega.yml file)
# the server response is saved as dac_receipt.xml
curl -u ega-box-XXXX:123456 -F "ACTION=ADD" "https://www.ebi.ac.uk/ena/submit/drop-box/submit/" -F "DAC=@dac.xml" -o dac_receipt.xml
Upon successful submission, the resulting dac_receipt.xml gives you the DAC accession (e.g. EGAC000010035NN) that you will need to include in the policy.xml file.
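
The accession can be read from the receipt by eye, or extracted with a few lines of Python. The sketch below assumes the usual receipt layout, i.e. a RECEIPT root element with a success attribute, one child element per created object carrying an accession attribute, and ERROR elements for failures; check this against your own receipt. The function name accession_from_receipt is hypothetical.

```python
import xml.etree.ElementTree as ET


def accession_from_receipt(receipt_xml, tag):
    """Return the accession of the first <tag> element in a receipt document."""
    root = ET.fromstring(receipt_xml)
    if root.get("success") != "true":
        # Collect any error messages so the failure is easy to diagnose.
        errors = [el.text for el in root.iter("ERROR")]
        raise RuntimeError(f"submission failed: {errors}")
    element = root.find(tag)
    if element is None or not element.get("accession"):
        raise LookupError(f"no {tag} accession found in receipt")
    return element.get("accession")
```

For the DAC this would be called as accession_from_receipt(open("dac_receipt.xml", "rb").read(), "DAC"); the same helper should work for the policy receipt with the tag "POLICY", under the same assumptions.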

Second, assemble policy.xml using the example provided in the EGA documentation, and save it in the same directory as the other files, e.g. /home/username/ega_submission/policy.xml.

The XML file should look like this (make sure to replace the relevant values with your own):

<?xml version="1.0" encoding="UTF-8"?>
<POLICY_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="EGA.policy.xsd">
    <POLICY alias="DAC_EGAC000010035NN_policy" center_name="EMBL" broker_name="EGA">
        <TITLE>Policy for my-wonderful-project-name</TITLE>
        <DAC_REF accession="EGAC000010035NN" refcenter="ENA"/>
        <POLICY_TEXT>
[Here comes the policy text approved by your legal department.]
        </POLICY_TEXT>
    </POLICY>
</POLICY_SET>
where:

  • POLICY alias is the policy name you want to use (this should be unique),
  • center_name is the name of your center as associated with your Webin account,
  • TITLE should contain a name for this policy,
  • DAC_REF should contain the DAC accession you received in the dac_receipt.xml file e.g. EGAC000010035NN (this is the DAC you created in the previous step),
  • POLICY_TEXT should contain the text of your data access policy.

Please note that there are many ways to write this policy.xml. The example above is a simple one, and you should seek support from your legal department to write a proper policy text.
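
Policy texts written by a legal department often contain characters such as "&" or "<", which must be escaped to keep policy.xml well-formed. The following sketch fills the example document shown above and escapes the inserted values; the function name build_policy_xml is hypothetical, and the template simply copies the attributes of the example (including refcenter="ENA").

```python
from xml.sax.saxutils import escape, quoteattr

# Template copied from the example policy.xml above.
POLICY_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<POLICY_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="EGA.policy.xsd">
    <POLICY alias={alias} center_name={center_name} broker_name="EGA">
        <TITLE>{title}</TITLE>
        <DAC_REF accession={dac_accession} refcenter="ENA"/>
        <POLICY_TEXT>
{policy_text}
        </POLICY_TEXT>
    </POLICY>
</POLICY_SET>
"""


def build_policy_xml(alias, center_name, title, dac_accession, policy_text):
    """Fill the policy template, escaping text and quoting attribute values."""
    return POLICY_TEMPLATE.format(
        alias=quoteattr(alias),  # quoteattr adds the surrounding quotes
        center_name=quoteattr(center_name),
        title=escape(title),
        dac_accession=quoteattr(dac_accession),
        policy_text=escape(policy_text.strip()),
    )
```

The returned string can be written to /home/username/ega_submission/policy.xml with UTF-8 encoding.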

Then submit this policy.xml file to EGA using the following command:

# submit the policy.xml, giving the login credentials directly in the curl command
# we assume your username is ega-box-XXXX and your password is 123456 (same as in the ega.yml file)
# the server response is saved as policy_receipt.xml
curl -u ega-box-XXXX:123456 -F "ACTION=ADD" "https://www.ebi.ac.uk/ena/submit/drop-box/submit/" -F "POLICY=@policy.xml" -o policy_receipt.xml

Upon successful submission, the resulting policy_receipt.xml gives you the POLICY accession (e.g. EGAP000010036NN) that you will need when submitting the datasets to EGA (final step of the procedure).

Step 3: Encrypt FastQ files

First, export links to the fastq files to a directory of your choice ("output_dir"):

$ labid export links -s <study_id> -t <output_dir>

where:

  • <study_id> is the UUID of the study you want to export (i.e., the same as in step 1),
  • <output_dir> is the directory where you want to create the symbolic links.

Tip

If you followed our advice to create a folder for your submission, e.g. /home/username/ega_submission, the <output_dir> would be /home/username/ega_submission/fastq/.
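
Since the export creates symbolic links rather than copies, it can be worth checking that every link resolves to an existing file before encrypting. A minimal sketch (the helper name find_broken_links is hypothetical):

```python
from pathlib import Path


def find_broken_links(directory):
    """Return the names of symbolic links in `directory` whose target is missing."""
    broken = []
    for path in Path(directory).iterdir():
        # exists() follows the link, so it is False when the target is gone.
        if path.is_symlink() and not path.exists():
            broken.append(path.name)
    return sorted(broken)
```

An empty list for /home/username/ega_submission/fastq/ means all links point to readable targets.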

Second, you need to encrypt your files using EGA-cryptor (EGA-cryptor encryption is required for Webin/FTP submission). All the fastq files exported to the same directory (e.g., /home/username/ega_submission/fastq/) can be encrypted with a single command:

java -jar ega-cryptor-2.0.0.jar -t 4 -i <indir> -o <outdir> 
where:

  • <indir> is the directory where the symbolic links have been exported e.g. /home/username/ega_submission/fastq/,
  • <outdir> is the directory where you want to save the encrypted files e.g. /home/username/ega_submission/fastq-encrypted/
  • -t 4 is the number of threads to use for the encryption process. You can adjust this number based on your system's capabilities.
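
After encryption, a quick completeness check can save a failed upload later. The sketch below assumes that EGA-cryptor writes, for each input file, an encrypted file (.gpg) plus two checksum files (.md5 and .gpg.md5) named after the input; verify this against the actual content of your <outdir>, and adjust the suffixes if your version behaves differently. The helper name is hypothetical.

```python
from pathlib import Path


def missing_encrypted_outputs(indir, outdir, suffixes=(".gpg", ".md5", ".gpg.md5")):
    """List the expected output files that are absent from `outdir`."""
    outdir = Path(outdir)
    missing = []
    for source in sorted(Path(indir).iterdir()):
        if not source.is_file():  # follows symbolic links; skips sub-directories
            continue
        for suffix in suffixes:
            expected = outdir / (source.name + suffix)
            if not expected.is_file():
                missing.append(expected.name)
    return missing
```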

Step 4: Upload encrypted FastQ files to EGA

This requires you to have a validated Legacy Submission Account (ega-box-XXXX) with EGA.

Upload the encrypted fastq files (.gpg files with their .md5 files) to EGA by FTP using your ega-box-XXXX credentials. Here we use the lftp FTP client (note that at EMBL, one can use module load lftp on any server). As the FTP upload can take a long time, we advise using a screen session (or equivalent).

 # go to the directory where your encrypted files are located e.g. /home/username/ega_submission/fastq-encrypted/
 $ cd <encrypted_dir>
 # start a screen session
 $ screen -S ega-upload 
 # connect to the EGA FTP server with your legacy submission account (ega-box-XXXX)
 $ lftp ftp.ega.ebi.ac.uk -u ega-box-XXXX
 # -> enter password at the prompt 
 # upload all md5 files
 $ mput *md5
 # upload all gpg files
 $ mput *gpg
 # exit
 $ bye
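
The same upload can be scripted, for instance when many studies have to be submitted. The sketch below uses Python's standard ftplib and mirrors the lftp session above (checksum files first, then the encrypted files). It has not been tested against the EGA server, and the function name upload_encrypted is hypothetical; the function only needs an object offering ftplib's storbinary method.

```python
from pathlib import Path


def upload_encrypted(ftp, directory):
    """Upload *.md5 and then *.gpg files from `directory` over a logged-in FTP client."""
    directory = Path(directory)
    uploaded = []
    for pattern in ("*.md5", "*.gpg"):
        for path in sorted(directory.glob(pattern)):
            with path.open("rb") as handle:
                # STOR stores the file under its own name in the remote directory.
                ftp.storbinary(f"STOR {path.name}", handle)
            uploaded.append(path.name)
    return uploaded


# Possible usage (assumption: plain FTP login as in the lftp session above):
#   from ftplib import FTP
#   with FTP("ftp.ega.ebi.ac.uk") as ftp:
#       ftp.login("ega-box-XXXX", password)
#       upload_encrypted(ftp, "/home/username/ega_submission/fastq-encrypted/")
```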

Step 5: Export the study as ENA tables (with EGA profile)

We first export tables suitable for the ena-upload-cli tool. These simple tables contain essentially the same information as the MAGE-TAB document. The tables are exported to a directory of your choice ("output_dir"):

$ labid export study --format enatables --ega --study-type <study_type> -s <study_id> --encrypted-dir <encr_dir> -o <output_dir> 

where:

  • --ega is used to specify that the study is for EGA submission,
  • <study_type> is the type of study you want to publish (e.g., RNA-seq, ChIP-seq, etc.), see help for options (labid export study --help),
  • <study_id> is the UUID of the study you want to export (i.e., the same as in step 1),
  • <encr_dir> is the directory where the encrypted fastq files are located (e.g. /home/username/ega_submission/fastq-encrypted/),
  • <output_dir> is the directory where you want to export the tables.

Tip

If you followed our advice to create a folder for your submission, e.g. /home/username/ega_submission, the <output_dir> would be /home/username/ega_submission/enatables/.

After successful export, four exported tables should be found in the <output_dir> directory, and named ena_study_<study_id>.txt, ena_sample_<study_id>.txt, ena_experiment_<study_id>.txt, and ena_run_<study_id>.txt.
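
A short check that all four tables are present can be scripted as follows. The file names come from the naming pattern given above; the helper name missing_ena_tables is hypothetical.

```python
from pathlib import Path


def missing_ena_tables(output_dir, study_id):
    """Return the expected table file names that are not present in `output_dir`."""
    expected = [
        f"ena_{kind}_{study_id}.txt"
        for kind in ("study", "sample", "experiment", "run")
    ]
    return [name for name in expected if not (Path(output_dir) / name).is_file()]
```

An empty list means the export produced every expected table.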

Step 6: Upload the EGA tables as XML documents to EGA

The following LabID command is used to:

  1. call the ena-upload-cli tool with the tables exported in the previous step. The ena-upload-cli tool will generate the XML documents and upload them to ENA.
  2. register back the accession numbers found in the response (ENA Study, ENA Dataset & samples, as well as BioSamples accessions) in LabID. This action only occurs when the XML submission is successful.
$ labid export ena --ega -f /path/to/ega.yml -c <YOUR_CENTER> -s <study_id> -d <output_dir> --no-data-upload --execute

where:

  • --ega is used to specify that this is an EGA submission (submitting to ENA is the default),
  • /path/to/ega.yml is the path to the file containing your EGA credentials (see above),
  • <YOUR_CENTER> is the name of your center as associated with your Webin account,
  • <study_id> is the UUID of the study you want to publish,
  • <output_dir> is the directory where the ENA tables are located.
  • --no-data-upload is used to skip the data upload step, as this has already been done in a previous step.
  • --execute is used to execute the command and submit the data to EGA for real; you may remove this option to perform a submission to the ENA test server. We highly advise validating the submission against the test server before submitting to the production server.

Important

The command expects the ena-upload-cli tool to be in your $PATH. If it is not, or if you want to use a specific version of the tool, you can specify the path to the tool using the --ena-upload-cli option.

Upon successful submission, a receipt_<timestamp>.xml file lists all created objects and their accessions. You will need this receipt to submit the dataset.
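
To get a quick overview of a receipt, for example after a run against the test server, the objects it lists can be counted per type. As a sketch only: we assume a RECEIPT root element with a success attribute, direct child elements carrying an accession attribute for each created object, and ERROR elements holding failure messages. The function name summarise_receipt is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarise_receipt(receipt_xml):
    """Summarise a receipt: success flag, object counts per type, error messages."""
    root = ET.fromstring(receipt_xml)
    # Count only direct children that carry an accession (e.g. STUDY, SAMPLE, RUN).
    counts = Counter(el.tag for el in root if el.get("accession"))
    errors = [el.text for el in root.iter("ERROR")]
    return {
        "success": root.get("success") == "true",
        "objects": dict(counts),
        "errors": errors,
    }
```

Comparing the counts with the number of rows in the exported tables is a simple way to confirm that nothing was dropped.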

Step 7: Submit the dataset(s) to EGA

LabID allows you to submit datasets to EGA using the export egadataset command. The command creates the dataset.xml by turning each submitted run into a dataset, and submits it to EGA.

labid export egadataset -p EGAP000010036NN -c <YOUR_CENTER> -x /path/to/ega.yml -s <study_id> -r <receipt> -f dataset.xml --execute

where:

  • EGAP000010036NN is the policy accession you received in the policy_receipt.xml file,
  • <YOUR_CENTER> is the name of your center as associated with your account,
  • /path/to/ega.yml is the path to the file containing your EGA credentials (see above),
  • <study_id> is the UUID of the study you want to publish,
  • <receipt> is the path to the receipt_<timestamp>.xml file you received after the submission of the study (previous step),
  • --execute is used to execute the command and submit the data to EGA for real; you may remove this option to perform a submission to the ENA test server. We highly advise validating the submission against the test server before submitting to the production server.
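
When scripting the procedure, the test-first approach can be made explicit by building the command once and adding --execute only for the real submission. The sketch below uses only the options shown in the command above; the function name egadataset_command is hypothetical.

```python
def egadataset_command(policy, center, credentials, study_id, receipt, execute=False):
    """Build the `labid export egadataset` argument list; test server unless execute=True."""
    command = [
        "labid", "export", "egadataset",
        "-p", policy,
        "-c", center,
        "-x", credentials,
        "-s", study_id,
        "-r", receipt,
        "-f", "dataset.xml",
    ]
    if execute:
        # Only the real submission carries --execute.
        command.append("--execute")
    return command
```

The list can be passed to subprocess.run(command, check=True): first with execute=False to validate against the test server, then with execute=True once the test submission succeeds.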