
Workflow File Types and Data Organization

LabID uses a semantic file type system to organize and classify files within workflows. This system enables intelligent workflow processing, proper RO-Crate packaging, and clear organization of workflow components.

File Type Categories

Workflow File Types

As defined in workflows/utils/constants.py, LabID supports the following file types:

MAIN - Primary Workflow Definition

  • Purpose: The primary workflow definition file that serves as the entry point
  • Requirements:
    • Only one MAIN file allowed per WorkflowVersion
    • Required for publishing to WorkflowHub
    • Must represent the executable workflow entry point
  • Examples: Snakefile, main.nf, workflow.cwl, workflow.wdl
  • Validation: System enforces single MAIN file constraint
# From the codebase - only one MAIN file allowed
if self.data_type == MAIN:
    if self._state.adding and self.workflowversion.workflowfiles.filter(data_type=MAIN).exists():
        raise ValueError(_("There can only be one main file."))
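
For readers outside the Django codebase, the same constraint can be sketched in plain Python; check_single_main is an illustrative helper, not part of LabID:

```python
# Plain-Python sketch of the single-MAIN rule above; check_single_main is an
# illustrative helper, not part of the LabID codebase.
def check_single_main(existing_types, new_type):
    """Reject adding a second MAIN file to a WorkflowVersion."""
    if new_type == "MAIN" and "MAIN" in existing_types:
        raise ValueError("There can only be one main file.")
    return new_type
```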

CONFIG - Configuration Files

  • Purpose: Configuration files and settings
  • Usage: Runtime configuration, environment settings
  • Examples: config.yaml, nextflow.config
  • Dataset Association: Can also be linked to WorkflowRuns as configuration datasets

PARAMETER - Parameter Files

  • Purpose: Parameter definitions and input specifications
  • Usage: Input parameters, variable definitions, parameter schemas
  • Examples: parameters.yml, params.json, params.txt
  • Dataset Association: Can be linked to WorkflowRuns as parameter datasets

README - Documentation

  • Purpose: Primary documentation and usage instructions
  • Auto-detection: Automatically detected during import
  • Examples: README.md, README.txt, README.rst
  • Standards: Supports various documentation formats

LICENSE - License Information

  • Purpose: License files and legal information
  • Auto-detection: Automatically detected during import
  • Examples: LICENSE, LICENSE.txt, COPYING
  • Compliance: Important for workflow sharing and publication

TEST - Test Files

  • Purpose: Test files, validation scripts, and examples
  • Usage: Unit tests, integration tests, example runs
  • Examples: test_workflow.py, test_data/, examples/
  • Organization: Often organized in tests/ or examples/ directories

DAG_PNG - Workflow Diagrams

  • Purpose: Visual representations of workflow structure
  • Usage: Workflow diagrams, dependency graphs
  • Examples: dag.png, workflow_diagram.svg
  • Generation: Often auto-generated by workflow engines

OTHER - Miscellaneous Files

  • Purpose: Files that don't fit other categories
  • Usage: Default type for unclassified files
  • Examples: Scripts, utilities, auxiliary files
  • Flexibility: Catch-all category for diverse file types

Workflow Run Specific File Types

INPUT - Input datasets

  • Purpose: Workflow execution input files, excluding configuration and parameter files
  • Usage: Input data on which the workflow operates
  • Examples: sample.fastq, image.tiff
  • Mandatory: At least one input dataset is required for a valid WorkflowRun
  • Dataset Association: Can be linked to WorkflowRuns as input datasets

CONFIG - Configuration Files

  • Purpose: Configuration files and settings
  • Usage: Runtime configuration, environment settings
  • Examples: config.yaml, nextflow.config
  • Dataset Association: Can be linked to WorkflowRuns as configuration datasets

PARAMETER - Parameter Files

  • Purpose: Parameter definitions and input specifications
  • Usage: Input parameters, variable definitions, parameter schemas
  • Examples: parameters.yml, params.json, params.txt
  • Dataset Association: Can be linked to WorkflowRuns as parameter datasets

OUTPUT - Output Files

  • Purpose: Workflow execution output files, excluding reports and logs
  • Usage: Output data generated by the workflow execution
  • Examples: sample.bam, count_table.csv, images.ome.zarr
  • Mandatory: At least one output dataset is required for a valid WorkflowRun
  • Dataset Association: Can be linked to WorkflowRuns as output datasets

LOG - Log Files

  • Purpose: Execution logs and runtime information
  • Usage: Workflow execution logs, debug information
  • Examples: workflow.log, execution.log
  • Dataset Association: Can be linked to WorkflowRuns as log datasets

REPORT - Report Files

  • Purpose: Analysis reports and summaries
  • Usage: Generated reports, analysis summaries
  • Examples: report.html, summary.pdf
  • Dataset Association: Can be linked to WorkflowRuns as report datasets
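
The mandatory-dataset rules above (at least one INPUT and one OUTPUT per run) can be sketched as a simple check; the helper below is illustrative, not LabID code, and assumes type strings matching the categories listed:

```python
# Sketch of the run-completeness rule: a valid WorkflowRun needs at least one
# INPUT and at least one OUTPUT dataset. Illustrative helper, not LabID code.
def validate_workflowrun_datasets(dataset_types):
    """dataset_types: iterable of association-type strings for one run."""
    types = set(dataset_types)
    if "INPUT" not in types:
        raise ValueError("At least one input dataset is required.")
    if "OUTPUT" not in types:
        raise ValueError("At least one output dataset is required.")
    return types
```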

File Type Detection and Assignment

Automatic Detection

During workflow import, LabID automatically detects and suggests file types:

def get_workflow_type_and_files(repository, commit_hash):
    """Detect workflow type and suggest file types based on repository content"""
    # Automatically detects:
    # - README files (README.md, README.txt, etc.)
    # - LICENSE files (LICENSE, COPYING, etc.)
    # - Workflow-specific patterns (Snakefile, main.nf, etc.)

Workflow-Specific Detection

The system includes detection logic for different workflow managers:

  • Snakemake: Detects Snakefile, *.smk files
  • Nextflow: Detects main.nf, nextflow.config
  • CWL: Detects *.cwl files
  • WDL: Detects *.wdl files
  • Galaxy: Detects *.ga files
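
The pattern table above can be sketched as a small filename matcher; the dictionary and function names here are illustrative and may differ from the actual detection code:

```python
import fnmatch

# Illustrative filename patterns mirroring the detection table above; the
# actual detection logic in LabID may use different structures.
WORKFLOW_PATTERNS = {
    "snakemake": ["Snakefile", "*.smk"],
    "nextflow": ["main.nf", "nextflow.config"],
    "cwl": ["*.cwl"],
    "wdl": ["*.wdl"],
    "galaxy": ["*.ga"],
}

def detect_workflow_type(filenames):
    """Return the first workflow manager whose patterns match any filename."""
    for manager, patterns in WORKFLOW_PATTERNS.items():
        for pattern in patterns:
            if any(fnmatch.fnmatch(name, pattern) for name in filenames):
                return manager
    return None
```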

Manual Assignment

Users can manually assign or change file types through the web interface:

  • File types can be changed regardless of WorkflowVersion commit status
  • Type changes are immediately reflected in the workflow organization
  • Validation ensures MAIN file constraints are maintained

Dataset Association Types

When linking datasets to WorkflowRuns, these types are used:

DATASET_TO_WORKFLOWRUN_TYPES = (
    (INPUT, INPUT_LABEL),      # Input datasets
    (OUTPUT, OUTPUT_LABEL),    # Output datasets  
    (CONFIG, CONFIG_LABEL),    # Configuration datasets
    (REPORT, REPORT_LABEL),    # Report datasets
    (LOG, LOG_LABEL),          # Log datasets
    (PARAMETER, PARAMETER_LABEL), # Parameter datasets
)

Type Categories for Dataset Association

  • INPUT_TYPES: (INPUT, CONFIG, PARAMETER) - Data flowing into workflows
  • OUTPUT_TYPES: (OUTPUT, REPORT, LOG) - Data produced by workflows
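
The two categories can be sketched as tuples mirroring the grouping above; the literal string values are assumptions, as workflows/utils/constants.py may store different values:

```python
# Illustrative constants mirroring the categories above; the actual values in
# workflows/utils/constants.py may differ.
INPUT, OUTPUT, CONFIG, REPORT, LOG, PARAMETER = (
    "INPUT", "OUTPUT", "CONFIG", "REPORT", "LOG", "PARAMETER"
)
INPUT_TYPES = (INPUT, CONFIG, PARAMETER)   # data flowing into workflows
OUTPUT_TYPES = (OUTPUT, REPORT, LOG)       # data produced by workflows

def is_input_association(data_type):
    """True if the dataset flows into the workflow run."""
    return data_type in INPUT_TYPES
```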

Publishing and Standards Compliance

WorkflowHub Publishing Requirements

For publishing to WorkflowHub:

  • MAIN file is mandatory - Must have exactly one MAIN file
  • RO-Crate compliance - File types map to RO-Crate metadata
  • Metadata preservation - File types become part of workflow metadata

RO-Crate Mapping

File types are mapped to RO-Crate standards:

  • MAIN → Primary workflow entity in RO-Crate
  • CONFIG/PARAMETER → Configuration entities
  • TEST → Test entities
  • README/LICENSE → Documentation entities
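
As a sketch of the MAIN mapping: the Workflow RO-Crate profile types the primary workflow entity as File, SoftwareSourceCode, and ComputationalWorkflow. The helper below is illustrative, not LabID's packaging code:

```python
def main_workflow_entity(file_path):
    """Build a minimal RO-Crate entity for the MAIN workflow file (sketch).

    The @type values follow the Workflow RO-Crate profile; everything else
    here is illustrative rather than LabID's actual packaging code.
    """
    return {
        "@id": file_path,
        "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
    }
```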

Best Practices

File Organization

  1. Single MAIN File: Always designate exactly one primary workflow file
  2. Logical Grouping: Use appropriate types to group related files
  3. Documentation: Include README and LICENSE files for shared workflows
  4. Testing: Include TEST files for workflow validation

Type Assignment Strategy

  1. Start with Auto-detection: Let the system suggest initial types
  2. Review and Refine: Manually adjust types as needed
  3. Maintain Consistency: Use consistent typing across workflow versions
  4. Consider Publishing: Ensure MAIN file is properly designated for sharing

Version Management

  1. Type Flexibility: File types can be changed between versions
  2. Path Consistency: Same file paths can have different types across versions
  3. Validation: System validates type constraints during operations

API and Integration

File Type Validation

# Valid file types from constants
WORKFLOWFILE_DATA_TYPES_CHOICES = [
    (MAIN, "Main"),
    (CONFIG, "Config"), 
    (LOG, "Log"),
    (PARAMETER, "Parameter"),
    (INPUT, "Input"),
    (TEST, "Test"),
    (OUTPUT, "Output"),
    (OTHER, "Other"),
    (DAG_PNG, "DAG PNG"),
    (README, "README"),
    (LICENSE, "License"),
]
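
A serializer-style check against these choices might look like the following sketch; validate_data_type is an illustrative helper, and the error message mirrors the one listed under Troubleshooting:

```python
# Sketch of API-side type validation against the choices tuple above;
# validate_data_type is an illustrative helper, not part of the LabID codebase.
WORKFLOWFILE_DATA_TYPES_CHOICES = [
    ("MAIN", "Main"), ("CONFIG", "Config"), ("LOG", "Log"),
    ("PARAMETER", "Parameter"), ("INPUT", "Input"), ("TEST", "Test"),
    ("OUTPUT", "Output"), ("OTHER", "Other"), ("DAG_PNG", "DAG PNG"),
    ("README", "README"), ("LICENSE", "License"),
]
VALID_DATA_TYPES = {value for value, _label in WORKFLOWFILE_DATA_TYPES_CHOICES}

def validate_data_type(data_type):
    """Reject file types that are not in the choices list."""
    if data_type not in VALID_DATA_TYPES:
        raise ValueError(f"'{data_type}' is not a valid argument")
    return data_type
```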

API Usage

File types are specified in API calls:

{
  "data_type": {"value": "MAIN"},
  "file_path": "Snakefile",
  "name": "Main Workflow"
}

CLI Integration

When registering workflows via CLI, file types can be specified:

{
  "data_type": "OUTPUT",
  "name": "Analysis Results",
  "datafiles": [...]
}

Troubleshooting

Common Issues

  1. Multiple MAIN Files: System prevents multiple MAIN files per version
  2. Missing MAIN File: Publishing requires exactly one MAIN file
  3. Type Validation: Invalid types are rejected by the API
  4. Empty Files: Empty files are not allowed for upload

Error Messages

  • "There can only be one main file." - Attempting to add multiple MAIN files
  • "'INVALID_TYPE' is not a valid argument" - Using non-existent file type
  • "Empty files are not allowed." - Attempting to upload empty files

Resolution Strategies

  1. Review File Types: Ensure proper type assignment before publishing
  2. Check Constraints: Verify MAIN file requirements are met
  3. Validate Content: Ensure files have content before upload
  4. Use Auto-detection: Leverage automatic type detection for initial setup