CoFest 2024
Summarize a workflow description for potential security risks, for a data-visiting app
Prerequisites
cwltool, to pack a workflow and its related sources into a single file
running on my MacBook Air (M1, 16GB memory)
yllm, a tiny command-line OpenAI API client
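Packing bundles a multi-file CWL workflow and everything it imports into one self-contained document. A minimal sketch; the entry-point filename `download-fastq.cwl` is an assumption inferred from the packed file used below:

```shell
# cwltool --pack resolves relative paths and inlines sub-workflows
# (e.g. download-sra.cwl, fasterq-dump.cwl) into a single document,
# which is what gets fed to the LLM as one prompt below.
cwltool --pack download-fastq.cwl > download-fastq.packed.cwl
```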
code:bash
# Install ollama and run, takes a bit to install a 5GB model
ollama run llama3
pulling manifest
pulling 6a0746a1ec1a... 100% ▕████████████████████████▏ 4.7 GB
pulling 4fa551d4f938... 100% ▕████████████████████████▏ 12 KB
pulling 8ab4849b038c... 100% ▕████████████████████████▏ 254 B
pulling 577073ffcc6c... 100% ▕████████████████████████▏ 110 B
pulling 3f8eb4da87fa... 100% ▕████████████████████████▏ 485 B
verifying sha256 digest
writing manifest
removing any unused layers
success
>>>
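Besides the interactive prompt, Ollama serves an OpenAI-compatible endpoint at `http://localhost:11434/v1/chat/completions` (the same URL passed to yllm throughout these notes). A quick smoke test before wiring up yllm:

```shell
# Send a minimal chat request to the local Ollama server;
# a JSON chat-completion response indicates the model is being served.
curl -s http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "Say hello."}]}'
```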
code:bash
code:output1
This is a CWL (Common Workflow Language) workflow definition file. It describes a workflow that downloads sequence data from the Sequence Read Archive (SRA) and dumps it to FASTQ files.
Here's an overview of the workflow:
**Workflow Name**: Download sequence data from SRA and dump to FASTQ file
**Inputs**:
* nthreads: Optional, default=4. Number of threads to use by fasterq-dump.
* repo: Optional, default="ddbj". Target repository (ncbi or ebi).
* run_ids: List of SRA Run IDs.
**Outputs**:
* fastq_files: An array of FASTQ files generated from the input SRA data.
**Steps**:
1. **Download SRA**: This step runs a separate CWL workflow (#download-sra.cwl) that downloads the SRA data based on the input repo and run_ids. The output is an array of SRA files.
2. **Fasterq-dump**: This step runs another CWL workflow (#fasterq-dump.cwl) that takes the SRA files from the previous step, uses the specified number of threads (nthreads), and dumps them to FASTQ files.
**Metadata**:
* cwlVersion: The version of the CWL specification used in this workflow (v1.0).
* license: The license under which the workflow is released (Apache-2.0).
* codeRepository: The GitHub repository where the workflow code can be found.
* author: The person who created the workflow, with their ORCID ID and email address.
**Schema information**:
code:bash
$ echo -e "Below is a workflow description used in bioinformatics. List the potential risks to the data and computational environment caused by running this workflow.\n\n$(cat ./download-fastq.packed.cwl)" | yllm --api-url http://localhost:11434/v1/chat/completions --model llama3 --stdin
code:output2
Based on the workflow description, potential risks include:
1. **Data integrity**: The workflow involves downloading sequence data from the Sequence Read Archive (SRA) and dumping it to FASTQ files. There is a risk of data corruption or loss during this process.
2. **Authentication issues**: The workflow requires authentication with the SRA repository. If the authentication fails, the workflow may not be able to access the required data, leading to errors or incomplete results.
3. **Network connectivity problems**: The workflow relies on network connectivity to download data from the SRA and dump it to FASTQ files. Disruptions in network connectivity can cause delays or failures in the workflow.
4. **Computational resource limitations**: The workflow uses computational resources (e.g., threads) that may not be sufficient for large datasets, leading to slow performance or errors.
5. **Software dependencies**: The workflow relies on specific software tools (e.g., fasterq-dump) and libraries. If these dependencies are not installed or updated correctly, the workflow may fail or produce incorrect results.
6. **Data quality issues**: The workflow assumes that the input data is of high quality and well-formatted. Poorly formatted or low-quality data can lead to errors or incorrect results in downstream analyses.
7. **Repository maintenance**: The workflow relies on the SRA repository being maintained and updated regularly. If the repository is not properly maintained, the workflow may not be able to access the required data.
8. **Version control issues**: The workflow uses specific versions of software tools and libraries. If these versions are not correctly managed or updated, the workflow may fail or produce incorrect results.
These risks can be mitigated by implementing proper error handling, testing, and validation procedures in the workflow, as well as ensuring that all dependencies are properly installed and configured.
use cwltool to generate a job configuration file and run the workflow to get the outputs
code:bash
$ cwltool --make-template download-fastq.packed.cwl | sed -e 's:a_string:SRR1274307:' > job.yml
INFO /Users/inutano/repos/miniforge3/bin/cwltool 3.1.20240708091337
INFO Resolved 'download-fastq.packed.cwl' to 'file:///Users/inutano/tmp/cofest2024/download-fastq.packed.cwl'
WARNING Workflow checker warning:
download-fastq.packed.cwl:482:21: Source 'repo' of type "null", "string" may be incompatible
download-fastq.packed.cwl:518:25:   with sink 'repo' of type "string"
$ cwltool --debug download-fastq.packed.cwl job.yml
use Sapporo (a WES implementation) to run the workflow
code:bash
$ python3 -m pip install sapporo
$ sapporo
INFO: === Initializing app state... ====
INFO: Service info: id='sapporo-service' name='sapporo-service' type=ServiceType(group='sapporo-wes', artifact='wes', version='sapporo-wes-2.0.0') description='The instance of the Sapporo-WES.' organization=Organization(name='Sapporo-WES Project Team', url=Url('https://github.com/orgs/sapporo-wes/people')) contactUrl=Url('https://github.com/sapporo-wes/sapporo-service/issues') documentationUrl=Url('https://github.com/sapporo-wes/sapporo-service/blob/main/README.md') createdAt='2024-07-17T18:17:26Z' updatedAt='2024-07-17T18:17:26Z' environment=None version='2.0.2' workflow_type_versions={'CWL': WorkflowTypeVersion(workflow_type_version='v1.0', 'v1.1', 'v1.2'), 'WDL': WorkflowTypeVersion(workflow_type_version='1.0'), 'NFL': WorkflowTypeVersion(workflow_type_version='1.0', 'DSL2'), 'SMK': WorkflowTypeVersion(workflow_type_version='1.0'), 'StreamFlow': WorkflowTypeVersion(workflow_type_version='v1.0')} supported_wes_versions='1.1.0', 'sapporo-wes-2.0.0' supported_filesystem_protocols='file', 'http', 'https' workflow_engine_versions={'cwltool': WorkflowEngineVersion(workflow_engine_version='3.1.20240508115724'), 'nextflow': WorkflowEngineVersion(workflow_engine_version='22.04.4'), 'toil (experimental)': WorkflowEngineVersion(workflow_engine_version='4.1.0'), 'cromwell': WorkflowEngineVersion(workflow_engine_version='87'), 'snakemake': WorkflowEngineVersion(workflow_engine_version='v8.15.2'), 'ep3 (experimental)': WorkflowEngineVersion(workflow_engine_version='v1.7.0'), 'streamflow': WorkflowEngineVersion(workflow_engine_version='0.1.3')} default_workflow_engine_parameters={'nextflow': DefaultWorkflowEngineParameter(name='-dsl1', type='str', default_value=''), 'snakemake': DefaultWorkflowEngineParameter(name='--cores', type='int', default_value='1'), DefaultWorkflowEngineParameter(name='--use-conda', type='str', default_value='')} system_state_counts={} auth_instructions_url=Url('https://github.com/sapporo-wes/sapporo-service/blob/main/README.md#authentication') tags={} 
INFO: Executable workflows: workflows=[]
INFO: === App state initialized. ===
INFO: Started server process 23078 INFO: Waiting for application startup.
INFO: DB snapshot scheduler started.
INFO: Application startup complete.
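The service-info document shown next can be fetched from the standard GA4GH WES endpoint. A sketch; the port 1122 is an assumption (Sapporo's usual default), so adjust it to your instance:

```shell
# GET /service-info returns the WES capability document:
# supported workflow types, engines, and their versions (JSON below).
curl -s -H 'accept: application/json' http://localhost:1122/service-info
```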
code:bash
{
  "id": "sapporo-service",
  "name": "sapporo-service",
  "type": {
    "group": "sapporo-wes",
    "artifact": "wes",
    "version": "sapporo-wes-2.0.0"
  },
  "description": "The instance of the Sapporo-WES.",
  "organization": {
    "name": "Sapporo-WES Project Team",
    "url": "https://github.com/orgs/sapporo-wes/people"
  },
  "createdAt": "2024-07-17T18:17:26Z",
  "updatedAt": "2024-07-17T18:17:26Z",
  "environment": null,
  "version": "2.0.2",
  "workflow_type_versions": {
    "CWL": {
      "workflow_type_version": ["v1.0", "v1.1", "v1.2"]
    },
    "WDL": {
      "workflow_type_version": ["1.0"]
    },
    "NFL": {
      "workflow_type_version": ["1.0", "DSL2"]
    },
    "SMK": {
      "workflow_type_version": ["1.0"]
    },
    "StreamFlow": {
      "workflow_type_version": ["v1.0"]
    }
  },
  "supported_wes_versions": ["1.1.0", "sapporo-wes-2.0.0"],
  "supported_filesystem_protocols": ["file", "http", "https"],
  "workflow_engine_versions": {
    "cwltool": {
      "workflow_engine_version": ["3.1.20240508115724"]
    },
    "nextflow": {
      "workflow_engine_version": ["22.04.4"]
    },
    "toil (experimental)": {
      "workflow_engine_version": ["4.1.0"]
    },
    "cromwell": {
      "workflow_engine_version": ["87"]
    },
    "snakemake": {
      "workflow_engine_version": ["v8.15.2"]
    },
    "ep3 (experimental)": {
      "workflow_engine_version": ["v1.7.0"]
    },
    "streamflow": {
      "workflow_engine_version": ["0.1.3"]
    }
  },
  "default_workflow_engine_parameters": {
    "nextflow": [
      {
        "name": "-dsl1",
        "type": "str",
        "default_value": ""
      }
    ],
    "snakemake": [
      {
        "name": "--cores",
        "type": "int",
        "default_value": "1"
      },
      {
        "name": "--use-conda",
        "type": "str",
        "default_value": ""
      }
    ]
  },
  "system_state_counts": {
    "UNKNOWN": 0,
    "QUEUED": 0,
    "INITIALIZING": 0,
    "RUNNING": 0,
    "PAUSED": 0,
    "COMPLETE": 0,
    "EXECUTOR_ERROR": 0,
    "SYSTEM_ERROR": 0,
    "CANCELED": 0,
    "CANCELING": 0,
    "PREEMPTED": 0,
    "DELETED": 0,
    "DELETING": 0
  },
  "tags": {}
}
$ curl -X 'POST' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'workflow_engine=cwltool' \
-F 'workflow_attachment=@download-fastq.packed.cwl' \
-F 'workflow_params={"run_ids":"SRR1274307","repo":"ddbj","nthreads":4}' \
-F 'workflow_url=./download-fastq.packed.cwl' \
-F 'workflow_type=CWL'
{"run_id":"b564ac45-9299-449a-bdec-598bae51fd7a"}
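With the returned run_id, the run can be followed through the standard WES endpoints. A sketch; the host and port 1122 are assumptions (Sapporo's usual default):

```shell
RUN_ID=b564ac45-9299-449a-bdec-598bae51fd7a
# Poll the run state (QUEUED / RUNNING / COMPLETE / ...).
curl -s http://localhost:1122/runs/${RUN_ID}/status
# Fetch the full run log: request parameters, task logs, and outputs.
curl -s http://localhost:1122/runs/${RUN_ID}
```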
$ echo -e "Below is the provenance information of a workflow execution. Please describe what the workflow has done.\n\n$(cat ./ro-crate-metadata.json)" | yllm --api-url http://localhost:11434/v1/chat/completions --model llama3 --stdin
code:text
A workflow execution log!
From this log, we can see that the workflow has executed two steps:
**Step 1: download_sra**
* The step started and ran a Docker container with image ghcr.io/inutano/download-sra:a827d85.
* The command inside the container was download-sra -r ddbj SRR1274307, which likely downloaded a sequence read (SRA) file from the DDBJ database.
* The step completed successfully, using 12 MiB of memory.
**Step 2: fasterq_dump**
* The step started and ran another Docker container with image ncbi/sra-tools:3.0.0.
* The command inside the container was fasterq-dump /var/lib/cwl/stg0c84bfcb-9196-4d1e-ac8d-3d2941b157c8/SRR1274307.sra --threads 4 --skip-technical --split-files, which likely converted the downloaded SRA file to FASTQ format.
* The step completed successfully, reporting statistics on reads read and written.
**Workflow completion**
* The workflow completed successfully after both steps finished.
* The final process status is success.
code:bash
$ echo -e "Below is a workflow description.\n\n$(cat ./download-fastq.packed.cwl)\n\nHere is the provenance of its execution.\n\n$(cat ./ro-crate-metadata.json)\n\nPlease describe what the workflow execution did." | yllm --api-url http://localhost:11434/v1/chat/completions --model llama3 --stdin
code:text
Based on the provided workflow description, it appears that the workflow executed a series of steps to download and process a sequence read archive (SRA) file. Here's a breakdown of the steps:
1. The workflow started by resolving a CWL (Common Workflow Language) file named download-fastq.packed.cwl.
2. It then executed a step called download_sra, which downloaded an SRA file named SRR1274307 from the NCBI Sequence Read Archive (SRA) using the download-sra tool.
3. The workflow encountered a warning about the compatibility of the source and sink types in the CWL file, but it proceeded with the execution.
4. After completing the download step, the workflow executed another step called fasterq_dump, which extracted the FASTQ files from the downloaded SRA file using the fasterq-dump tool.
5. The fasterq_dump step reported some statistics about the number of reads and spots processed.
The final status of the workflow execution was "success", indicating that all steps completed without errors.
structured prompt
code:bash
$ echo -e "INSTRUCTION:\n\nWith the given workflow description and provenance of the workflow run, please describe the summary of the workflow including its objective, potential risks to the user's data and computational environment.\n\nWORKFLOW DESCRIPTION:\n\n$(cat ./download-fastq.packed.cwl)\n\nWORKFLOW PROVENANCE:\n\n$(cat ./ro-crate-metadata.json)\n\nSUMMARY OF THE WORKFLOW:" | yllm --api-url http://localhost:11434/v1/chat/completions --model llama3 --stdin
code:output
Based on the provided workflow description and provenance, I can summarize the Sapporo stderr as follows:
The Sapporo stderr refers to the standard error output of a workflow run. In this case, the workflow is a CWL (Common Workflow Language) workflow that consists of two steps: download_sra and fasterq_dump.
The stderr output contains various messages, including:
1. Informational messages:
* Resolved the CWL file to its local path.
* Started and completed each step (download_sra and fasterq_dump) successfully.
2. Warning message:
* A workflow checker warning about potential incompatibility between the source and sink types in the download-fastq.packed.cwl file.
3. Docker-related messages:
* Running a Docker container to execute each step (download_sra and fasterq_dump).
* Specifying various options for the Docker run, such as mounting volumes, setting environment variables, and specifying the user ID.
The stderr output also includes some statistics about the fasterq_dump step, including the number of reads read and written. Overall, the Sapporo stderr provides a record of the workflow's execution, including any errors or warnings that may have occurred during the run.
it seems the model only read the stderr log, not the workflow description itself
same with the 2nd attempt:
code:bash
$ echo -e "Here is a workflow description. Please describe the potential risks from running this workflow.\n\n$(cat ./download-fastq.packed.cwl)" | yllm --api-url http://localhost:11434/v1/chat/completions --model llama3 --stdin
code:output
Based on the workflow description, potential risks from running this workflow include:
1. **Data integrity**: The workflow downloads sequence data from the Sequence Read Archive (SRA) and dumps it to FASTQ files. If there are issues with the SRA data or the dumping process, the resulting FASTQ files may be corrupted or incomplete.
2. **Performance**: The workflow uses multiple threads by default, which can improve performance. However, if the number of threads is not optimized for the specific system or dataset, it could lead to decreased performance or even crashes.
3. **Resource usage**: The workflow requires significant computational resources (CPU and memory) to process large datasets. If the system is not adequately provisioned, it may lead to resource exhaustion, slowing down or crashing the workflow.
4. **Data security**: The workflow downloads data from a public repository (SRA). While this is a trusted source, there is still a risk of data tampering or corruption during transmission or storage.
5. **Dependence on external services**: The workflow relies on the SRA and FASTQ dumping tools to function correctly. If these services experience downtime or issues, the workflow may not be able to complete successfully.
6. **Lack of error handling**: The workflow does not appear to have built-in error handling mechanisms. If an error occurs during execution, it may not be handled properly, leading to unexpected behavior or crashes.
7. **Inadequate documentation**: While the workflow has some documentation, it is limited and may not provide enough information for users to understand how to use the workflow effectively or troubleshoot issues that arise.
To mitigate these risks, it would be essential to:
1. Thoroughly test the workflow on a small-scale dataset before running it on larger datasets.
2. Monitor system resources and adjust settings as needed to prevent resource exhaustion.
3. Implement error handling mechanisms to catch and handle errors gracefully.
4. Provide more comprehensive documentation for users, including instructions on how to use the workflow effectively and troubleshoot common issues.
5. Consider using alternative data sources or backup plans in case of service outages or data corruption.
This gives a list of potential risks of running a bioinformatics workflow on an on-premises server
Todo
List up the
Optimize prompts
specify output format
Create a "risky" workflow
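For the "specify output format" item, one option is to pin a machine-readable format in the instruction itself. A hedged sketch; the JSON schema here is illustrative and the model is not guaranteed to follow it:

```shell
# Constrain the answer to a fixed JSON shape so it can be parsed downstream.
format_spec='Respond ONLY with JSON of the form {"risks": [{"category": "...", "severity": "low|medium|high", "description": "..."}]}.'
prompt="INSTRUCTION: List the potential risks of running the workflow below. ${format_spec}

WORKFLOW DESCRIPTION:

$(cat ./download-fastq.packed.cwl)"
echo "$prompt" | yllm --api-url http://localhost:11434/v1/chat/completions --model llama3 --stdin
```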