Rule descriptions
This is an automatically generated list of all supported rules, their docstrings, command, and virtual environment. At the start of each workflow run a list is printed of which rules will be run. And while the workflow is running it prints which rules are being started and finished. This page is here to give an explanation to the user about what each rule does, and for developers to find what is, and isn’t yet supported.
all
This contains most files needed for endpoints to be reached.
symlink_fastqs
Creates a symbolic link to the fastq files, so files can easily be accessed.
remove_adapters
Trims common sequencing adapters (using adapters found in data/adapters.fa) from the fastq reads.
Also performs quality-based trimming at the start and end of reads, and reads shorter than minlen
after quality-based trimming are removed entirely.
mkdir -p {params.proj}.noadpt/{wildcards.sample}
trimmomatic PE -threads {threads} \
-trimlog {params.proj}.noadpt/{wildcards.sample}/{wildcards.sample}.trimlog \
-summary {params.proj}.noadpt/{wildcards.sample}/{wildcards.sample}.summary \
-validatePairs {input.FORWARD} {input.REVERSE} \
-baseout {params.proj}.noadpt/{wildcards.sample}/{wildcards.sample}.trimmed \
ILLUMINACLIP:{params.adpt}:2:30:10 SLIDINGWINDOW:4:20 LEADING:{params.leading} TRAILING:{params.trailing} MINLEN:{params.minlen} \
-phred33 {params.extra}
mv {params.proj}.noadpt/{wildcards.sample}/{wildcards.sample}.trimmed_1P {params.proj}.noadpt/{wildcards.sample}/{wildcards.sample}.trimmed.R1.fq
mv {params.proj}.noadpt/{wildcards.sample}/{wildcards.sample}.trimmed_2P {params.proj}.noadpt/{wildcards.sample}/{wildcards.sample}.trimmed.R2.fq
echo "completed adapter trimming of {wildcards.sample}"
fastQC_pass1
Runs FastQC to check the quality of reads in each sample.
mkdir -p {params.proj}.noadpt.fastqc
fastqc {input} -o {params.proj}.noadpt.fastqc {params.extra}
multiqc_pass1
Runs MultiQC to aggregate FastQC reports into one HTML file.
multiqc {params.proj}.noadpt.fastqc -o {params.proj}.noadpt.fastqc/multiqc_report {params.extra}
trim_forward
This is a secondary trimming step using SeqTK to trim/truncate any set number of base pairs from the start/end of each forward read.
mkdir -p {params.trim_trunc_path}
seqtk trimfq -b {params.trim} -e {params.trunc} {params.extra} {input} > {output}
trim_reverse
This is a secondary trimming step using SeqTK to trim/truncate any set number of base pairs from the start/end of each reverse read.
mkdir -p {params.trim_trunc_path}
seqtk trimfq -b {params.trim} -e {params.trunc} {params.extra} {input} > {output}
fastQC_pass2
Runs FastQC to check the quality of reads in each sample.
mkdir -p {params.trim_trunc_path}.fastqc
fastqc {input} -o {params.trim_trunc_path}.fastqc {params.extra}
multiqc_pass2
Runs MultiQC to aggregate FastQC reports into one HTML file.
multiqc {params.trim_trunc_path}.fastqc -o {params.trim_trunc_path}.fastqc/multiqc_report {params.extra}
download_hostile_db
This rule downloads the host index to be used for Hostile’s host read filtering.
if [[ "{params.hostile_db_name}" == "human-t2t-hla-argos985" ]]; then
mkdir -p {params.db_parent_path}
cd {params.db_parent_path}
wget -O {params.hostile_db_name}.tar https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o/human-t2t-hla-argos985.tar
tar -xvf {params.hostile_db_name}.tar
rm {params.hostile_db_name}.tar
elif [[ "{params.hostile_db_name}" == "human-t2t-hla" ]]; then
mkdir -p {params.db_parent_path}
cd {params.db_parent_path}
wget -O {params.hostile_db_name}.tar https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o/human-t2t-hla.tar
tar -xvf {params.hostile_db_name}.tar
rm {params.hostile_db_name}.tar
fi
host_filter
This removes host reads using the tool Hostile.
hostile clean \
--fastq1 {input.FWD} --fastq2 {input.REV} \
-o {params.trim_trunc_path}.nonhost \
--threads {threads} \
--index {params.hostile_db_path} \
--debug \
--aligner bowtie2 \
{params.extra} > {output.REPORT}
# cleanup filepaths
mv {params.trim_trunc_path}.nonhost/{wildcards.sample}.R1.clean_1.fastq.gz {output.FWD}
mv {params.trim_trunc_path}.nonhost/{wildcards.sample}.R2.clean_2.fastq.gz {output.REV}
install_R_packages
This installs R packages needed
Rscript {params.installation_script}
setup_metaphlan
This rule installs the metaphlan database. If the user specifies “latest” or nothing as metaphlan_index_name in the config, it will just download the default latest metaphlan database. Otherwise, users can specify specific versions. The database will be downloaded in the directory specified by metaphlan_bowtie_db in the config file.
mkdir -p {output.loc}
if [ "{params.index_name}" = "latest" ]; then
metaphlan --install --nproc {threads} --bowtie2db {output.loc} {params.extra}
else
metaphlan --install --nproc {threads} --bowtie2db {output.loc} --index {params.index_name} {params.extra}
fi
# Option to do it manually if --install doesn't seem to work
# cd {output}
# Can specify whatever version you want here
# wget http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/mpa_vOct22_CHOCOPhlAnSGB_202403.tar
# tar -xvf mpa_vOct22_CHOCOPhlAnSGB_202403.tar
# rm mpa_vOct22_CHOCOPhlAnSGB_202403.tar
get_biobakery_chocophlan_db
This rule installs the ChocoPhlAn database used for HUMAnN.
mkdir -p {output}
humann_databases --download chocophlan full {output} --update-config yes {params.extra}
get_biobakery_uniref_db
This rule installs the UniRef database used for HUMAnN.
mkdir -p {output}
humann_databases --download uniref uniref90_diamond {output} --update-config yes {params.extra}
get_utility_mapping_db
This rule installs the utility mapping database used for HUMAnN. This database is used to group/rename UniRef protein families to other versions, such as Enzyme Commission numbers, KEGG Orthologies, etc.
mkdir -p {output}
humann_databases --download utility_mapping full {output} --update-config yes {params.extra}
concat_nonhost_reads
This rule installs concatenates forward and reverse reads into the same file for HUMAnN.
mkdir -p {params.dirpath}
cat {input.FWD} {input.REV} > {output}
run_humann_nonhost
This rule runs the HUMAnN pipeline, including MetaPhlAn.
mkdir -p {params.dirpath}
if [ "{params.metaphlan_index}" = "latest" ]; then
# read index name form the latest file
metaphlan_db_name="$(<{params.metaphlan_bowtie_db}/mpa_latest)"
humann -i {input.NONHUMAN_READS} -o {params.dirpath}/{wildcards.sample} \
--threads {threads} --search-mode uniref90 \
--metaphlan-options "--bowtie2db {params.metaphlan_bowtie_db}" \
{params.extra}
else
humann -i {input.NONHUMAN_READS} -o {params.dirpath}/{wildcards.sample} \
--threads {threads} --search-mode uniref90 \
--metaphlan-options "--bowtie2db {params.metaphlan_bowtie_db} -x {params.metaphlan_index}" \
{params.extra}
fi
aggregate_humann_outs_nonhost
This rule aggregates the HUMAnN and MetaPhlAn outputs across all samples into tsv files with all samples. It also groups protein families and renames them based on Metacyc RXNs, KOs, KEGG Modules, KEGG Pathways, GO terms, EGGNOG, and PFAMs.
humann_join_tables -i {params.dirpath} -o {output.PATHABUND} --file_name pathabundance.tsv --search-subdirectories
humann_join_tables -i {params.dirpath} -o {output.PATHCOV} --file_name pathcoverage.tsv --search-subdirectories
humann_join_tables -i {params.dirpath} -o {output.GENEFAMS} --file_name genefamilies.tsv --search-subdirectories
# Metacyc RXN renaming
humann_regroup_table -i {output.GENEFAMS} -g uniref90_rxn -o {output.GENEFAMS_GROUPED_RXN}
humann_rename_table -i {output.GENEFAMS_GROUPED_RXN} -n metacyc-rxn -o {output.GENEFAMS_GROUPED_NAMED_RXN}
# KO renaming
humann_regroup_table -i {output.GENEFAMS} -g uniref90_ko -o {output.GENEFAMS_GROUPED_KO}
humann_rename_table -i {output.GENEFAMS_GROUPED_KO} -n kegg-orthology -o {output.GENEFAMS_GROUPED_NAMED_KO}
humann_rename_table -i {output.GENEFAMS_GROUPED_KO} -n kegg-pathway -o {output.GENEFAMS_GROUPED_NAMED_KP}
humann_rename_table -i {output.GENEFAMS_GROUPED_KO} -n kegg-module -o {output.GENEFAMS_GROUPED_NAMED_KM}
# EGGNOG renaming
humann_regroup_table -i {output.GENEFAMS} -g uniref90_eggnog -o {output.GENEFAMS_GROUPED_EN}
# GO renaming
humann_regroup_table -i {output.GENEFAMS} -g uniref90_go -o {output.GENEFAMS_GROUPED_GO}
humann_rename_table -i {output.GENEFAMS_GROUPED_GO} -n go -o {output.GENEFAMS_GROUPED_NAMED_GO}
# PFAM renaming
humann_regroup_table -i {output.GENEFAMS} -g uniref90_pfam -o {output.GENEFAMS_GROUPED_PFAM}
humann_rename_table -i {output.GENEFAMS_GROUPED_PFAM} -n pfam -o {output.GENEFAMS_GROUPED_NAMED_PFAM}
{params.agg_bugslists} -i {params.dirpath} -o {output.BUGSLIST}
{params.mphlan_conv} -i {params.dirpath}
taxa_barplot
This rule runs a .Rmd file that creates microshaded taxa barplots using MetaPhlan taxonomic profiles.
Rscript \
-e "rmarkdown::render('{params.rmd_path}', output_dir='{params.output_dir}', params=list(bugslist='{params.bugslist}', metadata='{params.metadata}', directory='{params.output_dir}'))"
func_barplot_rxn
This rule runs a .Rmd file that creates microshaded functional profile barplots using HUMAnN functional profiles. Specifically, it uses Enzyme Commission numbers from the Metacyc RXN number grouped protein families to show the hierarchical breakdown of enzyme classes in your sample.
Rscript \
-e "rmarkdown::render('{params.rmd_path}', output_dir='{params.output_dir}', params=list(genetable='{params.gene_table_rxn}', metadata='{params.metadata}', directory='{params.output_dir}'))"
calc_gut_metabolic_modules
This rule runs a .Rmd file that groups the KOs from HUMAnN into gut metabolic modules with biologically relevant functions. Abundances are summed within modules and then written to a .csv file.
Rscript \
-e "rmarkdown::render('{params.rmd_path}', output_dir='{params.output_dir}', params=list(input_file='{params.gene_table_ko}', output_file='{params.gmm_output}'))"
get_kraken_db
This rule downloads the Kraken2 database for taxonomic classification. It also downloads the kmer distribution files used for bracken’s abundance estimation. More databases can be found here: https://benlangmead.github.io/aws-indexes/k2
# Setup directory
mkdir -p {params.database_dir}
cd {params.database_dir}
# Pull and extract prebuilt index
wget https://genome-idx.s3.amazonaws.com/kraken/k2_core_nt_20240904.tar.gz
tar -xzvf k2_core_nt_20240904.tar.gz
rm k2_core_nt_20240904.tar.gz
run_kraken
This rule runs Kraken2 for each sample.
mkdir -p {params.out_dir}
# This has to all be on one line due to the way kraken2 parses it...
kraken2 --gzip-compressed --paired --db {params.database} --threads {threads} --output {output.OUTFILE} --report {output.REPORT} --classified-out {params.out_dir}/{wildcards.sample}_classified#.fq --unclassified-out {params.out_dir}/{wildcards.sample}_unclassified#.fq {params.extra} {input.FWD} {input.REV}
run_bracken
This rule runs Bracken to estimate taxonomic abundance for each sample.
bracken -d {params.database} -i {input.REPORT} -o {output.REPORT} -r 150 -l S -t 10 {params.extra}
aggregate_bracken
This rule aggregates Bracken outputs across samples.
cd {params.dirpath}
# TODO: should probably put this in its own rule, but this works for now
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -xzvf taxdump.tar.gz
( ls *.bracken ) > bracken-sample-name-map.tsv
bit-combine-bracken-and-add-lineage -i bracken-sample-name-map.tsv -o Combined-taxonomy.tsv -d .
rm *.dmp
rm readme.txt
rm taxdump.tar.gz
taxa_barplot_kraken
This rule runs a .Rmd file that creates microshaded taxa barplots using Kraken/Bracken taxonomic profiles.
Rscript \
-e "rmarkdown::render('{params.rmd_path}', output_dir='{params.output_dir}', params=list(bugslist='{params.bugslist}', metadata='{params.metadata}', directory='{params.output_dir}', host_taxon_id='{params.host_taxon_id}'))"
nonpareil
This rule estimates coverage of the nonhost reads based on kmer redundancy.
nonpareil_curves
This rule runs a .Rmd file to visualize coverage of the nonhost fraction across samples.
Rscript \
-e "rmarkdown::render('{params.rmd_path}', output_dir='{params.output_dir}', params=list(npo_path='{params.output_dir}', metadata='{params.metadata}'))"
pull_host_genome
This rule downloads the host (default human GRCh38) reference genome and annotation.
mkdir -p {params.ref_dir}
cd {params.ref_dir}
# genome
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_analysis_set.fna.gz
gunzip GCA_000001405.15_GRCh38_full_analysis_set.fna.gz
mv GCA_000001405.15_GRCh38_full_analysis_set.fna GRCh38_full_analysis_set.fna
# annotation
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gtf.gz
gunzip GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gtf.gz
mv GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gtf GRCh38_full_analysis_set.refseq.gtf
build_host_genome_index
This rule builds an index for mapping the host genome.
mkdir -p {params.ref_dir}
if [ "{params.method}" == "BBMap" ]; then
bbmap.sh ref={input} path={params.ref_dir} threads={threads} -Xmx{resources.mem_mb}m {params.extra}
elif [ "{params.method}" == "HISAT2" ]; then
mkdir -p {output}/
hisat2-build {input} {output}/index -p {threads} {params.extra}
fi
map_host
This rule maps reads to the host genome and compresses SAM files to BAM files.
cd {params.out_dir}
if [ "{params.method}" == "BBMap" ]; then
bbmap.sh in=../{input.FWD} in2=../{input.REV} \
out={wildcards.sample}.sam \
trimreaddescriptions=t \
threads={threads} \
-Xmx{resources.mem_mb}m \
{params.extra}
elif [ "{params.method}" == "HISAT2" ]; then
hisat2 -1 ../{input.FWD} -2 ../{input.REV} \
-S {wildcards.sample}.sam \
-x ref/index \
-p {threads} \
{params.extra}
fi
bash {params.sam2bam_path} {wildcards.sample}.sam
validate_bams
This rule checks if BAM files are valid. It doesn’t stop the pipeline if they aren’t but you can check these files manually.
samtools flagstat --threads {threads} {input.BAM} > {output.BAM_VALID}
generate_feature_counts
This rule uses featureCounts to count host transcripts.
if [ -z {input.BAM} ]
# if there is no input (no host transcriptome mapping), just touch the files
then
touch {output.COUNTS}
touch {output.SUMMARY}
else
featureCounts -T {threads} -p --countReadPairs \
-t exon -g gene_id -a {input.ANNOTATION} -o {output.COUNTS} {params.extra} {input.BAM}
fi
feature_counts_to_tpm
This rule converts featureCounts output to transcripts per million (TPM).
{params.converter_script} {input.COUNTS} --output {output.TPM}
reads_breakdown
This rule aggregates read counts at each step, including raw reads, post-QC reads, host reads, non-host reads, and non-host reads that were unmapped. It outputs a CSV report of this with data per sample.
{params.reporter_script} {input.METADATA} \
--raw_reads_dir {params.proj} \
--hostile_dir {params.hostile} \
--genefams_filepath {input.GENEFAMS} \
--output {output.REPORT}