Rule descriptions

This is an automatically generated list of all supported rules, their docstrings, command, and virtual environment. At the start of each workflow run a list is printed of which rules will be run. And while the workflow is running it prints which rules are being started and finished. This page is here to give an explanation to the user about what each rule does, and for developers to find what is, and isn’t yet supported.

all

This contains most files needed for endpoints to be reached.

remove_adapters

Trims common sequencing adapters (using adapters found in data/adapters.fa) from the fastq reads. Also performs quality-based trimming at the start and end of reads, and reads shorter than minlen after quality-based trimming are removed entirely.

mkdir -p {params.proj}.noadpt/{wildcards.sample}
trimmomatic PE -threads {threads} \
    -trimlog {params.proj}.noadpt/{wildcards.sample}/{wildcards.sample}.trimlog \
    -summary {params.proj}.noadpt/{wildcards.sample}/{wildcards.sample}.summary \
    -validatePairs {input.FORWARD} {input.REVERSE} \
    -baseout {params.proj}.noadpt/{wildcards.sample}/{wildcards.sample}.trimmed \
    ILLUMINACLIP:{params.adpt}:2:30:10 SLIDINGWINDOW:4:20 LEADING:{params.leading} TRAILING:{params.trailing} MINLEN:{params.minlen} \
    -phred33 {params.extra}

mv {params.proj}.noadpt/{wildcards.sample}/{wildcards.sample}.trimmed_1P {params.proj}.noadpt/{wildcards.sample}/{wildcards.sample}.trimmed.R1.fq
mv {params.proj}.noadpt/{wildcards.sample}/{wildcards.sample}.trimmed_2P {params.proj}.noadpt/{wildcards.sample}/{wildcards.sample}.trimmed.R2.fq

echo "completed adapter trimming of {wildcards.sample}"

remove_adapters environment.

fastQC_pass1

Runs FastQC to check the quality of reads in each sample.

mkdir -p {params.proj}.noadpt.fastqc
fastqc {input} -o {params.proj}.noadpt.fastqc {params.extra}

fastQC_pass1 environment.

multiqc_pass1

Runs MultiQC to aggregate FastQC reports into one HTML file.

multiqc {params.proj}.noadpt.fastqc -o {params.proj}.noadpt.fastqc/multiqc_report {params.extra}

multiqc_pass1 environment.

trim_forward

This is a secondary trimming step using SeqTK to trim/truncate any set number of base pairs from the start/end of each forward read.

mkdir -p {params.trim_trunc_path}
seqtk trimfq -b {params.trim} -e {params.trunc} {params.extra} {input} > {output} 

trim_forward environment.

trim_reverse

This is a secondary trimming step using SeqTK to trim/truncate any set number of base pairs from the start/end of each reverse read.

mkdir -p {params.trim_trunc_path}
seqtk trimfq -b {params.trim} -e {params.trunc} {params.extra} {input} > {output}

trim_reverse environment.

fastQC_pass2

Runs FastQC to check the quality of reads in each sample.

mkdir -p {params.trim_trunc_path}.fastqc
fastqc {input} -o {params.trim_trunc_path}.fastqc {params.extra}

fastQC_pass2 environment.

multiqc_pass2

Runs MultiQC to aggregate FastQC reports into one HTML file.

multiqc {params.trim_trunc_path}.fastqc -o {params.trim_trunc_path}.fastqc/multiqc_report {params.extra}

multiqc_pass2 environment.

download_hostile_db

This rule downloads the host index to be used for Hostile’s host read filtering.

if [[ "{params.hostile_db_name}" == "human-t2t-hla-argos985" ]]; then
  mkdir -p {params.db_parent_path}
  cd {params.db_parent_path}
  wget -O {params.hostile_db_name}.tar https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o/human-t2t-hla-argos985.tar
  tar -xvf {params.hostile_db_name}.tar
  rm {params.hostile_db_name}.tar
elif [[ "{params.hostile_db_name}" == "human-t2t-hla" ]]; then
  mkdir -p {params.db_parent_path}
  cd {params.db_parent_path}
  wget -O {params.hostile_db_name}.tar https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o/human-t2t-hla.tar
  tar -xvf {params.hostile_db_name}.tar
  rm {params.hostile_db_name}.tar
fi

host_filter

This removes host reads using the tool Hostile.

hostile clean \
--fastq1 {input.FWD} --fastq2 {input.REV} \
-o {params.trim_trunc_path}.nonhost \
--threads {threads} \
--index {params.hostile_db_path} \
--debug \
--aligner bowtie2 \
{params.extra} > {output.REPORT}

# cleanup filepaths
mv {params.trim_trunc_path}.nonhost/{wildcards.sample}.R1.clean_1.fastq.gz {output.FWD}
mv {params.trim_trunc_path}.nonhost/{wildcards.sample}.R2.clean_2.fastq.gz {output.REV}

host_filter environment.

install_R_packages

This installs R packages needed

Rscript {params.installation_script}

install_R_packages environment.

setup_metaphlan

This rule installs the metaphlan database. If the user specifies “latest” or nothing as metaphlan_index_name in the config, it will just download the default latest metaphlan database. Otherwise, users can specify specific versions. The database will be downloaded in the directory specified by metaphlan_bowtie_db in the config file.

mkdir -p {output.loc}

if [ "{params.index_name}" = "latest" ]; then
  metaphlan --install --nproc {threads} --bowtie2db {output.loc} {params.extra}

else
  metaphlan --install --nproc {threads} --bowtie2db {output.loc} --index {params.index_name} {params.extra}

fi

# Option to do it manually if --install doesn't seem to work
# cd {output}
# Can specify whatever version you want here
# wget http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/mpa_vOct22_CHOCOPhlAnSGB_202403.tar
# tar -xvf mpa_vOct22_CHOCOPhlAnSGB_202403.tar
# rm mpa_vOct22_CHOCOPhlAnSGB_202403.tar

setup_metaphlan environment.

get_biobakery_chocophlan_db

This rule installs the ChocoPhlAn database used for HUMAnN.

mkdir -p {output}
humann_databases --download chocophlan full {output} --update-config yes {params.extra}

get_biobakery_chocophlan_db environment.

get_biobakery_uniref_db

This rule installs the UniRef database used for HUMAnN.

mkdir -p {output}
humann_databases --download uniref uniref90_diamond {output} --update-config yes {params.extra}

get_biobakery_uniref_db environment.

get_utility_mapping_db

This rule installs the utility mapping database used for HUMAnN. This database is used to group/rename UniRef protein families to other versions, such as Enzyme Commission numbers, KEGG Orthologies, etc.

mkdir -p {output}
humann_databases --download utility_mapping full {output} --update-config yes {params.extra}

get_utility_mapping_db environment.

concat_nonhost_reads

This rule installs concatenates forward and reverse reads into the same file for HUMAnN.

mkdir -p {params.dirpath}
cat {input.FWD} {input.REV} > {output}

run_humann_nonhost

This rule runs the HUMAnN pipeline, including MetaPhlAn.

mkdir -p {params.dirpath}

if [ "{params.metaphlan_index}" = "latest" ]; then
  # read index name form the latest file
  metaphlan_db_name="$(<{params.metaphlan_bowtie_db}/mpa_latest)"
  
  humann -i {input.NONHUMAN_READS} -o {params.dirpath}/{wildcards.sample} \
  --threads {threads} --search-mode uniref90 \
  --metaphlan-options "--bowtie2db {params.metaphlan_bowtie_db}" \
  {params.extra}

else
  humann -i {input.NONHUMAN_READS} -o {params.dirpath}/{wildcards.sample} \
  --threads {threads} --search-mode uniref90 \
  --metaphlan-options "--bowtie2db {params.metaphlan_bowtie_db} -x {params.metaphlan_index}" \
  {params.extra}

fi

run_humann_nonhost environment.

aggregate_humann_outs_nonhost

This rule aggregates the HUMAnN and MetaPhlAn outputs across all samples into tsv files with all samples. It also groups protein families and renames them based on Metacyc RXNs, KOs, KEGG Modules, KEGG Pathways, GO terms, EGGNOG, and PFAMs.

humann_join_tables -i {params.dirpath} -o {output.PATHABUND} --file_name pathabundance.tsv --search-subdirectories

humann_join_tables -i {params.dirpath} -o {output.PATHCOV} --file_name pathcoverage.tsv --search-subdirectories

humann_join_tables -i {params.dirpath} -o {output.GENEFAMS} --file_name genefamilies.tsv --search-subdirectories

# Metacyc RXN renaming
humann_regroup_table -i {output.GENEFAMS} -g uniref90_rxn -o {output.GENEFAMS_GROUPED_RXN}
humann_rename_table -i {output.GENEFAMS_GROUPED_RXN} -n metacyc-rxn -o {output.GENEFAMS_GROUPED_NAMED_RXN}

# KO renaming
humann_regroup_table -i {output.GENEFAMS} -g uniref90_ko -o {output.GENEFAMS_GROUPED_KO}
humann_rename_table -i {output.GENEFAMS_GROUPED_KO} -n kegg-orthology -o {output.GENEFAMS_GROUPED_NAMED_KO}
humann_rename_table -i {output.GENEFAMS_GROUPED_KO} -n kegg-pathway -o {output.GENEFAMS_GROUPED_NAMED_KP}
humann_rename_table -i {output.GENEFAMS_GROUPED_KO} -n kegg-module -o {output.GENEFAMS_GROUPED_NAMED_KM}

# EGGNOG renaming
humann_regroup_table -i {output.GENEFAMS} -g uniref90_eggnog -o {output.GENEFAMS_GROUPED_EN}

# GO renaming
humann_regroup_table -i {output.GENEFAMS} -g uniref90_go -o {output.GENEFAMS_GROUPED_GO}
humann_rename_table -i {output.GENEFAMS_GROUPED_GO} -n go -o {output.GENEFAMS_GROUPED_NAMED_GO}

# PFAM renaming
humann_regroup_table -i {output.GENEFAMS} -g uniref90_pfam -o {output.GENEFAMS_GROUPED_PFAM}
humann_rename_table -i {output.GENEFAMS_GROUPED_PFAM} -n pfam -o {output.GENEFAMS_GROUPED_NAMED_PFAM}

{params.agg_bugslists} -i {params.dirpath} -o {output.BUGSLIST}

{params.mphlan_conv} -i {params.dirpath} 

aggregate_humann_outs_nonhost environment.

taxa_barplot

This rule runs a .Rmd file that creates microshaded taxa barplots using MetaPhlan taxonomic profiles.

Rscript \
-e "rmarkdown::render('{params.rmd_path}', output_dir='{params.output_dir}', params=list(bugslist='{params.bugslist}', metadata='{params.metadata}', directory='{params.output_dir}'))"

taxa_barplot environment.

func_barplot_rxn

This rule runs a .Rmd file that creates microshaded functional profile barplots using HUMAnN functional profiles. Specifically, it uses Enzyme Commission numbers from the Metacyc RXN number grouped protein families to show the hierarchical breakdown of enzyme classes in your sample.

Rscript \
-e "rmarkdown::render('{params.rmd_path}', output_dir='{params.output_dir}', params=list(genetable='{params.gene_table_rxn}', metadata='{params.metadata}', directory='{params.output_dir}'))"

func_barplot_rxn environment.

calc_gut_metabolic_modules

This rule runs a .Rmd file that groups the KOs from HUMAnN into gut metabolic modules with biologically relevant functions. Abundances are summed within modules and then written to a .csv file.

Rscript \
-e "rmarkdown::render('{params.rmd_path}', output_dir='{params.output_dir}', params=list(input_file='{params.gene_table_ko}', output_file='{params.gmm_output}'))"

calc_gut_metabolic_modules environment.

get_kraken_db

This rule downloads the Kraken2 database for taxonomic classification. It also downloads the kmer distribution files used for bracken’s abundance estimation. More databases can be found here: https://benlangmead.github.io/aws-indexes/k2

# Setup directory
mkdir -p {params.database_dir}
cd {params.database_dir}

# Pull and extract prebuilt index
wget https://genome-idx.s3.amazonaws.com/kraken/k2_core_nt_20240904.tar.gz
tar -xzvf k2_core_nt_20240904.tar.gz

rm k2_core_nt_20240904.tar.gz

get_kraken_db environment.

run_kraken

This rule runs Kraken2 for each sample.

mkdir -p {params.out_dir}

# This has to all be on one line due to the way kraken2 parses it...
kraken2 --gzip-compressed --paired --db {params.database} --threads {threads} --output {output.OUTFILE} --report {output.REPORT} --classified-out {params.out_dir}/{wildcards.sample}_classified#.fq --unclassified-out {params.out_dir}/{wildcards.sample}_unclassified#.fq {params.extra} {input.FWD} {input.REV}

run_kraken environment.

run_bracken

This rule runs Bracken to estimate taxonomic abundance for each sample.

bracken -d {params.database} -i {input.REPORT} -o {output.REPORT} -r 150 -l S -t 10 {params.extra}

run_bracken environment.

aggregate_bracken

This rule aggregates Bracken outputs across samples.

cd {params.dirpath}

# TODO: should probably put this in its own rule, but this works for now
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -xzvf taxdump.tar.gz

( ls *.bracken ) > bracken-sample-name-map.tsv

bit-combine-bracken-and-add-lineage -i bracken-sample-name-map.tsv -o Combined-taxonomy.tsv -d .

rm *.dmp
rm readme.txt
rm taxdump.tar.gz

aggregate_bracken environment.

taxa_barplot_kraken

This rule runs a .Rmd file that creates microshaded taxa barplots using Kraken/Bracken taxonomic profiles.

Rscript \
-e "rmarkdown::render('{params.rmd_path}', output_dir='{params.output_dir}', params=list(bugslist='{params.bugslist}', metadata='{params.metadata}', directory='{params.output_dir}', host_taxon_id='{params.host_taxon_id}'))"

taxa_barplot_kraken environment.

nonpareil

This rule estimates coverage of the nonhost reads based on kmer redundancy.

nonpareil_curves

This rule runs a .Rmd file to visualize coverage of the nonhost fraction across samples.

Rscript \
-e "rmarkdown::render('{params.rmd_path}', output_dir='{params.output_dir}', params=list(npo_path='{params.output_dir}', metadata='{params.metadata}'))"

nonpareil_curves environment.

pull_host_genome

This rule downloads the host (default human GRCh38) reference genome and annotation.

mkdir -p {params.ref_dir}
cd {params.ref_dir}

# genome
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_analysis_set.fna.gz
gunzip GCA_000001405.15_GRCh38_full_analysis_set.fna.gz
mv GCA_000001405.15_GRCh38_full_analysis_set.fna GRCh38_full_analysis_set.fna

# annotation
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gtf.gz
gunzip GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gtf.gz
mv GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gtf GRCh38_full_analysis_set.refseq.gtf

build_host_genome_index

This rule builds an index for mapping the host genome.

mkdir -p {params.ref_dir}
if [ "{params.method}" == "BBMap" ]; then
    bbmap.sh ref={input} path={params.ref_dir} threads={threads} -Xmx{resources.mem_mb}m {params.extra}
elif [ "{params.method}" == "HISAT2" ]; then
    mkdir -p {output}/
    hisat2-build {input} {output}/index -p {threads} {params.extra}
fi

build_host_genome_index environment.

map_host

This rule maps reads to the host genome and compresses SAM files to BAM files.

cd {params.out_dir}

if [ "{params.method}" == "BBMap" ]; then
  bbmap.sh in=../{input.FWD} in2=../{input.REV} \
  out={wildcards.sample}.sam \
  trimreaddescriptions=t \
  threads={threads} \
  -Xmx{resources.mem_mb}m \
  {params.extra}

elif [ "{params.method}" == "HISAT2" ]; then
    hisat2 -1 ../{input.FWD} -2 ../{input.REV} \
    -S {wildcards.sample}.sam \
    -x ref/index \
    -p {threads} \
    {params.extra}
fi

bash {params.sam2bam_path} {wildcards.sample}.sam

map_host environment.

validate_bams

This rule checks if BAM files are valid. It doesn’t stop the pipeline if they aren’t but you can check these files manually.

samtools flagstat --threads {threads} {input.BAM} > {output.BAM_VALID}

validate_bams environment.

generate_feature_counts

This rule uses featureCounts to count host transcripts.

if [ -z {input.BAM} ]
# if there is no input (no host transcriptome mapping), just touch the files
then
  touch {output.COUNTS}
  touch {output.SUMMARY}
else
  featureCounts -T {threads} -p --countReadPairs \
  -t exon -g gene_id -a {input.ANNOTATION} -o {output.COUNTS} {params.extra} {input.BAM}
fi

generate_feature_counts environment.

feature_counts_to_tpm

This rule converts featureCounts output to transcripts per million (TPM).

{params.converter_script} {input.COUNTS} --output {output.TPM}

feature_counts_to_tpm environment.

reads_breakdown

This rule aggregates read counts at each step, including raw reads, post-QC reads, host reads, non-host reads, and non-host reads that were unmapped. It outputs a CSV report of this with data per sample.

{params.reporter_script} {input.METADATA} \
--raw_reads_dir {params.proj} \
--hostile_dir {params.hostile} \
--genefams_filepath {input.GENEFAMS} \
--output {output.REPORT}

reads_breakdown environment.