Rule descriptions

This is an automatically generated list of all supported rules, their docstrings, command, and virtual environment. At the start of each workflow run a list is printed of which rules will be run. And while the workflow is running it prints which rules are being started and finished. This page is here to give an explanation to the user about what each rule does, and for developers to find what is, and isn’t yet supported.

all

This contains most files needed for endpoints to be reached.

symlink_fastqs

Creates a symbolic link to the fastq files, so files can easily be accessed.

remove_adapters

Trims common sequencing adapters (using adapters found in data/adapters.fa) from the fastq reads. Also performs quality-based trimming at the start and end of reads, and reads shorter than minlen after quality-based trimming are removed entirely.

mkdir -p {params.proj}.noadpt/{wildcards.sample}
trimmomatic PE -threads {threads} \
    -trimlog {params.proj}.noadpt/{wildcards.sample}/{wildcards.sample}.trimlog \
    -summary {params.proj}.noadpt/{wildcards.sample}/{wildcards.sample}.summary \
    -validatePairs {input.FORWARD} {input.REVERSE} \
    -baseout {params.proj}.noadpt/{wildcards.sample}/{wildcards.sample}.trimmed \
    ILLUMINACLIP:{params.adpt}:2:30:10 SLIDINGWINDOW:4:20 LEADING:{params.leading} TRAILING:{params.trailing} MINLEN:{params.minlen} \
    -phred33 {params.extra}

mv {params.proj}.noadpt/{wildcards.sample}/{wildcards.sample}.trimmed_1P {params.proj}.noadpt/{wildcards.sample}/{wildcards.sample}.trimmed.R1.fq
mv {params.proj}.noadpt/{wildcards.sample}/{wildcards.sample}.trimmed_2P {params.proj}.noadpt/{wildcards.sample}/{wildcards.sample}.trimmed.R2.fq

echo "completed adapter trimming of {wildcards.sample}"

remove_adapters environment.

fastQC_pass1

Runs FastQC to check the quality of reads in each sample.

mkdir -p {params.proj}.noadpt.fastqc
fastqc {input} -o {params.proj}.noadpt.fastqc {params.extra}

fastQC_pass1 environment.

multiqc_pass1

Runs MultiQC to aggregate FastQC reports into one HTML file.

multiqc {params.proj}.noadpt.fastqc -o {params.proj}.noadpt.fastqc/multiqc_report {params.extra}

multiqc_pass1 environment.

trim_forward

This is a secondary trimming step using SeqTK to trim/truncate any set number of base pairs from the start/end of each forward read.

mkdir -p {params.trim_trunc_path}
seqtk trimfq -b {params.trim} -e {params.trunc} {params.extra} {input} > {output} 

trim_forward environment.

trim_reverse

This is a secondary trimming step using SeqTK to trim/truncate any set number of base pairs from the start/end of each reverse read.

mkdir -p {params.trim_trunc_path}
seqtk trimfq -b {params.trim} -e {params.trunc} {params.extra} {input} > {output}

trim_reverse environment.

fastQC_pass2

Runs FastQC to check the quality of reads in each sample.

mkdir -p {params.trim_trunc_path}.fastqc
fastqc {input} -o {params.trim_trunc_path}.fastqc {params.extra}

fastQC_pass2 environment.

multiqc_pass2

Runs MultiQC to aggregate FastQC reports into one HTML file.

multiqc {params.trim_trunc_path}.fastqc -o {params.trim_trunc_path}.fastqc/multiqc_report {params.extra}

multiqc_pass2 environment.

download_hostile_db

This rule downloads the host index to be used for Hostile’s host read filtering.

if [[ "{params.hostile_db_name}" == "human-t2t-hla-argos985" ]]; then
  mkdir -p {params.db_parent_path}
  cd {params.db_parent_path}
  wget -O {params.hostile_db_name}.tar https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o/human-t2t-hla-argos985.tar
  tar -xvf {params.hostile_db_name}.tar
  rm {params.hostile_db_name}.tar
elif [[ "{params.hostile_db_name}" == "human-t2t-hla" ]]; then
  mkdir -p {params.db_parent_path}
  cd {params.db_parent_path}
  wget -O {params.hostile_db_name}.tar https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o/human-t2t-hla.tar
  tar -xvf {params.hostile_db_name}.tar
  rm {params.hostile_db_name}.tar
fi

host_filter

This removes host reads using the tool Hostile.

hostile clean \
--fastq1 {input.FWD} --fastq2 {input.REV} \
-o {params.trim_trunc_path}.nonhost \
--threads {threads} \
--index {params.hostile_db_path} \
--debug \
--aligner bowtie2 \
{params.extra} > {output.REPORT}

# cleanup filepaths
mv {params.trim_trunc_path}.nonhost/{wildcards.sample}.R1.clean_1.fastq.gz {output.FWD}
mv {params.trim_trunc_path}.nonhost/{wildcards.sample}.R2.clean_2.fastq.gz {output.REV}

host_filter environment.

install_R_packages

This installs R packages needed

Rscript {params.installation_script}

install_R_packages environment.

setup_metaphlan

This rule installs the metaphlan database. If the user specifies “latest” or nothing as metaphlan_index_name in the config, it will just download the default latest metaphlan database. Otherwise, users can specify specific versions. The database will be downloaded in the directory specified by metaphlan_bowtie_db in the config file.

mkdir -p {output.loc}

if [ "{params.index_name}" = "latest" ]; then
  metaphlan --install --nproc {threads} --bowtie2db {output.loc} {params.extra}

else
  metaphlan --install --nproc {threads} --bowtie2db {output.loc} --index {params.index_name} {params.extra}

fi

# Option to do it manually if --install doesn't seem to work
# cd {output}
# Can specify whatever version you want here
# wget http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/mpa_vOct22_CHOCOPhlAnSGB_202403.tar
# tar -xvf mpa_vOct22_CHOCOPhlAnSGB_202403.tar
# rm mpa_vOct22_CHOCOPhlAnSGB_202403.tar

setup_metaphlan environment.

get_biobakery_chocophlan_db

This rule installs the ChocoPhlAn database used for HUMAnN.

mkdir -p {output}
humann_databases --download chocophlan full {output} --update-config yes {params.extra}

get_biobakery_chocophlan_db environment.

get_biobakery_uniref_db

This rule installs the UniRef database used for HUMAnN.

mkdir -p {output}
humann_databases --download uniref uniref90_diamond {output} --update-config yes {params.extra}

get_biobakery_uniref_db environment.

get_utility_mapping_db

This rule installs the utility mapping database used for HUMAnN. This database is used to group/rename UniRef protein families to other versions, such as Enzyme Commission numbers, KEGG Orthologies, etc.

mkdir -p {output}
humann_databases --download utility_mapping full {output} --update-config yes {params.extra}

get_utility_mapping_db environment.

concat_nonhost_reads

This rule installs concatenates forward and reverse reads into the same file for HUMAnN.

mkdir -p {params.dirpath}
cat {input.FWD} {input.REV} > {output}

run_humann_nonhost

This rule runs the HUMAnN pipeline, including MetaPhlAn.

mkdir -p {params.dirpath}

if [ "{params.metaphlan_index}" = "latest" ]; then
  # read index name form the latest file
  metaphlan_db_name="$(<{params.metaphlan_bowtie_db}/mpa_latest)"
  
  humann -i {input.NONHUMAN_READS} -o {params.dirpath}/{wildcards.sample} \
  --threads {threads} --search-mode uniref90 \
  --metaphlan-options "--bowtie2db {params.metaphlan_bowtie_db}" \
  {params.extra}

else
  humann -i {input.NONHUMAN_READS} -o {params.dirpath}/{wildcards.sample} \
  --threads {threads} --search-mode uniref90 \
  --metaphlan-options "--bowtie2db {params.metaphlan_bowtie_db} -x {params.metaphlan_index}" \
  {params.extra}

fi

run_humann_nonhost environment.

aggregate_humann_outs_nonhost

This rule aggregates the HUMAnN and MetaPhlAn outputs across all samples into tsv files with all samples. It also groups protein families and renames them based on Metacyc RXNs, KOs, KEGG Modules, KEGG Pathways, GO terms, EGGNOG, and PFAMs.

humann_join_tables -i {params.dirpath} -o {output.PATHABUND} --file_name pathabundance.tsv --search-subdirectories

humann_join_tables -i {params.dirpath} -o {output.PATHCOV} --file_name pathcoverage.tsv --search-subdirectories

humann_join_tables -i {params.dirpath} -o {output.GENEFAMS} --file_name genefamilies.tsv --search-subdirectories

# Metacyc RXN renaming
humann_regroup_table -i {output.GENEFAMS} -g uniref90_rxn -o {output.GENEFAMS_GROUPED_RXN}
humann_rename_table -i {output.GENEFAMS_GROUPED_RXN} -n metacyc-rxn -o {output.GENEFAMS_GROUPED_NAMED_RXN}

# KO renaming
humann_regroup_table -i {output.GENEFAMS} -g uniref90_ko -o {output.GENEFAMS_GROUPED_KO}
humann_rename_table -i {output.GENEFAMS_GROUPED_KO} -n kegg-orthology -o {output.GENEFAMS_GROUPED_NAMED_KO}
humann_rename_table -i {output.GENEFAMS_GROUPED_KO} -n kegg-pathway -o {output.GENEFAMS_GROUPED_NAMED_KP}
humann_rename_table -i {output.GENEFAMS_GROUPED_KO} -n kegg-module -o {output.GENEFAMS_GROUPED_NAMED_KM}

# EGGNOG renaming
humann_regroup_table -i {output.GENEFAMS} -g uniref90_eggnog -o {output.GENEFAMS_GROUPED_EN}

# GO renaming
humann_regroup_table -i {output.GENEFAMS} -g uniref90_go -o {output.GENEFAMS_GROUPED_GO}
humann_rename_table -i {output.GENEFAMS_GROUPED_GO} -n go -o {output.GENEFAMS_GROUPED_NAMED_GO}

# PFAM renaming
humann_regroup_table -i {output.GENEFAMS} -g uniref90_pfam -o {output.GENEFAMS_GROUPED_PFAM}
humann_rename_table -i {output.GENEFAMS_GROUPED_PFAM} -n pfam -o {output.GENEFAMS_GROUPED_NAMED_PFAM}

{params.agg_bugslists} -i {params.dirpath} -o {output.BUGSLIST}

{params.mphlan_conv} -i {params.dirpath} 

aggregate_humann_outs_nonhost environment.

taxa_barplot

This rule runs a .Rmd file that creates microshaded taxa barplots using MetaPhlan taxonomic profiles.

Rscript \
-e "rmarkdown::render('{params.rmd_path}', output_dir='{params.output_dir}', params=list(bugslist='{params.bugslist}', metadata='{params.metadata}', directory='{params.output_dir}'))"

taxa_barplot environment.

func_barplot_rxn

This rule runs a .Rmd file that creates microshaded functional profile barplots using HUMAnN functional profiles. Specifically, it uses Enzyme Commission numbers from the Metacyc RXN number grouped protein families to show the hierarchical breakdown of enzyme classes in your sample.

Rscript \
-e "rmarkdown::render('{params.rmd_path}', output_dir='{params.output_dir}', params=list(genetable='{params.gene_table_rxn}', metadata='{params.metadata}', directory='{params.output_dir}'))"

func_barplot_rxn environment.

calc_gut_metabolic_modules

This rule runs a .Rmd file that groups the KOs from HUMAnN into gut metabolic modules with biologically relevant functions. Abundances are summed within modules and then written to a .csv file.

Rscript \
-e "rmarkdown::render('{params.rmd_path}', output_dir='{params.output_dir}', params=list(input_file='{params.gene_table_ko}', output_file='{params.gmm_output}'))"

calc_gut_metabolic_modules environment.

get_kraken_db

This rule downloads the Kraken2 database for taxonomic classification. It also downloads the kmer distribution files used for bracken’s abundance estimation. More databases can be found here: https://benlangmead.github.io/aws-indexes/k2

# Setup directory
mkdir -p {params.database_dir}
cd {params.database_dir}

# Pull and extract prebuilt index
wget https://genome-idx.s3.amazonaws.com/kraken/k2_core_nt_20240904.tar.gz
tar -xzvf k2_core_nt_20240904.tar.gz

rm k2_core_nt_20240904.tar.gz

get_kraken_db environment.

run_kraken

This rule runs Kraken2 for each sample.

mkdir -p {params.out_dir}

# This has to all be on one line due to the way kraken2 parses it...
kraken2 --gzip-compressed --paired --db {params.database} --threads {threads} --output {output.OUTFILE} --report {output.REPORT} --classified-out {params.out_dir}/{wildcards.sample}_classified#.fq --unclassified-out {params.out_dir}/{wildcards.sample}_unclassified#.fq {params.extra} {input.FWD} {input.REV}

run_kraken environment.

run_bracken

This rule runs Bracken to estimate taxonomic abundance for each sample.

bracken -d {params.database} -i {input.REPORT} -o {output.REPORT} -r 150 -l S -t 10 {params.extra}

run_bracken environment.

aggregate_bracken

This rule aggregates Bracken outputs across samples.

cd {params.dirpath}

# TODO: should probably put this in its own rule, but this works for now
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -xzvf taxdump.tar.gz

( ls *.bracken ) > bracken-sample-name-map.tsv

bit-combine-bracken-and-add-lineage -i bracken-sample-name-map.tsv -o Combined-taxonomy.tsv -d .

rm *.dmp
rm readme.txt
rm taxdump.tar.gz

aggregate_bracken environment.

taxa_barplot_kraken

This rule runs a .Rmd file that creates microshaded taxa barplots using Kraken/Bracken taxonomic profiles.

Rscript \
-e "rmarkdown::render('{params.rmd_path}', output_dir='{params.output_dir}', params=list(bugslist='{params.bugslist}', metadata='{params.metadata}', directory='{params.output_dir}', host_taxon_id='{params.host_taxon_id}'))"

taxa_barplot_kraken environment.

nonpareil

This rule estimates coverage of the nonhost reads based on kmer redundancy.

nonpareil_curves

This rule runs a .Rmd file to visualize coverage of the nonhost fraction across samples.

Rscript \
-e "rmarkdown::render('{params.rmd_path}', output_dir='{params.output_dir}', params=list(npo_path='{params.output_dir}', metadata='{params.metadata}'))"

nonpareil_curves environment.

pull_host_genome

This rule downloads the host (default human GRCh38) reference genome and annotation.

mkdir -p {params.ref_dir}
cd {params.ref_dir}

# genome
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_analysis_set.fna.gz
gunzip GCA_000001405.15_GRCh38_full_analysis_set.fna.gz
mv GCA_000001405.15_GRCh38_full_analysis_set.fna GRCh38_full_analysis_set.fna

# annotation
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gtf.gz
gunzip GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gtf.gz
mv GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gtf GRCh38_full_analysis_set.refseq.gtf

build_host_genome_index

This rule builds an index for mapping the host genome.

mkdir -p {params.ref_dir}
if [ "{params.method}" == "BBMap" ]; then
    bbmap.sh ref={input} path={params.ref_dir} threads={threads} -Xmx{resources.mem_mb}m {params.extra}
elif [ "{params.method}" == "HISAT2" ]; then
    mkdir -p {output}/
    hisat2-build {input} {output}/index -p {threads} {params.extra}
fi

build_host_genome_index environment.

map_host

This rule maps reads to the host genome and compresses SAM files to BAM files.

cd {params.out_dir}

if [ "{params.method}" == "BBMap" ]; then
  bbmap.sh in=../{input.FWD} in2=../{input.REV} \
  out={wildcards.sample}.sam \
  trimreaddescriptions=t \
  threads={threads} \
  -Xmx{resources.mem_mb}m \
  {params.extra}

elif [ "{params.method}" == "HISAT2" ]; then
    hisat2 -1 ../{input.FWD} -2 ../{input.REV} \
    -S {wildcards.sample}.sam \
    -x ref/index \
    -p {threads} \
    {params.extra}
fi

bash {params.sam2bam_path} {wildcards.sample}.sam

map_host environment.

validate_bams

This rule checks if BAM files are valid. It doesn’t stop the pipeline if they aren’t but you can check these files manually.

samtools flagstat --threads {threads} {input.BAM} > {output.BAM_VALID}

validate_bams environment.

generate_feature_counts

This rule uses featureCounts to count host transcripts.

if [ -z {input.BAM} ]
# if there is no input (no host transcriptome mapping), just touch the files
then
  touch {output.COUNTS}
  touch {output.SUMMARY}
else
  featureCounts -T {threads} -p --countReadPairs \
  -t exon -g gene_id -a {input.ANNOTATION} -o {output.COUNTS} {params.extra} {input.BAM}
fi

generate_feature_counts environment.

feature_counts_to_tpm

This rule converts featureCounts output to transcripts per million (TPM).

{params.converter_script} {input.COUNTS} --output {output.TPM}

feature_counts_to_tpm environment.

reads_breakdown

This rule aggregates read counts at each step, including raw reads, post-QC reads, host reads, non-host reads, and non-host reads that were unmapped. It outputs a CSV report of this with data per sample.

{params.reporter_script} {input.METADATA} \
--raw_reads_dir {params.proj} \
--hostile_dir {params.hostile} \
--genefams_filepath {input.GENEFAMS} \
--output {output.REPORT}

reads_breakdown environment.