10  MALT alignment and processing

10.1 Build a custom MALT database based on the previous results from the filtering of KrakenUniq

First we need to build the custom database with the rule Build_Malt_DB. This is done with the use of a python script. Inside the script, python will use the input and output variables provided by snakemake.

WARNING: We won’t be able to run this command during the workshop because the files seqid2taxid.map.orig, library.fna and nucl_gb.accession2taxid were too large for our workshop folder.

rule Build_Malt_DB:
    output:
        seqid2taxid_project="results/MALT_DB/seqid2taxid.project.map",
        seqids_project="results/MALT_DB/seqids.project",
        project_headers="results/MALT_DB/project.headers",
        project_fasta="results/MALT_DB/library.project.fna",
        db=directory("results/MALT_DB/maltDB.dat"),
    input:
        unique_taxids=ancient("results/KRAKENUNIQ_ABUNDANCE_MATRIX/unique_species_taxid_list.txt"),
    params:
        seqid2taxid=config["malt_seqid2taxid_db"],
        nt_fasta=config["malt_nt_fasta"],
        accession2taxid=config["malt_accession2taxid"],
    log:
        "logs/BUILD_MALT_DB/BUILD_MALT_DB.log",
    script:
        "../scripts/malt-build.py"

Here is a shell version of this rule:

python ../scripts/malt-build.py \
    --seqid2taxid seqid2taxid.map.orig \
    --nt_fasta library.fna \
    --accession2taxid nucl_gb.accession2taxid \
    --unique_taxids results/KRAKENUNIQ_ABUNDANCE_MATRIX/unique_species_taxid_list.txt \
    --output_dir results/MALT_DB/ \
    --threads 20 &> logs/BUILD_MALT_DB/BUILD_MALT_DB.log

In summary, this command will build a custom malt database containing all the species detected by KrakenUniq that have passed are threshold. You can find the full list of species in results/KRAKENUNIQ_ABUNDANCE_MATRIX/unique_species_taxid_list.txt.

10.2 Align the FastQ files using MALT

WARNING: We won’t be able to run this command during the workshop. The expected output is provided.

rule Malt:
    output:
        rma6="results/MALT/{sample}.trimmed.rma6",
        sam="results/MALT/{sample}.trimmed.sam.gz",
    input:
        fastq=ancient("results/CUTADAPT_ADAPTER_TRIMMING/{sample}.trimmed.fastq.gz"),
        db=ancient("results/MALT_DB/maltDB.dat"),
    params:
        gunzipped_sam="results/MALT/{sample}.trimmed.sam",
    log:
        "logs/MALT/{sample}.log",
    shell:
        "unset DISPLAY; malt-run -at SemiGlobal -m BlastN -i {input.fastq} -o {output.rma6} -a {params.gunzipped_sam} -t {threads} -d {input.db} &> {log}"

Here is a shell version of this rule:

for sample in $(ls results/CUTADAPT_ADAPTER_TRIMMING/*.fastq.gz); do
        unset DISPLAY
        malt-run -at SemiGlobal -m BlastN -i results/CUTADAPT_ADAPTER_TRIMMING/{sample}.trimmed.fastq.gz \
                -o results/MALT/{sample}.trimmed.rma6 \
                -a results/MALT/{sample}.trimmed.sam \
                -t 20 -d results/MALT_DB/maltDB.dat \
                &> logs/MALT/{sample}.log
        gzip results/MALT/{sample}.trimmed.sam
done

In summary, we loop through the FastQ files and run an alignment with MALT and gzip the generated SAM file. The “unset DISPLAY” command makes it possible to run MALT without it trying to open a graphical interface which can generate an error.

10.3 Quantify the microbial abundance using the SAM file generated by the MALT alignment

rule Malt_QuantifyAbundance:
    output:
        out_dir=directory("results/MALT_QUANTIFY_ABUNDANCE/{sample}"),
        counts="results/MALT_QUANTIFY_ABUNDANCE/{sample}/sam_counts.txt",
    input:
        sam=ancient("results/MALT/{sample}.trimmed.sam.gz"),
    params:
        unique_taxids="results/KRAKENUNIQ_ABUNDANCE_MATRIX/unique_species_taxid_list.txt",
        exe=WORKFLOW_DIR / "scripts/malt_quantify_abundance.py",
    log:
        "logs/MALT_QUANTIFY_ABUNDANCE/{sample}.log",
    shell:
        "mkdir -p {output.out_dir}; "
        "{params.exe} {input.sam} {params.unique_taxids} > {output.counts} 2> {log}"

Here is a shell version of this rule:

for sample in $(ls results/CUTADAPT_ADAPTER_TRIMMING/*.fastq.gz); do
        python scripts/malt_quantify_abundance.py results/MALT/{sample}.trimmed.sam.gz results/KRAKENUNIQ_ABUNDANCE_MATRIX/unique_species_taxid_list.txt > results/MALT_QUANTIFY_ABUNDANCE/{sample}/sam_counts.txt 2> logs/MALT_QUANTIFY_ABUNDANCE/{sample}.log
done

In summary, this python script takes the results of the MALT alignment and counts the amount of reads per species.

Please run the script provided for this part like this:

sbatch Malt_QuantifyAbundance.sh --account=your_username

After your job has finished, please check the output file for one sample:

cat results/MALT_QUANTIFY_ABUNDANCE/sample1/sam_counts.txt

10.4 Compute the MALT abundance matrix using the SAM files generated by MALT

rule Malt_AbundanceMatrix_Sam:
    output:
        out_dir=directory("results/MALT_ABUNDANCE_MATRIX_SAM"),
        abundance_matrix="results/MALT_ABUNDANCE_MATRIX_SAM/malt_abundance_matrix_sam.txt",
    input:
        sam_counts=expand(
            "results/MALT_QUANTIFY_ABUNDANCE/{sample}/sam_counts.txt", sample=SAMPLES
        ),
    log:
        "logs/MALT_ABUNDANCE_MATRIX_SAM/MALT_ABUNDANCE_MATRIX_SAM.log",
    params:
        exe=WORKFLOW_DIR / "scripts/malt_abundance_matrix.R",
    shell:
        "Rscript {params.exe} results/MALT_QUANTIFY_ABUNDANCE {output.out_dir} &> {log}"

Here is a shell version of this rule:

for sample in $(ls results/CUTADAPT_ADAPTER_TRIMMING/*.fastq.gz); do
        Rscript scripts/malt_abundance_matrix.R results/MALT_QUANTIFY_ABUNDANCE/ results/MALT_ABUNDANCE_MATRIX_SAM/ > logs/MALT_ABUNDANCE_MATRIX_SAM/MALT_ABUNDANCE_MATRIX_SAM.log 2>&1
done

In summary, this R script takes the abundance of all samples and creates a matrix.

Please run the script provided for this part like this:

sbatch Malt_QuantifyAbundance_Sam.sh --account=your_username

After your job has finished, please check the output file:

cat results/MALT_ABUNDANCE_MATRIX_SAM/malt_abundance_matrix_sam.txt

10.5 Compute the MALT abundance matrix using the rma6 files generated by MALT

rule Malt_AbundanceMatrix_Rma6:
    output:
        out_dir=directory("results/MALT_ABUNDANCE_MATRIX_RMA6"),
        abundance_matrix="results/MALT_ABUNDANCE_MATRIX_RMA6/malt_abundance_matrix_rma6.txt",
    input:
        rma6=expand("results/MALT/{sample}.trimmed.rma6", sample=SAMPLES),
    params:
        exe=WORKFLOW_DIR / "scripts/rma-tabuliser",
    log:
        "logs/MALT_ABUNDANCE_MATRIX_RMA6/MALT_ABUNDANCE_MATRIX_RMA6.log",
    shell:
        "{params.exe} -d $(dirname {input.rma6}) -r 'S' &> {log}; "
        "mv results/MALT/count_table.tsv {output.out_dir}; "
        "mv {output.out_dir}/count_table.tsv {output.abundance_matrix}"

Here is a shell version of this rule:

for sample in $(ls results/CUTADAPT_ADAPTER_TRIMMING/*.fastq.gz); do
    sample_name=$(basename $sample .trimmed.fastq.gz)
    ./scripts/rma-tabuliser -d results/MALT/${sample_name}.trimmed.rma6 -r 'S' &> logs/MALT_ABUNDANCE_MATRIX_RMA6/MALT_ABUNDANCE_MATRIX_RMA6.log;
    mv results/MALT/count_table.tsv results/MALT_ABUNDANCE_MATRIX_RMA6/;
    mv results/MALT_ABUNDANCE_MATRIX_RMA6/count_table.tsv results/MALT_ABUNDANCE_MATRIX_RMA6/malt_abundance_matrix_rma6.txt
done

In summary, we use the tool rma-tabuliser to process the rma6 file generated by the MALT alignment. This file contains LCA (latest common ancestor) information for the reads. This command will generate a taxonomy table, with node names and filtered at the species (S) level inside the MALT folder called count_table.tsv. We then move that table to the designated output folder of our rule. We also rename the file to better describe what it contains.

Please run the script provided for this part like this:

sbatch Malt_AbundanceMatrix_Rma6.sh --account=your_username

After your job has finished, please check the output file:

cat results/MALT_ABUNDANCE_MATRIX_SAM/malt_abundance_matrix_sam.txt