10 MALT alignment and processing
10.1 Build a custom MALT database based on the previous results from the filtering of KrakenUniq
First we need to build the custom database with the rule Build_Malt_DB. This is done with the use of a python script. Inside the script, python will use the input and output variables provided by snakemake.
WARNING: We won’t be able to run this command during the workshop because the files seqid2taxid.map.orig, library.fna and nucl_gb.accession2taxid were too large for our workshop folder.
rule Build_Malt_DB:
output:
seqid2taxid_project="results/MALT_DB/seqid2taxid.project.map",
seqids_project="results/MALT_DB/seqids.project",
project_headers="results/MALT_DB/project.headers",
project_fasta="results/MALT_DB/library.project.fna",
db=directory("results/MALT_DB/maltDB.dat"),
input:
unique_taxids=ancient("results/KRAKENUNIQ_ABUNDANCE_MATRIX/unique_species_taxid_list.txt"),
params:
seqid2taxid=config["malt_seqid2taxid_db"],
nt_fasta=config["malt_nt_fasta"],
accession2taxid=config["malt_accession2taxid"],
log:
"logs/BUILD_MALT_DB/BUILD_MALT_DB.log",
script:
"../scripts/malt-build.py"
Here is a shell version of this rule:
python ../scripts/malt-build.py \
--seqid2taxid seqid2taxid.map.orig \
--nt_fasta library.fna \
--accession2taxid nucl_gb.accession2taxid \
--unique_taxids results/KRAKENUNIQ_ABUNDANCE_MATRIX/unique_species_taxid_list.txt \
--output_dir results/MALT_DB/ \
--threads 20 &> logs/BUILD_MALT_DB/BUILD_MALT_DB.log
In summary, this command will build a custom malt database containing all the species detected by KrakenUniq that have passed are threshold. You can find the full list of species in results/KRAKENUNIQ_ABUNDANCE_MATRIX/unique_species_taxid_list.txt.
10.2 Align the FastQ files using MALT
WARNING: We won’t be able to run this command during the workshop. The expected output is provided.
rule Malt:
output:
rma6="results/MALT/{sample}.trimmed.rma6",
sam="results/MALT/{sample}.trimmed.sam.gz",
input:
fastq=ancient("results/CUTADAPT_ADAPTER_TRIMMING/{sample}.trimmed.fastq.gz"),
db=ancient("results/MALT_DB/maltDB.dat"),
params:
gunzipped_sam="results/MALT/{sample}.trimmed.sam",
log:
"logs/MALT/{sample}.log",
shell:
"unset DISPLAY; malt-run -at SemiGlobal -m BlastN -i {input.fastq} -o {output.rma6} -a {params.gunzipped_sam} -t {threads} -d {input.db} &> {log}"
Here is a shell version of this rule:
for sample in $(ls results/CUTADAPT_ADAPTER_TRIMMING/*.fastq.gz); do
unset DISPLAY
malt-run -at SemiGlobal -m BlastN -i results/CUTADAPT_ADAPTER_TRIMMING/{sample}.trimmed.fastq.gz \
-o results/MALT/{sample}.trimmed.rma6 \
-a results/MALT/{sample}.trimmed.sam \
-t 20 -d results/MALT_DB/maltDB.dat \
&> logs/MALT/{sample}.log
gzip results/MALT/{sample}.trimmed.sam
done
In summary, we loop through the FastQ files and run an alignment with MALT and gzip the generated SAM file. The “unset DISPLAY” command makes it possible to run MALT without it trying to open a graphical interface which can generate an error.
10.3 Quantify the microbial abundance using the SAM file generated by the MALT alignment
rule Malt_QuantifyAbundance:
output:
out_dir=directory("results/MALT_QUANTIFY_ABUNDANCE/{sample}"),
counts="results/MALT_QUANTIFY_ABUNDANCE/{sample}/sam_counts.txt",
input:
sam=ancient("results/MALT/{sample}.trimmed.sam.gz"),
params:
unique_taxids="results/KRAKENUNIQ_ABUNDANCE_MATRIX/unique_species_taxid_list.txt",
exe=WORKFLOW_DIR / "scripts/malt_quantify_abundance.py",
log:
"logs/MALT_QUANTIFY_ABUNDANCE/{sample}.log",
shell:
"mkdir -p {output.out_dir}; "
"{params.exe} {input.sam} {params.unique_taxids} > {output.counts} 2> {log}"
Here is a shell version of this rule:
for sample in $(ls results/CUTADAPT_ADAPTER_TRIMMING/*.fastq.gz); do
python scripts/malt_quantify_abundance.py results/MALT/{sample}.trimmed.sam.gz results/KRAKENUNIQ_ABUNDANCE_MATRIX/unique_species_taxid_list.txt > results/MALT_QUANTIFY_ABUNDANCE/{sample}/sam_counts.txt 2> logs/MALT_QUANTIFY_ABUNDANCE/{sample}.log
done
In summary, this python script takes the results of the MALT alignment and counts the amount of reads per species.
Please run the script provided for this part like this:
sbatch Malt_QuantifyAbundance.sh --account=your_username
After your job has finished, please check the output file for one sample:
cat results/MALT_QUANTIFY_ABUNDANCE/sample1/sam_counts.txt
10.4 Compute the MALT abundance matrix using the SAM files generated by MALT
rule Malt_AbundanceMatrix_Sam:
output:
out_dir=directory("results/MALT_ABUNDANCE_MATRIX_SAM"),
abundance_matrix="results/MALT_ABUNDANCE_MATRIX_SAM/malt_abundance_matrix_sam.txt",
input:
sam_counts=expand(
"results/MALT_QUANTIFY_ABUNDANCE/{sample}/sam_counts.txt", sample=SAMPLES
),
log:
"logs/MALT_ABUNDANCE_MATRIX_SAM/MALT_ABUNDANCE_MATRIX_SAM.log",
params:
exe=WORKFLOW_DIR / "scripts/malt_abundance_matrix.R",
shell:
"Rscript {params.exe} results/MALT_QUANTIFY_ABUNDANCE {output.out_dir} &> {log}"
Here is a shell version of this rule:
for sample in $(ls results/CUTADAPT_ADAPTER_TRIMMING/*.fastq.gz); do
Rscript scripts/malt_abundance_matrix.R results/MALT_QUANTIFY_ABUNDANCE/ results/MALT_ABUNDANCE_MATRIX_SAM/ > logs/MALT_ABUNDANCE_MATRIX_SAM/MALT_ABUNDANCE_MATRIX_SAM.log 2>&1
done
In summary, this R script takes the abundance of all samples and creates a matrix.
Please run the script provided for this part like this:
sbatch Malt_QuantifyAbundance_Sam.sh --account=your_username
After your job has finished, please check the output file:
cat results/MALT_ABUNDANCE_MATRIX_SAM/malt_abundance_matrix_sam.txt
10.5 Compute the MALT abundance matrix using the rma6 files generated by MALT
rule Malt_AbundanceMatrix_Rma6:
output:
out_dir=directory("results/MALT_ABUNDANCE_MATRIX_RMA6"),
abundance_matrix="results/MALT_ABUNDANCE_MATRIX_RMA6/malt_abundance_matrix_rma6.txt",
input:
rma6=expand("results/MALT/{sample}.trimmed.rma6", sample=SAMPLES),
params:
exe=WORKFLOW_DIR / "scripts/rma-tabuliser",
log:
"logs/MALT_ABUNDANCE_MATRIX_RMA6/MALT_ABUNDANCE_MATRIX_RMA6.log",
shell:
"{params.exe} -d $(dirname {input.rma6}) -r 'S' &> {log}; "
"mv results/MALT/count_table.tsv {output.out_dir}; "
"mv {output.out_dir}/count_table.tsv {output.abundance_matrix}"
Here is a shell version of this rule:
for sample in $(ls results/CUTADAPT_ADAPTER_TRIMMING/*.fastq.gz); do
sample_name=$(basename $sample .trimmed.fastq.gz)
./scripts/rma-tabuliser -d results/MALT/${sample_name}.trimmed.rma6 -r 'S' &> logs/MALT_ABUNDANCE_MATRIX_RMA6/MALT_ABUNDANCE_MATRIX_RMA6.log;
mv results/MALT/count_table.tsv results/MALT_ABUNDANCE_MATRIX_RMA6/;
mv results/MALT_ABUNDANCE_MATRIX_RMA6/count_table.tsv results/MALT_ABUNDANCE_MATRIX_RMA6/malt_abundance_matrix_rma6.txt
done
In summary, we use the tool rma-tabuliser to process the rma6 file generated by the MALT alignment. This file contains LCA (latest common ancestor) information for the reads. This command will generate a taxonomy table, with node names and filtered at the species (S) level inside the MALT folder called count_table.tsv. We then move that table to the designated output folder of our rule. We also rename the file to better describe what it contains.
Please run the script provided for this part like this:
sbatch Malt_AbundanceMatrix_Rma6.sh --account=your_username
After your job has finished, please check the output file:
cat results/MALT_ABUNDANCE_MATRIX_SAM/malt_abundance_matrix_sam.txt