Advanced Typing of Trypanosoma cruzi through NGS Sequencing

This site is intended for typing Trypanosoma cruzi discrete typing units (DTUs) by amplicon sequencing of the minicircle hypervariable region (mHVR). Here, you can:

1) Download the Reference Databases of mHVR sequences for different strains and DTUs of Trypanosoma cruzi.

2) Run a Google Colab notebook to analyze your own mHVR reads from a strain or sample and assign a DTU (see the Typing tool section).

We recommend visiting the Tutorial section, which offers a step-by-step guide and a video tutorial.

Typing tool

The link below opens a Google Colab notebook that runs the bioinformatic tools required for typing in the Google cloud. The notebook is designed to make the bioinformatic steps of Trypanosoma cruzi DTU typing from mHVR amplicon sequencing data easy to run. If you use this tool, you do not need to download the Reference Databases to your computer, because the Google Colab notebook accesses them on GitHub.

Access Google Colab

Reference Databases

The mHVR reference sequences utilized for DTU typing are available for download below. We constructed two different reference databases by employing 85% and 95% pairwise identity thresholds to define mHVR clusters. For each cluster, the most frequently occurring mHVR sequence was selected as the representative reference. The reference dataset is in FASTA format, where the number after ">" indicates the mHVR cluster ID, and the strain name is also provided. You can find the DTU for each mHVR cluster ID in the "Cluster identity" file.

Set 95%. Recommended

Set 85%. Use only to identify the main DTU in a sample when the 95% set fails to identify it.
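
The FASTA layout described in this section can be read with a few lines of Python. The sketch below is only an illustration: the function name parse_reference_fasta is hypothetical, the strain names and sequences in the example are made up, and the header layout (cluster ID first, then the strain name after whitespace) is an assumption. Check a downloaded file and adjust the split if the headers differ.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def parse_reference_fasta(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Map cluster ID -> (strain name, sequence).

    Assumes headers look like '>CLUSTER_ID STRAIN_NAME' (an assumption,
    not confirmed by the site). Sequences may span several lines.
    """
    records: Dict[str, Tuple[str, str]] = {}
    cluster_id: Optional[str] = None
    strain = ""
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if cluster_id is not None:
                records[cluster_id] = (strain, "".join(chunks))
            parts = line[1:].split(maxsplit=1)
            if not parts:
                raise ValueError("FASTA header without a cluster ID")
            cluster_id = parts[0]
            strain = parts[1] if len(parts) > 1 else ""
            chunks = []
        else:
            chunks.append(line)
    # Store the last record in the file.
    if cluster_id is not None:
        records[cluster_id] = (strain, "".join(chunks))
    return records


# Made-up records, only to show the expected shape of the result.
example = [">12 StrainA", "ACGT", "TTGA", ">7 StrainB", "GGCC"]
refs = parse_reference_fasta(example)
```

To read a real reference set, pass an open file handle instead of the list. The DTU of each cluster ID can then be looked up in the "Cluster identity" file.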

Tutorial

This video shows, step by step, how to use the Google Colab notebook for typing.

Step-by-step Guide

1. Upload your ".fastq" or ".fastq.gz" files to a folder in your Google Drive. All ".fastq" or ".fastq.gz" files in the folder will be analyzed (you should have an R1 and an R2 file for each sample). Do not use a folder name that contains spaces, e.g. use "my_seqs" instead of "my seqs". Using Google Drive is recommended because files are not deleted when the notebook disconnects. However, the Colab will have access to your Google Drive files, so use this option only if you trust the Colab provider.

2. Alternatively, you can create a folder in the Colab session: click the folder icon in the left panel, then right-click and choose "new folder". Upload your files to that folder. An example dataset is provided for testing.

Set Parameters:

Set the "folder" parameter to the name of the folder in your Google Drive or of the local folder in the Colab.
Set the reference set (set 95 for more specificity on co-infections or set 85 for samples that cannot be typed with the 95 referenceSet).
Ensure that the file names for each sample end with the values provided in filename_end_R1 and filename_end_R2, followed by the file type,

e.g. if your filename for R1 reads is sar1006-2022_H1_S9_R1_001.fastq.gz

set

filename_end_R1 = 'R1_001'

my_file_type = '.fastq.gz'

e.g. if your filename for R2 reads is sar1007-2023_H1_S9_R2.fastq

set

filename_end_R2 = 'R2'

my_file_type = '.fastq'

If the 'gdrive' checkbox is checked, Google Drive files will be used. If it is not checked, the local folder where you uploaded your sequences will be used. Select use_example_dataset if you do not have a dataset but want to test this tool (an example dataset will be downloaded from GitHub).
Press Ctrl + F9 to run all cells, or run them one at a time by clicking the play button to the left of each cell. If you selected gdrive, a Google Drive login page will open and you will need to authorize Colab to access your files.
The summary will be printed in the last cell. The results are also saved in the folder "Typing_Results", inside the folder you specified in Google Drive or inside the local folder.
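
The file-name matching described in the Set Parameters step can be checked before a run. The sketch below is not the notebook's own code: the helper name pair_reads is hypothetical, and only the parameter names filename_end_R1, filename_end_R2 and my_file_type come from the notebook. It assumes that a file belongs to R1 or R2 when its name ends with the corresponding value followed by the file type, and that the R1 and R2 files of a sample share everything before that ending.

```python
from typing import Dict, Iterable, List, Tuple


def pair_reads(
    names: Iterable[str],
    filename_end_R1: str,
    filename_end_R2: str,
    my_file_type: str,
) -> Tuple[Dict[str, Tuple[str, str]], List[str]]:
    """Pair R1 and R2 file names by their shared prefix.

    Returns (pairs, unpaired): pairs maps each sample prefix to its
    (R1, R2) file names; unpaired lists prefixes missing one mate.
    """
    suffix_r1 = filename_end_R1 + my_file_type
    suffix_r2 = filename_end_R2 + my_file_type
    names = list(names)
    # Strip the suffix to obtain the part of the name shared by both mates.
    r1 = {n[: -len(suffix_r1)]: n for n in names if n.endswith(suffix_r1)}
    r2 = {n[: -len(suffix_r2)]: n for n in names if n.endswith(suffix_r2)}
    pairs = {p: (r1[p], r2[p]) for p in sorted(r1.keys() & r2.keys())}
    unpaired = sorted(r1.keys() ^ r2.keys())
    return pairs, unpaired


# Example file names in the style used in this guide.
files = [
    "sar1006-2022_H1_S9_R1_001.fastq.gz",
    "sar1006-2022_H1_S9_R2_001.fastq.gz",
    "sar1007-2023_H1_S9_R1_001.fastq.gz",  # its R2 mate is missing
]
pairs, unpaired = pair_reads(files, "R1_001", "R2_001", ".fastq.gz")
```

Here pairs holds one complete sample and unpaired reports the prefix of the sample whose R2 file is missing, which is worth fixing before starting the notebook.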

Important! If a step is interrupted because Colab disconnects, you can resume the run, provided you selected the gdrive option:

Run 'Set Parameters' cell
Run the cells that were not completed

After specifying the parameters and setting up input and output paths for data analysis, the workflow will follow the steps described in the Materials and Methods section of the associated publication.

Example Reads

This downloadable .zip file contains the mHVR reads used in the example workflow in the Colab. You do not need to download it to run the example in Google Colab.

Download mHVR Reads

Citations

Please cite the following papers when using this tool:

Rusman F, Díaz AG, Ponce T, Floridia-Yapur N, Barnabé C, Diosque P, Tomasini N. Wide reference databases for typing Trypanosoma cruzi based on amplicon sequencing of the minicircle hypervariable region. PLOS Neglected Tropical Diseases. 2023;17(11):e0011764. https://doi.org/10.1371/journal.pntd.0011764

Bolger AM, Lohse M, Usadel B. Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30(15):2114-20. https://doi.org/10.1093/bioinformatics/btu170

Renaud G, Stenzel U, Kelso J. LeeHom: Adaptor trimming and merging for Illumina sequencing reads. Nucleic Acids Res. 2014;42(18):e141. https://doi.org/10.1093/nar/gku699

Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010;26(19):2460–2461. https://doi.org/10.1093/bioinformatics/btq461
