This site is intended for typing Trypanosoma cruzi DTUs using amplicon sequencing of the minicircle hypervariable region (mHVR). Here, you can:
1) Download the Reference Databases of mHVR sequences for different strains and DTUs of Trypanosoma cruzi.
2) Run a Google Colab notebook to analyze your own mHVR reads from a strain or sample and assign a DTU (see the Typing tool section).
We recommend visiting the Tutorial section, which includes a step-by-step guide and a video tutorial.
Typing tool
The link below will redirect you to a Google Colab notebook that runs the bioinformatic tools required for typing in the Google cloud. The notebook is designed to make the bioinformatic steps for Trypanosoma cruzi DTU typing from mHVR amplicon sequencing data easy to run. You do not need to download the Reference Databases to your computer if you use this tool, because the Google Colab notebook accesses them on GitHub.
Reference Databases
The mHVR reference sequences utilized for DTU typing are available for download below. We constructed two different reference databases by employing 85% and 95% pairwise identity thresholds to define mHVR clusters. For each cluster, the most frequently occurring mHVR sequence was selected as the representative reference. The reference dataset is in FASTA format, where the number after ">" indicates the mHVR cluster ID, and the strain name is also provided. You can find the DTU for each mHVR cluster ID in the "Cluster identity" file.
Set 95%. Recommended
Set 85%. Use only to identify the main DTU in a sample when the 95% set fails to do so.
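The FASTA layout described above can be read with a few lines of Python. The sketch below is only an illustration: it assumes that each header line looks like ">clusterID strainName", with the cluster ID as the first whitespace-separated token and the rest of the line as the strain name. The function names and the example records are hypothetical and are not taken from the reference databases, so check the header format of the downloaded file before relying on it.

```python
def _make_record(header, chunks):
    # Assumption: the cluster ID is the first token of the header and
    # the remainder of the header is the strain name.
    parts = header.split(maxsplit=1)
    cluster_id = parts[0] if parts else ""
    strain = parts[1] if len(parts) > 1 else ""
    return cluster_id, strain, "".join(chunks)


def parse_reference_fasta(text):
    """Return a list of (cluster_id, strain, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                records.append(_make_record(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped, so collect them all.
            chunks.append(line)
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


# Made-up example data, for illustration only.
example = ">12 StrainA\nACGT\nTTGA\n>7 StrainB\nGGCC\n"
records = parse_reference_fasta(example)
for cluster_id, strain, sequence in records:
    print(cluster_id, strain, len(sequence))
```

The cluster IDs obtained this way are the keys to look up in the "Cluster identity" file.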
Tutorial
This video shows, step by step, how to use the Google Colab notebook for typing.
Step-by-step Guide
1. Load ".fastq" or ".fastq.gz" files into a folder in your Google Drive. All ".fastq" or ".fastq.gz" reads in the folder will be analyzed (you should have R1 and R2 files for each sample). Don't use a folder name with spaces, e.g. instead of "my seqs" use "my_seqs". Using Google Drive is recommended because files are not deleted when the notebook is disconnected. However, you need to trust the Colab provider, because the notebook will have access to your Google Drive files.
2. Alternatively, you can create a folder in Colab by clicking on the folder icon in the left panel, then right-clicking and selecting "new folder". Upload your files to that folder. An example dataset is provided for testing.
Set Parameters:
- Set the "folder" parameter to the name of the folder in your Google Drive or of the local folder in Colab.
- Set the reference set (set 95 for more specificity on co-infections, or set 85 for samples that cannot be typed with the 95 reference set).
- Ensure that the file names of each sample end with the values provided in filename_end_R1 and filename_end_R2.
- If the checkbox 'gdrive' is checked, Google Drive files will be used. If it is not checked, the local folder where you uploaded your sequences will be used.
- Select use_example_dataset if you do not have a dataset but want to test this tool (an example dataset will be downloaded from GitHub).
- Press Ctrl + F9 to run all cells, or run each one at a time by clicking on the play button to the left of each cell. If you selected gdrive, a Google Drive login page will open; you need to authorize Colab to access your files.
- The summary will be printed in the last cell. Additionally, you can find the results in your Google Drive. The results should be in the folder "Typing_Results" within the folder you specified in Google Drive or in the local folder.
e.g. if your filename for R1 reads is sar1006-2022_H1_S9_R1_001.fastq.gz
set
filename_end_R1 = 'R1_001'
my_file_type = '.fastq.gz'
e.g. if your filename for R2 reads is sar1007-2023_H1_S9_R2.fastq
set
filename_end_R2 = 'R2'
my_file_type = '.fastq'
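To check a folder before running the notebook, the naming rule can be sketched in Python. This is not the notebook's own code: the function find_read_pairs and the sample file names are hypothetical, and the sketch assumes that a file matches when its name ends with filename_end_R1 (or filename_end_R2) immediately followed by my_file_type.

```python
import tempfile
from pathlib import Path


def find_read_pairs(folder, filename_end_R1, filename_end_R2, my_file_type):
    """Pair R1 and R2 files that share the same sample prefix.

    Returns (pairs, missing): pairs maps the sample prefix to the
    (R1 name, R2 name) tuple; missing lists R1 files with no R2 mate.
    """
    folder = Path(folder)
    r1_suffix = filename_end_R1 + my_file_type
    r2_suffix = filename_end_R2 + my_file_type
    pairs, missing = {}, []
    for path in sorted(folder.iterdir()):
        if not path.name.endswith(r1_suffix):
            continue
        # Everything before the R1 suffix identifies the sample.
        sample = path.name[: -len(r1_suffix)]
        mate = folder / (sample + r2_suffix)
        if mate.exists():
            pairs[sample] = (path.name, mate.name)
        else:
            missing.append(path.name)
    return pairs, missing


# Demonstration with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["s1_R1_001.fastq.gz", "s1_R2_001.fastq.gz", "s2_R1_001.fastq.gz"]:
        (Path(tmp) / name).touch()
    pairs, missing = find_read_pairs(tmp, "R1_001", "R2_001", ".fastq.gz")

print(pairs)
print(missing)
```

Any file reported in missing has an R1 file without its R2 counterpart, which usually points to a mismatch between the file names and the filename_end_R1, filename_end_R2 or my_file_type values.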
Important! If a step was interrupted by a Colab disconnection, you can run it again, provided that you selected the gdrive option:
- Run 'Set Parameters' cell
- Run the cells that were not completed
After specifying the parameters and setting up input and output paths for data analysis, the workflow will follow the steps described in the Materials and Methods section of the associated publication.
Example Reads
The downloadable .zip file contains the mHVR sequences used in the example workflow in the Colab notebook. You do not need to download this dataset to run the example in Google Colab.
Download mHVR Reads
Citations
Please cite the following papers when using this tool:
Rusman F, Díaz AG, Ponce T, Floridia-Yapur N, Barnabé C, Diosque P, Tomasini N. Wide reference databases for typing Trypanosoma cruzi based on amplicon sequencing of the minicircle hypervariable region. PLOS Neglected Tropical Diseases. 2023;17(11):e0011764. https://doi.org/10.1371/journal.pntd.0011764
Bolger AM, Lohse M, Usadel B. Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30(15):2114-20. https://doi.org/10.1093/bioinformatics/btu170
Renaud G, Stenzel U, Kelso J. LeeHom: Adaptor trimming and merging for Illumina sequencing reads. Nucleic Acids Res. 2014;42(18):e141. https://doi.org/10.1093/nar/gku699
Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010;26(19):2460–2461. https://doi.org/10.1093/bioinformatics/btq461