Fungal ITS Amplicon Analysis Workflow

This section outlines the workflow required to analyse amplicon sequences of the internal transcribed spacer (ITS) region located between the small and large rRNA subunits to produce Amplicon Sequence Variant (ASV) information for the Australian microbiome Initiative. Amplicons are derived from primers targeting the fungal ITS1 and ITS4 regions (ITS1F and ITS4).
Analysis is completed on a per sequencing run (sequencing plate) basis. The workflow consists of the following stages:

 A] Sequence preparation and merging
1. Convert fastq file format to fasta file format
2. Identify and isolate putative fungal ITS1 and ITS2 regions from R1 and R2 reads
3. Add sampleID, runID and "sample=" information to the sequence headers
4. Concatenate all sequences per sequencing run into a single file
B] Sequence analysis
1. Quality screening, zotu/ASV calling and sequence mapping
2. Classify and remove flipped sequences
3. Replace arbitrary ASV ID's with the sequence itself in the table index
C] Following analysis, all data is combined to give a single dataset, by the following steps:
1. Merge tables into a single table
2. Remove controls from the abundance tables, to create separate samples and control datasets
3. Make a fasta from the abundance table
4. Classify Sequences

Software Required

The following software is used in the steps below:

 1. ITSx (Bengtsson-Palme et al., 2013)
2. Mothur (Schloss, et al., 2009)
3. USEARCH (Edgar 2010)
4. Seqtk (https://github.com/lh3/seqtk)
5. Fastx (http://hannonlab.cshl.edu/fastx_toolkit/)
6. Python3.X

A] Sequence preparation and merging

Identify and isolate ITS1 and ITS2 regions

Illumina fastq R1 and R2 files are first converted to fasta file format using SeqTk. In addition, SeqTk is used to generate the reverse complement of R2 reads.

ITSx (Bengtsson-Palme et al., 2013) is the used to identify and isolate fungal ITS1 and ITS2 regions from neighbouring ribosomal genes (SSU, 5S and LSU rRNA sequences). Arguments used for ITSx are as follows:

 -t F --complement F --preserve T --partial 100 --save_regions ITSn --detailed_results T

R1 and R2 reads not identified as ITS by ITSx are discarded

Rename files

File names are formatted to the following format:

 sampleID_plateID.fasta 

The sampleID of the mock communities and negative controls on each plate are standardised to the following naming convention (all listed for completeness):

- 16S Bacteria mock communities: BACMOCK
- A16 Archaeal mock communities: ARCMOCK
- 18Sv4 Eukaryote mock communities: EUKMOCKv4
- 18Sv9 Eukaryote mock communities: EUKMOCKv9
- ITS Fungal mock communities: FUNMOCK
- Negative control: NEG
- STAN: STAN

Add sample and file name to the sequence headers

Sample identifiers are added to the header of each sequence for downstream processing. As each fasta is now named SampleID_plateID.fasta we simply add the file name with the extension to the sequence header. At this stage we also add any other information and delimiters that downstream programs will likely require. For Usearch we add “sample=” and “;”, for Qiime we add “_”.

#!/bin/bash
#this initiates the loop for all files with extension .fasta in the list generated by ls -l
for file in `ls -1 *.fasta`
do

#tell me which file you're working on
echo working on $file
#cut the appropriate string from the name of the file.
VAR1=`echo $file | cut -d "." -f 1`
#tell me what you've cut and called VAR1
echo VAR1 is $VAR1
#use VAR1 to add the sample name to the sequence id.  Substitute > for >VAR1_ in all files.  The variable (VAR1) needs to be in quotation marks to be passed from bash to perl, the perl script needs to be embedded in the bash script with the ` ` marks
`perl -pi -e 's/\>/\>sample='$VAR1';_ITS_/' $file;`
done

Concatenate all sequences into a single file

All of the files for each plate are concatenated into a single file zotu/ASV calling. The resulting file name is standardised to the format: plateID_all_ITSn.fasta.
Where ITSn represents the ITS region being analysed (ITS1 or ITS2)

B] Sequence analysis

Quality screening, zotu/ASV calling and sequence mapping

The first step removes sequences that have ambiguous bases, or have more than 12 homopolymers.

mothur "#set.dir(modifynames=F);summary.seqs(fasta=plateID_all_ITSn.fasta, processors=10); screen.seqs(fasta=current, maxambig=0, maxhomop=12, processors=10); summary.seqs()" 

Next reads are dereplicated

usearch64 -fastx_uniques plateID_all_ITSn.good.fasta -fastaout plateID_all_ITSn.good_uniques.fasta -sizeout

Sort unique reads by abundance

usearch64 -sortbysize plateID_all_ITSn.good_uniques.fasta -fastaout plateID_all_ITSn.good_sorted_uniques.fasta -sizeout 

ASV/Zotus are called by UNOISE3, from sequences that have => 4 representatives

usearch64 -unoise3 plateID_all_ITSn.good_sorted_uniques.fasta -zotus plateID_all_ITSn.good_sorted_uniques_zotus.fasta -ampout plateID_all_ITSn.good_sorted_uniques_ampout.fasta -tabbedout plateID_all_ITSn.good_sorted_uniques_unoise3.txt -minsize 4 

Map reads to zotus to generate abundances

Reads are mapped against the zotus using USEARCH. Note the termination conditions on the mapping run (-maxaccepts 0), this seems to be required to ensure the best match is found and to produce consistent results when adding multiple plates together, as we do later.

usearch64 -otutab plateID_all_ITSn.fasta -zotus plateID_all_ITSn.good_sorted_uniques_zotus_renamed.fasta -otutabout plateID_all_ITSn.good_sorted_uniques_zotutab_MA0.txt -mapout plateID_all_ITSn.good_sorted_uniques_zmap_MA0.txt -maxaccepts 0 -threads X  

Classify and remove flipped sequences

A final QC step is performed to remove likely erroneous sequences. The ASVs are classified, with those that do not align to the UNITE SH ITS database in the correct orientation being removed. As we know the that R1 sequences correctly orientated and the reverse complement of the R2 also puts it into the correct orientation, sequences that need to be “flipped” to a new orientation to obtain the best alignment to the database are most likely errors.

1. classify the seqs against UNITE ITS database

classify.seqs(fasta=${plateID}_all_ITSn.good_sorted_uniques_zotutab_relabelled_MA0.fasta, reference=UNITEv7_sh_dynamic_s.fasta, taxonomy=UNITEv7_sh_dynamic_s.tax, cutoff=60, probs=T) 

2. use the *acnos.flip list to remove sequences from the table

The above produces the final abundance table, with sequences as index for each sequencing run. These tables are then combined as below to produce a single dataset.

Replace table indexes with ASV sequence

Currently the tables have an arbitary OTU number as the index, replace this with the sequence that the arbitrary number represents. These sequences are unique strings and allow tables to be merged etc. downstream easily. They also negate the need to maintain a dictionary of ASVs and the sequences they represent.

C] Prepare the single dataset

Now we have an ASV abundance table for each plate, with ASV’s as row and SampleID_runID as column headers. To prepare this data for ingest into the AMI database the following steps are carried out:

 1. Each table is converted from short to long format (from rectangular to 3 column, with columns ASV, sampleID, Abundance)
2. All of these 3 column tables are concatenated into a single table
3. Controls and samples are split into separate tables
4. Sequencing run ID's are removed from the column headers and any sample sequenced more than once has the abundances from these runs summed to give a single abundance per sample
5. A fasta file is created from all ASV's in this final table
6. Sequences are classified.

Classify the sequences

Sequences are classified to provide taxonomies relative to the UNITE ITS database as below:

mothur "#classify.seqs(fasta=seqs_listSET.fasta, reference=UNITEv7_sh_dynamic_s.fasta, taxonomy=UNITEv7_sh_dynamic_s.tax, cutoff=60, probs=FALSE)" 

References

Bengtsson-Palme, J., Veldre V., Ryberg M., Hartmann M., Branco S., Wang Z., Godhe A., Bertrand Y., De Wit P., Sanchez M., Ebersberger I., Sanli K., de Souza F., Kristiansson E., Abarenkov K., Eriksson K.M, Nilsson R.H.(2013). ITSx: Improved software detection and extraction of ITS1 and ITS2 from ribosomal ITS sequences of fungi and other eukaryotes for use in environmental sequencing. Methods in Ecology and Evolution, 4: 914-919.
Schloss, P.D., et al., (2009). Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 75(23):7537-7541.
Edgar, R.C., (2010), Search and clustering orders of magnitude faster than BLAST, Bioinformatics 26(19): 2460-2461.
Seqtk available at: https://github.com/lh3/seqtk (last accessed 23 Jan 2019).
Fastx available at: (http://hannonlab.cshl.edu/fastx_toolkit/) (last accessed 23 Jan 2019).
Nilsson RH, Larsson K-H, Taylor AFS, Bengtsson-Palme J, Jeppesen TS, Schigel D, Kennedy P, Picard K, Glöckner FO, Tedersoo L, Saar I, Kõljalg U, Abarenkov K. (2018). The UNITE database for molecular identification of fungi: handling dark taxa and parallel taxonomic classifications. Nucleic Acids Research, DOI: 10.1093/nar/gky1022