This section outlines the workflow required to analyse 18S rRNA gene amplicon sequences to produce Amplicon Sequence Variant information for the Australian microbiome database.
This workflow covers both 18S variable region 4 (18Sv4) amplified by the 18S_V4f/18S_V4r primer set and 18S variable region 9 (18Sv9) amplified by the ILM_Euk_1391f/ILM_EukBr primer set.
Analysis is completed on a per sequencing run basis. The workflow consists of the following stages:
A] Sequence preparation and merging
1. Trim and merge paired end reads (non-merged reads are discarded)
2. Convert fastq file format to fasta file format
3. Add sampleID, runID and "sample=" information to the sequence headers
4. Concatenate all sequences per sequencing run into a single file
B] Sequence analysis
1. Quality screening, zotu/ASV calling and sequence mapping
2. Classify and remove flipped sequences
3. Replace arbitrary ASV ID's with the sequence itself in the table index
C] Following analysis, all data is combined to give a single dataset, by the following steps:
1. Merge tables into a single table
2. Remove controls from the abundance tables, to create separate samples and control datasets
3. Make a fasta from the abundance table
4. Classify Sequences
The following software is used in the steps below:
1. FLASH2 (Magoc and Salzberg, 2011)(for 18Sv9) or FLASH2 (Magoc and Salzberg, 2011)(for 18Sv4)
2. Mothur (Schloss, et al., 2009)
3. USEARCH (Edgar 2010)
4. Seqtk (https://github.com/lh3/seqtk)
5. Fastx (http://hannonlab.cshl.edu/fastx_toolkit/)
A] Sequence preparation and merging
Merge the paired end reads
For 18Sv4 amplicons, primer removal is performed using seqTk by hard triming 20 nucleotides from the 5′ end of R1 sequences and 21 nucleotides from the 5′ end of its respective R2 paired end read. Sequences are merged using FLASH2 (Magoc and Salzberg, 2011). FLASH2 is run with the following arguments,
--min-overlap=50 --max-overlap=160 --allow-outies
18Sv9 paired end reads are merged using FLASH2 with the following arguments
--min-overlap=50 --max-overlap=120 --allow-outies
Following merging, the merge quality is manually checked by examining the FLASH log file for the percentage of reads that were merged. Plates with low merge rates (< 70%) are manually checked to see if the alignments can be improved.
Unmerged reads are discarded.
File names are formatted to the following format:
The sampleID of the mock communities and negative controls on each plate are standardised to the following naming convention (all listed for completeness):
- 16S Bacteria mock communities: BACMOCK
- A16 Archael mock communities: ARCMOCK
- 18Sv4 Eukaryote mock communities: EUKMOCKv4
- 18Sv9 Eukaryote mock communities: EUKMOCKv9
- ITS Fungal mock communities: FUNMOCK
- Negative control: NEG
- STAN: STAN
fastq files are then converted to
fasta using seqtk
Add sample and file name to the sequence headers
Sample identifiers are added to the header of each sequence for downstream processing. As each fasta is now named
SampleID_plateID.fasta we simply add the file name with the extension to the sequence header. At this stage we also add any other information and delimiters that downstream programs will likely require. For
USEARCH we add “sample=” and “;”, for
Qiime we add “_”.
#this initiates the loop for all files with extension .fasta in the list generated by ls -l
for file in `ls -1 *.fasta`
#tell me which file you're working on
echo working on $file
#cut the appropriate string from the name of the file.
VAR1=`echo $file | cut -d "." -f 1`
#tell me what you've cut and called VAR1
echo VAR1 is $VAR1
#use VAR1 to add the sample name to the sequence id. Substitute > for >VAR1_ in all files. The variable (VAR1) needs to be in quotation marks to be passed from bash to perl, the perl script needs to be embedded in the bash script with the ` ` marks
`perl -pi -e 's/\>/\>sample='$VAR1';_18S_/' $file;`
Concatenate all sequences into a single file
All of the files for each plate are concatenated into a single file zotu/ASV calling. The resulting file name is standardised to the format:
Where 18SVn represents the 18S variable region being analysed (18SV4 or 18SV9)
B] Sequence analysis
Quality screening, zotu/ASV calling and sequence mapping
The first step performs some quality control on the sequences, with each ampplicon having different parameters:
For 18SV4, the first step removes sequences that are too short, have ambiguous bases, or have more than 12 homopolymers.
mothur "#set.dir(modifynames=F);summary.seqs(fasta=plateID_all_18SVn.fasta, processors=10); screen.seqs(fasta=current, minlength=300, maxambig=0, maxhomop=12, processors=10); summary.seqs()"
For 18SV9 the first step removes sequences have ambiguous bases, or have more than 12 homopolymers
mothur "#set.dir(modifynames=F);summary.seqs(fasta=plateID_all_18SVn.fasta, processors=10); screen.seqs(fasta=current, maxambig=0, maxhomop=12, processors=10); summary.seqs()"
Next reads are dereplicated
usearch64 -fastx_uniques plateID_all_16S.good.fasta -fastaout plateID_all_18SVn.good_uniques.fasta -sizeout
Sort unique reads by abundance
usearch64 -sortbysize plateID_all_18SVn.good_uniques.fasta -fastaout plateID_all_16S.good_sorted_uniques.fasta -sizeout
ASV/zotus are called by
UNOISE3, from sequences that have => 4 representatives
usearch64 -unoise3 plateID_all_18SVn.good_sorted_uniques.fasta -zotus plateID_all_18SVn.good_sorted_uniques_zotus.fasta -ampout plateID_all_18SVn.good_sorted_uniques_ampout.fasta -tabbedout plateID_all_18SVn.good_sorted_uniques_unoise3.txt -minsize 4
Map reads to zotus to generate abundances
Reads are mapped against the zotus using
USEARCH. Note the termination conditions on the mapping run (
-maxaccepts 0), this seems to be required to ensure the best match is found and to produce consistent results when adding multiple plates together, as we do later.
usearch64 -otutab plateID_all_18SVn.fasta -zotus plateID_all_18SVn.good_sorted_uniques_zotus_renamed.fasta -otutabout plateID_all_18SVn.good_sorted_uniques_zotutab_MA0.txt -mapout plateID_all_18SVn.good_sorted_uniques_zmap_MA0.txt -maxaccepts 0 -threads X
Classify and remove flipped sequences
A final QC step is perfomed to remove likely erroneous sequences. The ASVs are classified, with those that do not align to the 18S Silva database (Quast et al., 2013; Yilmaz et al 2014; Glöckner 2017) in the correct orientation being removed. Those that need to be “flipped” to a new orientation are likely errors, since we know the reads should be in correct orientation for their respective primer sets 18S_V4f/18S_V4r or ILM_Euk_1391f/ILM_EukBr.
1. classify the seqs against 18S database
"#set.dir(modifynames=F); classify.seqs(fasta=plateID_all_18SVn.good_sorted_uniques_zotutab_relabelled_MA0.fasta, reference=//OSM/CBR/OA_AMD/amd-work/DBs/GG_mothur/gg_13_8_99.fasta, taxonomy=//OSM/CBR/OA_AMD/amd-work/DBs/GG_mothur/gg_13_8_99.gg.tax, cutoff=60, probs=FALSE, processors=5)"
2. use the *acnos.flip list to remove sequences from the table
The above produces the final abundance table, with sequences as index for each sequencing run. These tables are then combined as below to produce a single dataset.
Replace table indexes with ASV sequence
Currently the tables have an arbitary OTU number as the index, replace this with the sequence that the arbitrary number represents. These sequences are unique strings and allow tables to be merged etc. downstream easily. They also negate the need to maintain a dictionary of ASVs and the sequences they represent.
C] Prepare the single dataset
Now we have an ASV abundance table for each plate, with ASV’s as row and SampleID_runID as column headers. To prepare this data for ingest into the AMD database the following steps are carried out:
1. Each table is converted from short to long format (from rectangular to 3 column, with columns ASV, sampleID, Abundance)
2. All of these 3 column tables are concatenated into a single table
3. Controls and samples are split into separate tables
4. Sequencing run ID's are removed from the column headers and any sample sequenced more than once has the abundances from these runs summed to give a single abundance per sample
5. A fasta file of unique ASV's is created from all ASV's in this final table
6. Sequences are classified.
Classify the sequences
Sequences are classified to provide taxonnomies relative to the Silva database (Quast et al., 2013; Yilmaz et al 2014; Glöckner 2017) as below:
mothur "#classify.seqs(fasta=seqs.fasta, reference=SILVA132.ng.fasta, taxonomy=SILVA132.tax, cutoff=60, probs=FALSE, processors=X)"
Magoc, T. and Salzberg, S. (2011). FLASH: Fast length adjustment of short reads to improve genome assemblies. Bioinformatics 27(21): 2957-2963.
Schloss, P.D., et al., (2009). Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 75(23):7537-7541.
Edgar, R.C., (2010), Search and clustering orders of magnitude faster than BLAST, Bioinformatics 26(19): 2460-2461.
Seqtk available at: https://github.com/lh3/seqtk (last accessed 23 Jan 2019).
Fastx available at: (http://hannonlab.cshl.edu/fastx_toolkit/) (last accessed 23 Jan 2019).
DeSantis T.Z., Hugenholtz P., Larsen N., Rojas M., Brodie E.L., Keller K., Huber T., Dalevi D., Hu P., Andersen G.L.(2006) Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible with ARB. Appl. Environ. Microbiol. 72(7): 5069-5072; DOI: 10.1128/AEM.03006-05
Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J, Glöckner FO (2013) The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucl. Acids Res. 41 (D1): D590-D596.
Yilmaz P, Parfrey LW, Yarza P, Gerken J, Pruesse E, Quast C, Schweer T, Peplies J, Ludwig W, Glöckner FO (2014) The SILVA and “All-species Living Tree Project (LTP)” taxonomic frameworks. Nucl. Acids Res. 42:D643-D648
Glöckner FO, Yilmaz P, Quast C, Gerken J, Beccati A, Ciuprina A, Bruns G, Yarza P, Peplies J, Westram R, Ludwig W (2017) 25 years of serving the community with ribosomal RNA gene reference databases and tools. J. Biotechnol.