Special considerations for amplicon detection

Amplicon sequencing, especially of RNA viruses, requires additional bioinformatics processing to ensure maximum quality of the resulting data.

In RT-PCR, a reverse transcriptase enzyme first generates cDNA molecules using the RNA molecules in the sample as templates; a DNA polymerase enzyme then amplifies the cDNA sequences during PCR. These amplified cDNA sequences are then further processed to generate the sequencing libraries. Either enzyme can introduce an incorrect base into a sequence, generating a position where the resulting sequence does not match the sequence in the sample, that is, an error.

Reverse transcriptases exhibit error rates that are multiple orders of magnitude higher than those of DNA polymerases.

When large numbers of nucleic acid molecules are present in a reaction, these individual misincorporation errors are largely uncorrelated and appear at very low frequencies, so that they are typically ignored by variant callers.

However, when there is only a small number of incoming nucleic acid molecules, as in a low-titer sample, an error that occurs during the RT step or early in PCR can, as a result of sampling noise, be amplified to high frequencies in the resulting sequencing libraries. The variant caller may treat this error as a sequence variant, since it is a true sequence variant in the context of the library provided to the instrument. As a result, these artifactual sequence variants often have high allele frequencies and quality scores, which makes them very difficult to distinguish from true variants, and they can appear in the final consensus sequence. While less common, it is also possible for a true sequence variant to have its allele frequency depressed by this same process (if the error results in a reversion to the reference sequence).
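The effect of template count can be illustrated with a back-of-the-envelope calculation. The sketch below is not part of the app; it assumes an idealized PCR in which every molecule doubles in every cycle, and it ignores the stochastic amplification efficiency that produces the sampling noise described above. The function name and parameters are hypothetical.

```python
def error_allele_frequency(n_templates: int, pcr_cycle: int = 0) -> float:
    """Final frequency of one misincorporation under idealized perfect doubling.

    pcr_cycle = 0 models an error made during the RT step: one of the
    n_templates cDNA molecules carries the wrong base.
    pcr_cycle = c (c >= 1) models an error made on one newly synthesized
    copy during PCR cycle c, when the pool holds n_templates * 2**c molecules.
    With perfect doubling, the frequency stays constant after the error occurs.
    """
    if n_templates < 1:
        raise ValueError("n_templates must be at least 1")
    if pcr_cycle < 0:
        raise ValueError("pcr_cycle must be non-negative")
    return 1.0 / (n_templates * 2 ** pcr_cycle)


if __name__ == "__main__":
    # Compare a high-titer and a low-titer sample for errors at different steps.
    for n in (1_000_000, 5):
        for cycle in (0, 2):
            freq = error_allele_frequency(n, cycle)
            print(f"templates={n:>9}, error at cycle {cycle}: frequency={freq:.2e}")
```

Under these assumptions, a single RT-step error among a million templates ends up at a frequency of one in a million, while the same error among five templates ends up at 20%, well within the range a variant caller would report.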

Because enzyme-introduced false variants are difficult to identify after the fact, we instead take a preemptive approach to ensure data quality: before variant calling and consensus sequence generation, we determine whether sufficient sample material is present.

Specifically, the app calculates the number of amplicons with at least 1x coverage for at least 90% of the non-overlapping portion of the amplicon sequence. The 1x coverage threshold used here is fixed and is independent of the minimum read coverage depth for consensus sequence generation, which defaults to 10x. The number of amplicons that meet this threshold is then divided by the total number of amplicons expected in the experiment, which is the number of amplicons located in the reference sequences selected for short read alignment. If the resulting percentage is at least 80%, the sample is considered to have sufficient material for accurate variant calling. If it is below this threshold, the sample is not processed further, to avoid spurious variant calls. The user can override the 80% threshold using the "Minimum percentage of amplicons with at least 90% coverage ≥ 1x to enable variant calling and consensus sequence generation" control in the "Advanced Workflow Settings" section.
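The check described above can be sketched in a few lines. This is an illustrative sketch only, not the app's actual implementation: the function name, parameter names, and the input layout (one list of per-base read depths for the non-overlapping portion of each expected amplicon) are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def has_sufficient_material(
    amplicon_depths: Dict[str, List[int]],
    min_depth: int = 1,                  # fixed 1x coverage threshold
    min_percent_covered: float = 90.0,   # per-amplicon coverage requirement
    min_percent_passing: float = 80.0,   # user-overridable sample threshold
) -> Tuple[float, bool]:
    """Return (percent of amplicons passing, whether the sample proceeds).

    amplicon_depths maps each expected amplicon (hypothetical names) to the
    per-base read depths over the non-overlapping portion of that amplicon.
    """
    if not amplicon_depths:
        raise ValueError("no expected amplicons supplied")

    passing = 0
    for depths in amplicon_depths.values():
        if not depths:
            continue  # an empty region cannot meet the coverage requirement
        covered = sum(1 for d in depths if d >= min_depth)
        # Compare as percentages without dividing, to avoid rounding surprises.
        if 100 * covered >= min_percent_covered * len(depths):
            passing += 1

    percent_passing = 100.0 * passing / len(amplicon_depths)
    return percent_passing, percent_passing >= min_percent_passing


if __name__ == "__main__":
    # Ten expected amplicons of 100 bases each: eight well covered, two not.
    sample = {f"amp{i}": [3] * 95 + [0] * 5 for i in range(8)}
    sample.update({f"amp{i}": [1] * 50 + [0] * 50 for i in range(8, 10)})
    print(has_sufficient_material(sample))
```

In this example, eight of the ten expected amplicons have at least 1x coverage over at least 90% of their non-overlapping bases, so the sample sits exactly at the default 80% threshold and would proceed to variant calling.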

The threshold above was determined through data analysis using an experimentally determined threshold corresponding to the minimum concentration needed to produce reliable variant calls. We assumed that higher nucleic acid concentrations lead to a higher probability of amplifying each amplicon.
