Custom reference FASTA and BED files

Custom reference FASTA file:

A custom reference FASTA file containing one or more reference sequences is required to run the custom reference sequence analysis. Sequence names in the FASTA file must be unique and should not contain spaces. If a FASTA header contains a space, the part before the first space is taken as the sequence name. It is recommended to use only the following characters in sequence names: letters, numbers, underscore (_), hyphen (-), parentheses (( and )), and period (.). Otherwise, sequence names may appear differently in the output. An example custom reference FASTA file is provided in the link below.
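The naming rules above can be checked before upload. The following is a minimal sketch, not part of the DRAGEN app: the function names `sequence_names` and `check_names` are hypothetical, and the check assumes the recommended character set is exactly the one listed above.

```python
import re
from collections import Counter

# Recommended characters for sequence names: letters, digits,
# underscore, hyphen, parentheses, and period.
RECOMMENDED = re.compile(r"^[A-Za-z0-9_\-().]+$")


def sequence_names(fasta_text):
    """Return the sequence name of each header (text before the first space)."""
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            names.append(line[1:].split(" ", 1)[0])
    return names


def check_names(names):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    # Sequence names must be unique.
    for name, count in Counter(names).items():
        if count > 1:
            problems.append(f"duplicate sequence name: {name!r} ({count} times)")
    # Names outside the recommended set may appear differently in the output.
    for name in dict.fromkeys(names):
        if not RECOMMENDED.match(name):
            problems.append(f"name outside the recommended character set: {name!r}")
    return problems
```

Calling `check_names(sequence_names(text))` on the contents of a FASTA file returns one message per duplicate or non-recommended name.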

Example FASTA file formatting

To upload a custom reference FASTA file, go to the "Projects" tab and click on the folded paper icon (representing File) to reveal a dropdown menu. Click on "Upload" and select "Files". On the upload page, select the "Other" format for FASTA files, and upload the file as a Biosample. In the DRAGEN Microbial Enrichment Plus app, under "Custom panel specification", use the "Custom reference FASTA for consensus generation" control to select the uploaded FASTA file.

Custom reference BED file (optional):

Optionally, a custom reference BED file may also be provided. Sequence names must match between the FASTA file and BED file, and the same set of sequences must appear in both files. If the files include multiple viruses, each virus should have a unique name in the 4th column (genomeName). For example, multiple Influenza genomes should not be labeled with the same virus name.

The BED file controls how sequences are grouped and labeled in the output. If the custom reference FASTA file includes sequences from multiple segments of a viral genome, it is recommended to provide a BED file so that the segments are included under the results of that microorganism.

The BED file must be tab-delimited with at least 4 columns; a 5th column is optional:

  1. chrom: the sequence name as it appears in the FASTA

  2. chromStart: start position (always set to 0)

  3. chromEnd: end position (sequence length)

  4. genomeName: name of the genome, target, or microorganism the sequence belongs to (e.g. Monkeypox virus clade II)

  5. segmentName (optional): the name of the segment or gene (e.g. Segment 4 (HA)). Set to 'Full' if the sequence is the full genome
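Because chromStart is always 0 and chromEnd is the sequence length, BED lines can be derived from the FASTA file itself. The sketch below assumes the FASTA text is already loaded as a string; the helper names and the `labels` mapping (sequence name to genomeName and segmentName) are hypothetical and must be supplied by the user.

```python
def fasta_lengths(fasta_text):
    """Map each sequence name (header text before the first space) to its length."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split(" ", 1)[0]
            lengths[name] = 0
        elif name is not None:
            # Sequence data may span several lines; add them up.
            lengths[name] += len(line)
    return lengths


def bed_lines(lengths, labels):
    """Build tab-delimited BED lines with 5 columns.

    `labels` is a hypothetical mapping: name -> (genomeName, segmentName).
    """
    lines = []
    for name, length in lengths.items():
        genome, segment = labels[name]
        # chromStart is always 0; chromEnd is the sequence length.
        lines.append("\t".join([name, "0", str(length), genome, segment]))
    return lines
```

Joining the returned lines with newline characters gives the contents of a BED file in the column order listed above.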

Example custom reference BED file:

NC_012532.1	0	10794	Zika	Full
KJ609203.1	0	2292	Influenza A virus (A/Perth/16/2009(H3N2))	Segment 1 (PB2)
KJ609204.1	0	2304	Influenza A virus (A/Perth/16/2009(H3N2))	Segment 2 (PB1)
KJ609205.1	0	2168	Influenza A virus (A/Perth/16/2009(H3N2))	Segment 3 (PA+PA-X)
KJ609206.1	0	1727	Influenza A virus (A/Perth/16/2009(H3N2))	Segment 4 (HA)
KJ609207.1	0	1530	Influenza A virus (A/Perth/16/2009(H3N2))	Segment 5 (NP)
KJ609208.1	0	1441	Influenza A virus (A/Perth/16/2009(H3N2))	Segment 6 (NA)
KJ609209.1	0	1001	Influenza A virus (A/Perth/16/2009(H3N2))	Segment 7 (M1+M2)
KJ609210.1	0	866	Influenza A virus (A/Perth/16/2009(H3N2))	Segment 8 (NS1+NEP)
MK239128.1	0	2316	Influenza A virus (A/Iowa/38/2017(H3N2v))	Segment 1 (PB2)
MK239126.1	0	2316	Influenza A virus (A/Iowa/38/2017(H3N2v))	Segment 2 (PB1)
MK239124.1	0	2208	Influenza A virus (A/Iowa/38/2017(H3N2v))	Segment 3 (PA+PA-X)
MK239073.1	0	1737	Influenza A virus (A/Iowa/38/2017(H3N2v))	Segment 4 (HA)
MK239074.1	0	1540	Influenza A virus (A/Iowa/38/2017(H3N2v))	Segment 5 (NP)
MK239123.1	0	1441	Influenza A virus (A/Iowa/38/2017(H3N2v))	Segment 6 (NA)
MK239125.1	0	1002	Influenza A virus (A/Iowa/38/2017(H3N2v))	Segment 7 (M1+M2)
MK239127.1	0	865	Influenza A virus (A/Iowa/38/2017(H3N2v))	Segment 8 (NS1+NEP)
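A BED file such as the one above can be checked against the FASTA file before upload. This is a rough sketch under stated assumptions: `check_bed` is a hypothetical helper, and `lengths` is assumed to be a dictionary mapping each FASTA sequence name to its length, prepared separately.

```python
def check_bed(bed_text, lengths):
    """Return a list of problems found in a BED file.

    `lengths` maps each FASTA sequence name to its sequence length.
    """
    problems = []
    seen = set()
    for number, line in enumerate(bed_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        cols = line.split("\t")
        if len(cols) < 4:
            problems.append(
                f"line {number}: expected at least 4 tab-delimited columns, found {len(cols)}"
            )
            continue
        chrom, start, end = cols[0], cols[1], cols[2]
        seen.add(chrom)
        if start != "0":
            problems.append(f"line {number}: chromStart should be 0, found {start}")
        if chrom in lengths and end != str(lengths[chrom]):
            problems.append(
                f"line {number}: chromEnd {end} does not match sequence length {lengths[chrom]}"
            )
    # The same set of sequences must appear in both files.
    for name in sorted(set(lengths) - seen):
        problems.append(f"{name}: in FASTA but missing from BED")
    for name in sorted(seen - set(lengths)):
        problems.append(f"{name}: in BED but missing from FASTA")
    return problems
```

An empty result means the sketch found no mismatch. It does not check whether genomeName values are unique per virus, which still needs a manual review.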

To upload a custom reference BED file, go to the "Projects" tab and click on the folded paper icon (representing File) to reveal a dropdown menu. Click on "Upload" and select "Files". On the upload page, select the "Other" format for BED files, and upload the file as a Biosample. In the DRAGEN Microbial Enrichment Plus app, under "Custom panel specification", use the "Custom reference BED (optional)" dropdown to select the uploaded BED file.

Pangolin custom analysis behavior:

For Custom Panel analyses, Pangolin is enabled and will run on custom reference sequences with at least 3% coverage that meet these naming conventions:

  • If only a FASTA file is provided, Pangolin will run on sequences that have a header containing either SARS-CoV-2 or NC_045512

  • If both a FASTA and BED file are provided, Pangolin will run on sequences where the first column (chrom) contains NC_045512 or the fourth column (genomeName) contains SARS-CoV-2
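The two naming rules above can be expressed as simple substring checks, which may help predict whether Pangolin will run on a given sequence. This is only an illustrative sketch: the function names are hypothetical, the match is assumed to be case-sensitive, and the coverage value is assumed to be a percentage obtained from the analysis itself.

```python
MIN_COVERAGE_PERCENT = 3.0  # threshold stated in the text above


def qualifies_by_fasta(header):
    """FASTA only: the header contains SARS-CoV-2 or NC_045512."""
    return "SARS-CoV-2" in header or "NC_045512" in header


def qualifies_by_bed(chrom, genome_name):
    """FASTA and BED: chrom contains NC_045512 or genomeName contains SARS-CoV-2."""
    return "NC_045512" in chrom or "SARS-CoV-2" in genome_name


def meets_coverage(coverage_percent):
    """Pangolin runs only on sequences with at least 3% coverage."""
    return coverage_percent >= MIN_COVERAGE_PERCENT
```

Note that when a BED file is provided, a sequence whose accession is not NC_045512 qualifies only if its genomeName in the 4th column contains SARS-CoV-2.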

Nextclade custom analysis behavior:

For Custom Panel analyses, Nextclade is disabled and will not be run. Do not enable Nextclade.
