Reference Sequence Browser

Welcome to the Reference Sequence Browser

Read this page first if you want to make best use of our app!

This app was developed by the BlueWaltzBio team.

The Reference Sequence Browser (RSB) is an rShiny application that enables the eDNA community to screen and extract organism-specific data from genetic barcode databases such as NCBI Nucleotide, NCBI Genome, BOLD (Barcode of Life Data System), and publicly available CRUX databases. The app is best used before conducting metabarcoding to assess reference sequence availability, or to download non-duplicative FASTA files from BOLD and NCBI to build local sequence databases for qPCR development. Each database can be searched in its own tab in RSB and has its own quick user guide.

The tool is built to search for many organisms and genetic barcodes simultaneously and thus can be used in multiple ways to speed up environmental DNA workflows. Users only need to assemble a list of organisms' scientific names for the tool to search for reference sequences at known barcoding loci (e.g., COI, 16S, 18S, trnL). Only the NCBI Nucleotide search additionally requires barcode-gene names. To make inputting long lists into the app easier, download this CSV template and fill it out with your organism and barcode names. This template can be uploaded within any database tab.
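
As a rough illustration of the kind of input list described above, the following Python sketch reads organism and barcode names from CSV text. The column headers "organism" and "barcode" are hypothetical; the actual template may use different headers, so treat this as a sketch of the idea rather than the app's own parser.

```python
import csv
import io

def read_template(text):
    """Return (organisms, barcodes) from CSV text, skipping blanks and duplicates."""
    organisms, barcodes = [], []
    for row in csv.DictReader(io.StringIO(text)):
        # "organism" and "barcode" are assumed column names, not the real template's.
        org = (row.get("organism") or "").strip()
        bar = (row.get("barcode") or "").strip()
        if org and org not in organisms:
            organisms.append(org)
        if bar and bar not in barcodes:
            barcodes.append(bar)
    return organisms, barcodes

example = "organism,barcode\nCanis lupus,COI\nPisaster giganteus,16S\nCanis lupus,\n"
organisms, barcodes = read_template(example)
```

The two lists do not need to be the same length, since every organism is paired with every barcode in the search.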

The app uses the R packages “Rentrez” and “Bold” to access the live, up-to-date NCBI and BOLD databases. RSB retrieves and displays the number of reference sequences in one of the aforementioned databases for each combination of an organism AND a barcode. For example, RSB searches for records of the COI barcode gene of Canis lupus. Every database tab therefore produces a Coverage Matrix (CM) with organism names as rows and barcoding loci as columns, with cells that display the number of reference sequences retrieved. Each tab also includes summary statistics and tab-specific visualizations and tables to help make sense of searches across numerous species and barcodes.
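
The Coverage Matrix idea can be sketched in a few lines of Python. This is a minimal illustration, not the app's implementation: `count_records` stands in for whatever database query is used, and the toy counts are invented purely for the example.

```python
def coverage_matrix(organisms, barcodes, count_records):
    """Build {organism: {barcode: count}} by querying every organism-barcode pair."""
    return {
        org: {bar: count_records(org, bar) for bar in barcodes}
        for org in organisms
    }

# Invented counts, standing in for a real database lookup.
toy_counts = {("Canis lupus", "COI"): 12, ("Canis lupus", "16S"): 3}

def toy_lookup(organism, barcode):
    """Hypothetical lookup: return the stored count, or 0 when nothing is found."""
    return toy_counts.get((organism, barcode), 0)

matrix = coverage_matrix(
    ["Canis lupus", "Pisaster giganteus"], ["COI", "16S"], toy_lookup
)
```

Rows correspond to organisms and columns to barcodes, so `matrix["Canis lupus"]["COI"]` is a single cell of the CM.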

In addition to previewing which sequences are available in the CM tables, users can download the sequences in FASTA format (excluding CRUX), which can then easily be imported into various genomics software packages (e.g., Geneious). The RSB BOLD database search allows users to exclude entries that are also in NCBI Nucleotide from the BOLD CM results, visualizations, and FASTA downloads. This lets users avoid downloading duplicate sequences from BOLD and NCBI Nucleotide.

Use cases:

RSB can be used to improve the workflow of species-specific qPCR development for eDNA applications (Klymus 1). If you are interested in developing a species-specific qPCR primer, RSB can be used to rapidly create a non-duplicative local sequence database by downloading FASTA files from the NCBI and BOLD tabs.

RSB can also be used to determine which organisms can be detected by metabarcoding, and to what extent, and which metabarcodes fit your study needs. This can be determined by searching the seven metabarcoding databases (16S, Vertebrate 12S, 18S, Plant ITS1, CO1, Fungal ITS2, and trnL) in the CRUX tab, or by searching in the BOLD tab.

Additionally, if you are interested in finding full mitochondrial or chloroplast genomes in NCBI Nucleotide, or entries in NCBI Genome, go to the full genome tab. Guides to the aforementioned processes can be found lower down on this page.

Lastly, we hope this tool may be used to point to taxonomic groups lacking publicly available reference sequences and thus aid in creating more deliberate and specific sequencing efforts.

This rShiny app was built in part to bridge the gap between eDNA scientists and large genomics databases by providing efficient and high throughput access to NCBI, BOLD, and CRUX databases without the user having to write a single line of code. Click on one of the tabs to get started.

Welcome to the CRUX Metabarcoding Pipeline

The CRUX pipeline of RSB takes in a list of organism(s) and searches through the seven publicly available CALeDNA CRUX metabarcode databases to find how many records match the search. RSB searches a copy of these databases that is updated periodically; the last update was in October 2019. When direct matches are not found in a database, the tool searches for progressively higher taxonomic ranks (genus, family, order, class, phylum, domain), via the R package “Taxize”, until a match is found. For example, if the giant sea star (Pisaster giganteus) isn’t found in the CO1 database, the app will search for the genus Pisaster, then the family Asteriidae, and so forth.

Users are given the choice to use the package “Taxize” to append synonyms and correct spelling mistakes of organism names. The tool then showcases a Coverage Matrix (CM), showing the reference sequence abundance or taxonomic resolution for each marker/gene per organism, and a statistical summary of the CM.

Choose CSV file to upload

Browse...

Organism Names

A comma-separated list of the names of your organism(s) of interest. All taxonomic ranks (family, genus, genus-species, etc.) are searchable.

Append organism name synonyms and spelling corrections via the R Package “Taxize”

Summary of Search Results

For each barcode, we display the total number of database entries found, the percentage of all entries found that each marker/gene accounts for, and the number of organisms with at least one sequence and with no sequences found.
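
The per-barcode summary described above could be computed along these lines. This Python sketch assumes a purely numeric coverage matrix stored as a nested dictionary; the function name, the output keys, and the toy values are all illustrative and not taken from the app.

```python
def summarize(matrix):
    """Per-barcode totals, share of all entries, and organism hit/miss counts."""
    barcodes = list(next(iter(matrix.values())).keys()) if matrix else []
    totals = {b: sum(row[b] for row in matrix.values()) for b in barcodes}
    grand_total = sum(totals.values())
    summary = {}
    for b in barcodes:
        with_seqs = sum(1 for row in matrix.values() if row[b] > 0)
        summary[b] = {
            "total": totals[b],
            # Share of all entries found, across every barcode, in percent.
            "percent_of_total": 100.0 * totals[b] / grand_total if grand_total else 0.0,
            "organisms_with_sequences": with_seqs,
            "organisms_without_sequences": len(matrix) - with_seqs,
        }
    return summary

# Invented counts for two placeholder organisms.
toy_matrix = {
    "Organism A": {"COI": 6, "16S": 2},
    "Organism B": {"COI": 0, "16S": 2},
}
summary = summarize(toy_matrix)
```

Cells that hold a taxonomic rank instead of a number, as the CRUX matrix can, would need to be handled separately before summing.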

Download summary data

Instances of an Organism Found in Each CRUX Metabarcoding Database

Download table

Find CRUX database coverage of your organisms of interest

Introduction and the role of reference databases:

The ‘CRUX Coverage Matrix’ tool identifies how well sets of organisms are represented in the public CALeDNA reference databases. This tool may be used by either scientists interested in using the public CALeDNA reference database or parties interested in working with CALeDNA scientists for conducting a study. The site displays the taxonomic resolution and reference sequence abundance available within the CALeDNA databases for different organisms, which informs the user how well the databases may fulfill the needs of their study.

The databases were generated using the CRUX Pipeline, part of the Anacapa Toolkit (Curd et al., 2019 in MEE). The full databases and further documentation can be found here: https://ucedna.com/reference-databases-for-metabarcoding.

Reference databases allow for the taxonomic assignment of metagenomic sequences. DNA sequences are taxonomically identified by matching them to previously identified reference sequences. Large swaths of organisms and taxonomic groups have yet to be DNA barcoded, and thus cannot be detected using metagenomic sequencing. However, organizations like BOLD and the Smithsonian are currently working to fill these holes in our reference libraries.

You can find the public CRUX databases at this link: https://ucedna.com/reference-databases-for-metabarcoding

What does the CRUX Coverage Matrix do?

The ‘CRUX Coverage Matrix’ returns a value that represents how many reference sequences exist for the user’s organism search term(s) in each public database. When direct matches are not found in a database, the tool will instead search at progressively higher (less specific) taxonomic ranks until a match is found. When a metabarcoding study is being performed, it is critical to confirm that reference sequences exist for the organisms of interest and to obtain them, and to know the taxonomic resolution of those sequences and which metabarcoding loci they belong to. (See additional information for more details.) CRUX databases are designed to be shared, and this tool allows users to assess whether the public CRUX databases meet their study’s taxonomic requirements.

The ‘CRUX Coverage Matrix’ searches by taxonomic ranks: domain, phylum, class, order, family, genus, genus-spp. The rows of the table produced are the organism search terms, and the columns are the CRUX databases: 16S, Vertebrate 12S, 18S, PITS (Plant ITS1), CO1, FITS (Fungal ITS2), and trnL.

The cells will show one of the following:

1) The number of sequences in a database, if direct matches are found
2) If no direct matches are found, the next most specific taxonomic rank found
3) “0” if nothing is found at any taxonomic rank.
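
The three cell outcomes above can be sketched as a small fallback routine. This is an illustrative Python sketch under stated assumptions: the lineage is supplied as an ordered list starting with the search term itself (in the app this lookup is attributed to “Taxize”), and the database is a toy dictionary with invented counts.

```python
def crux_cell(lineage, database):
    """Return the cell text for one organism in one database.

    lineage:  ordered (rank, name) pairs, search term first, then higher ranks.
    database: {taxon name: number of reference sequences} (toy stand-in).
    """
    for position, (rank, name) in enumerate(lineage):
        if database.get(name, 0) > 0:
            # A direct match shows the count; a fallback match shows the rank found.
            return str(database[name]) if position == 0 else rank
    return "0"

lineage = [
    ("species", "Pisaster giganteus"),
    ("genus", "Pisaster"),
    ("family", "Asteriidae"),
]

direct = crux_cell(lineage, {"Pisaster giganteus": 2})
fallback = crux_cell(lineage, {"Pisaster": 4, "Asteriidae": 9})
missing = crux_cell(lineage, {})
```

Because the loop stops at the first hit, the rank reported is always the most specific one available.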

CRUX User Guide

Welcome to the NCBI Nucleotide Pipeline

The NCBI Nucleotide pipeline of RSB takes in a list of organism(s) and barcode-gene(s) of interest and then directly queries the Nucleotide database using the “Rentrez” package to find how many records match the search. To further tailor the search, users are able to:

Set minimum and maximum sequence length parameters
Use the package “Taxize” to append synonyms and correct spelling mistakes in organism names
Choose to search either within all fields or specifically in the organism and/or gene metadata fields of NCBI data

The tool then showcases a Coverage Matrix (CM), described on the RSB home page, a statistical summary of the CM, and the search statements used to query NCBI Nucleotide. Users can download the displayed results, as well as FASTA and GenBank files up to a preset limit. The options and displays are described in greater detail in the relevant portions of the pipeline.
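
To make the search options above concrete, here is a hedged Python sketch of how an Entrez-style search statement might be assembled from them. The [ORGN], [GENE], and [SLEN] field tags are standard Entrez syntax, but the function, its parameters, and the exact way the app combines terms are assumptions; the statements RSB actually sends are the ones shown in its search terms table.

```python
def build_query(organism, barcode, use_orgn=True, use_gene=True,
                min_len=None, max_len=None):
    """Assemble one Entrez-style search statement for an organism-barcode pair."""
    parts = [
        # Without a field tag, Entrez searches the term across all fields.
        f"{organism}[ORGN]" if use_orgn else organism,
        f"{barcode}[GENE]" if use_gene else barcode,
    ]
    if min_len is not None and max_len is not None:
        # Sequence length range, written as min:max with the SLEN tag.
        parts.append(f"{min_len}:{max_len}[SLEN]")
    return " AND ".join(parts)

query = build_query("Canis lupus", "COI", min_len=100, max_len=2000)
all_fields_query = build_query("Canis lupus", "COI", use_orgn=False, use_gene=False)
```

A statement built this way can be pasted into the NCBI Nucleotide search box to compare record counts by hand.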

Choose CSV file to upload

Browse...

Organism Names

A comma-separated list of the names of your organism(s) of interest. All taxonomic ranks (family, genus, genus-species, etc.) are searchable.

Append organism name synonyms and spelling corrections via the R Package “Taxize”

Search by the [ORGN] Metadata field

Barcodes of Interest

A comma-separated list of the genes you want to search. Common genes used as organism barcodes include CO1, 16S, and 18S.

Search by the [GENE] Metadata field

Set minimum sequence lengths (by marker)

Will you be downloading sequences?

If you are interested in downloading FASTA or GenBank files from the results, you must pre-specify the number of sequences to download per organism-barcode combination.

If you will not be downloading any sequences, you can simply move on to the next step.

Number of Sequences to Download per Cell:

Summary of Search Results

Download summary data

Number of Sequences Found Per Organism-Barcode Pairing

Download counts table Download FASTA files Download Genbank files

An Example of the Search Queries Sent to NCBI

Below we show an example of some of the queries we are sending to NCBI Nucleotide. If you wish to check the validity of our results, you can go to the NCBI website and paste these queries into their online search tool and see if you get the same results. If you want to see all of the queries used in this search, you can download them using the button located below.

Download the full search terms table

Find NCBI records of your organisms and barcodes of interest

What does the tool do?

The ‘NCBI Nucleotide Coverage Matrix’ was designed to screen the Nucleotide database for genetic barcode coverage prior to environmental DNA metabarcoding studies. Before conducting a metabarcoding study, scientists need to be aware of which organisms have reference sequences at known genetic barcoding loci. The tool finds out if the Nucleotide database contains sequences labeled with a specific gene and organism name. Numerous searches can be done in parallel instead of manually searching for each organism-gene combination on the NCBI Nucleotide website.

The ‘NCBI Nucleotide Coverage Matrix’ tool takes in a list of organisms and genes of interest and then queries the Nucleotide database to find how many records match the search. The tool then produces a table where the organism names are rows, gene names are columns, and each intersection of a row and column shows how many matching records are in the NCBI Nucleotide database. All of the search options are detailed in the ‘Search fields’ section below. The power and flexibility of this tool allow scientists to check the NCBI Nucleotide database for genetic coverage in ways that aren’t possible without knowledge of the NCBI “Entrez” coding package.

User Guide

Click this link to read the user guide

Limitations:

This tool may not find all possible entries that the user desires. Some limitations of this text based search include, but are not limited to:

1) Alternative names for the listed gene in the NCBI Nucleotide database
2) Incorrect or missing metadata
3) Full-genome entries with unlabeled individual genes

Searching by BLAST or in silico PCR is a more empirical way of identifying quality reference sequences, but neither is implemented here. In addition, this tool only searches the NCBI Nucleotide database and thus doesn’t include other genetic databases such as BOLD, SILVA, FishBase, etc.

Welcome to the BOLD Metabarcoding Pipeline

The BOLD (Barcode of Life) pipeline of the Reference Sequence Browser (RSB) is designed to screen the BOLD database for genetic barcode coverage using a list of scientific names provided by the user. Once the list is entered, the application searches the most up-to-date version of the BOLD database using the BOLD Systems API package for R.

The tool lets users refine the results by filtering them by country of origin and by optionally removing entries that are also present in NCBI. All subsequent tables and visualizations are based on this filtering, which can be changed by returning to the appropriate tabs at any time.
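
The two filters described above can be sketched as follows in Python. This is an illustration only: the record keys "country" and "genbank_accession" are assumed field names, and treating a non-empty GenBank accession as "also present in NCBI" is an assumption about how such a check could work, not a description of the app's logic. The accession string in the toy data is a placeholder.

```python
def filter_bold(records, countries=None, drop_ncbi=False):
    """Keep records from the chosen countries, optionally dropping NCBI duplicates."""
    kept = []
    for rec in records:
        if countries and rec.get("country") not in countries:
            continue
        # Assumption: a non-empty accession means the entry also exists in NCBI.
        if drop_ncbi and rec.get("genbank_accession"):
            continue
        kept.append(rec)
    return kept

toy_records = [
    {"id": 1, "country": "Canada", "genbank_accession": "PLACEHOLDER1"},
    {"id": 2, "country": "Canada", "genbank_accession": ""},
    {"id": 3, "country": "Mexico", "genbank_accession": ""},
]

canada_only = filter_bold(toy_records, countries={"Canada"})
bold_only = filter_bold(toy_records, drop_ncbi=True)
both = filter_bold(toy_records, countries={"Canada"}, drop_ncbi=True)
```

Leaving `countries` empty keeps every country, which mirrors a search with no country filter selected.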

Like the other database pipelines, the app produces a CM where the organism names are rows, gene names are columns, and each intersection of a row and column shows how many records are in the BOLD database. The app also creates a statistical summary of the CM and users may download the associated FASTA files. Afterwards, a variety of tables and graphs are presented to help users analyze the data and fine tune their filters.

Choose CSV file to upload

Browse...

Organism Names

A comma-separated list of the names of your organism(s) of interest. All taxonomic ranks (family, genus, genus-species, etc.) are searchable.

Append organism name synonyms and spelling corrections via the R Package “Taxize”

Species not found in BOLD database

List of species that were not found when searching the BOLD database. Species that do not appear in this table have results in the BOLD database. The results in this table are not affected by the Country or NCBI filters.

NCBI Entries Filter

If the box is checked all entries also found in NCBI will be removed.

If left unchecked, the NCBI entry removal filter won't be applied.

Please check the box to the left to remove all entries also present in NCBI

Summary of Search Results

Download Summary Data Table

Sequences Found for each Species Classified by Barcode

Total number of entries found for each species classified into the different barcodes found.

The different barcodes found in BOLD are ordered left to right from most to least results
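
A table like the one described above, with barcode columns ordered from most to least results, could be built from individual records roughly as follows. This Python sketch uses hypothetical record keys ("species", "marker") and invented toy records with placeholder species names; it is not the app's code.

```python
from collections import Counter

def bold_matrix(records):
    """Count records per species and marker, ordering markers by total results."""
    cell_counts = Counter((r["species"], r["marker"]) for r in records)
    marker_totals = Counter(r["marker"] for r in records)
    # most_common() sorts markers from most to least results overall.
    columns = [marker for marker, _ in marker_totals.most_common()]
    species = sorted({r["species"] for r in records})
    rows = {s: [cell_counts[(s, m)] for m in columns] for s in species}
    return columns, rows

toy_records = [
    {"species": "Species A", "marker": "COI-5P"},
    {"species": "Species A", "marker": "COI-5P"},
    {"species": "Species B", "marker": "COI-5P"},
    {"species": "Species B", "marker": "matK"},
]
columns, rows = bold_matrix(toy_records)
```

Unlike the NCBI search, the column set here comes from whatever markers appear in the returned records rather than from a user-supplied list.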

Download Fasta Files Download Counts Table

Total Number of Barcodes Found by Country for Each Unique Species

Download entries per country table

Suggested Countries to Add to Your Filter

For species with no barcodes found in the filtered country or countries, we provide the top 3 unselected countries with the most sequence results.
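
The suggestion logic described above can be sketched as a simple ranking. In this Python illustration the record keys ("species", "country"), the function name, and the toy data with placeholder species names are all assumptions made for the example.

```python
from collections import Counter

def suggest_countries(records, selected, top_n=3):
    """For species with no hits in the selected countries, rank the other countries."""
    by_species = {}
    for rec in records:
        by_species.setdefault(rec["species"], Counter())[rec["country"]] += 1
    suggestions = {}
    for species, counts in by_species.items():
        if any(counts[c] > 0 for c in selected):
            continue  # already covered by the current country filter
        suggestions[species] = [c for c, _ in counts.most_common(top_n)]
    return suggestions

toy_records = (
    [{"species": "Species X", "country": "Chile"}] * 4
    + [{"species": "Species X", "country": "Canada"}] * 3
    + [{"species": "Species X", "country": "Mexico"}] * 2
    + [{"species": "Species X", "country": "Peru"}]
    + [{"species": "Species Y", "country": "United States"}]
)
suggestions = suggest_countries(toy_records, selected=["United States"])
```

Species that already have results in a selected country are skipped, so only the gaps in the current filter produce suggestions.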

Download Graph

Bold UIDs for Entries With Unlabeled Barcodes

For those entries in BOLD that have unlabeled barcodes, we provide their respective BOLD UIDs so that scientists may explore these results further.

Download NA Barcodes table

Contact us

This app was developed by the BlueWaltzBio team.

Learn more about us on our website, BlueWaltzBio.com, or message us directly on Twitter @BlueWaltzBio.

If you want to provide feedback please use the google link: https://forms.gle/ysT6g8sk1zxWQ1wZA

The app was iteratively built with direct feedback from eDNA scientists who spoke with members of BlueWaltzBio. In total our team has interviewed over 70 lab techs, professors, government regulators, and investors in the field of Environmental DNA.
