Data Naming Convention | ggKbase Knowledge Base

Abstract

All the data in ggKbase – projects, organisms, contigs, and features – must have unique names so that every entity can be queried by its name without the need to provide a search scope in the query.

History

Historically, the project was the only kind of data that ggKbase required to have a unique name, because it sits at the highest organizing level, above the lower level data such as organisms, contigs, and features. Note: User permissions are set at the project level.

When lower level data names are not unique, every search for lower level data has to be scoped to a project. This poses the following problems:

The project scope makes the search query more complicated and slows down the search significantly.
When the scope is not provided, such as in the site-wide search box, the search can return multiple organisms, contigs, or features with the same name, which is confusing.
Users may not know the name of the project under which an organism, contig, or feature resides.

Make Data Name Unique

Because of these problems, we have undertaken the following steps to make sure that all the data names are unique.

Naming Scheme

Binning Projects with REGULAR organisms

Project

The project name should be descriptive so that users can understand what the project is about. The project name must be unique, which results in a unique project slug. The project slug is constructed by joining the words of the project name with underscores (_); in the example below, the slug is also lowercased.
Scheme: Project Name (Slug: project_name)
Example: JSantini GMIN Biofilm (jsantini_gmin_biofilm)
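
The slug construction described above can be sketched in a few lines of Python. This is only an illustration, not ggKbase code: the function name project_slug is hypothetical, and the lowercasing and whitespace splitting are inferred from the single example given here. How punctuation in a project name should be handled is not specified, so the sketch leaves it untouched.

```python
def project_slug(project_name: str) -> str:
    """Join the words of a project name with underscores and lowercase the result.

    Hypothetical helper: the rules are inferred from the example
    "JSantini GMIN Biofilm" -> "jsantini_gmin_biofilm".
    """
    # split() with no arguments splits on any run of whitespace and
    # ignores leading and trailing whitespace.
    words = project_name.split()
    return "_".join(words).lower()


slug = project_slug("JSantini GMIN Biofilm")
print(slug)  # jsantini_gmin_biofilm
```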

Organism

The organism name is made up of the project slug, the taxonomy, the GC content, and the coverage.
Scheme: Project_Slug_Taxonomy_gc#_cov#
Example: JSantini_GMIN_Biofilm_Gallionellales_54_217
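
As a rough sketch of this scheme, the organism name can be assembled from its four parts. The helper names below are hypothetical. Two details are assumptions read off the example rather than stated rules: the prefix keeps the capitalization of the project name, and GC and coverage are whole numbers.

```python
def name_prefix(project_name: str) -> str:
    """Project name with its words joined by underscores, capitalization kept.

    Assumption: the example organism name keeps the original capitalization
    ("JSantini_GMIN_Biofilm"), so nothing is lowercased here.
    """
    return "_".join(project_name.split())


def organism_name(project_name: str, taxonomy: str, gc: int, coverage: int) -> str:
    """Build an organism name following Project_Slug_Taxonomy_gc#_cov#."""
    return f"{name_prefix(project_name)}_{taxonomy}_{gc}_{coverage}"


organism = organism_name("JSantini GMIN Biofilm", "Gallionellales", 54, 217)
print(organism)  # JSantini_GMIN_Biofilm_Gallionellales_54_217
```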

Contig – The contig name is made up of the project slug followed by a sequence number.
Scheme: Project_Slug_#
Example: JSantini_GMIN_Biofilm_1

Feature – The feature name appends an additional sequence number to the end of its contig name.
Scheme: Project_Slug_#_#
Example: JSantini_GMIN_Biofilm_1_1
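
The contig and feature schemes nest, which the following sketch tries to make explicit. The function names are hypothetical, and the prefix is passed in as a ready-made string. One consequence of the scheme, assuming feature names always end in a single numeric field, is that the parent contig can be recovered from a feature name by dropping the last underscore-separated field.

```python
def contig_name(project_prefix: str, contig_number: int) -> str:
    """Build a contig name following Project_Slug_#."""
    return f"{project_prefix}_{contig_number}"


def feature_name(contig: str, feature_number: int) -> str:
    """Build a feature name following Project_Slug_#_# from its contig name."""
    return f"{contig}_{feature_number}"


def parent_contig(feature: str) -> str:
    """Recover the contig name by removing the trailing feature number."""
    # rsplit with maxsplit=1 cuts only at the last underscore.
    return feature.rsplit("_", 1)[0]


contig = contig_name("JSantini_GMIN_Biofilm", 1)
feature = feature_name(contig, 1)
print(contig)                  # JSantini_GMIN_Biofilm_1
print(feature)                 # JSantini_GMIN_Biofilm_1_1
print(parent_contig(feature))  # JSantini_GMIN_Biofilm_1
```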

Projects with CURATED organisms

Sometimes, an organism binned out by ggKbase goes through a curation process to refine its content. The curated organism is then reimported into the system and associated with the original organism as the better version of it. To indicate that such an organism is curated, "_curated" is added to the name of the organism and of its lower level data, as follows:

Organism – The organism name is prefixed with the project slug and suffixed with _curated.
Ex: Project_Slug_organism_name_gc#_cov#_curated

Contig – The project slug followed by a sequence number, suffixed with _curated.
Ex: Project_Slug_#_curated

Feature – The contig name with an additional sequence number placed before the _curated suffix.
Ex: Project_Slug_#_#_curated
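
In all three curated schemes the suffix comes last, so marking and detecting curated data reduces to a suffix operation. The sketch below uses hypothetical helper names. Whether curated contigs and features reuse the sequence numbers of the original data is not stated here, so the example makes no claim about that.

```python
CURATED_SUFFIX = "_curated"


def mark_curated(name: str) -> str:
    """Append the curated suffix to an organism, contig, or feature name."""
    # Avoid doubling the suffix if the name is already marked.
    if name.endswith(CURATED_SUFFIX):
        return name
    return name + CURATED_SUFFIX


def is_curated(name: str) -> bool:
    """Tell whether a name carries the curated suffix."""
    return name.endswith(CURATED_SUFFIX)


curated_feature = mark_curated("JSantini_GMIN_Biofilm_1_1")
print(curated_feature)                          # JSantini_GMIN_Biofilm_1_1_curated
print(is_curated(curated_feature))              # True
print(is_curated("JSantini_GMIN_Biofilm_1_1"))  # False
```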

Fix the Existing Data

Rename Data with Duplicate Names

We need to start by making all the existing data names unique. The first step is to find the duplicate names at every level of the data structure. With the knowledge and permission of the project owners, the affected data will then be renamed, and the search index on the Elasticsearch server will be updated to match.
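
The duplicate search can be pictured with a small sketch. It assumes the names have been exported as (level, project slug, name) records, and it checks uniqueness within each level; both are assumptions made for illustration. The records shown are invented, not real ggKbase data. Returning the owning projects alongside each duplicate name is meant to make it easy to see which project owners need to be contacted.

```python
from collections import defaultdict


def find_duplicate_names(records):
    """Return {(level, name): [project_slug, ...]} for names used more than once.

    records is an iterable of (level, project_slug, name) tuples.
    """
    owners = defaultdict(list)
    for level, project, name in records:
        owners[(level, name)].append(project)
    # Keep only the names that occur in more than one record.
    return {key: projects for key, projects in owners.items() if len(projects) > 1}


# Invented records for illustration only.
records = [
    ("contig", "project_a", "scaffold_1"),
    ("contig", "project_b", "scaffold_1"),
    ("contig", "project_a", "scaffold_2"),
    ("organism", "project_a", "Gallionellales_54_217"),
]

duplicates = find_duplicate_names(records)
for (level, name), projects in duplicates.items():
    print(level, name, projects)
```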

Set Data Name Validation

To prevent data with duplicate names from entering the system, we will add uniqueness validation both in the application (Ruby on Rails) and in the database.
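
The reason for validating in both places is that an application-level check alone can be bypassed by concurrent writers or by direct imports, while a database constraint alone gives poor error messages. In Rails this pairing is usually a model-level uniqueness validation backed by a unique index. The sketch below shows the same idea in Python with an in-memory SQLite database as a stand-in; the table name, column name, and add_contig function are hypothetical and do not reflect the ggKbase schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Database-level rule: the UNIQUE constraint rejects duplicate names.
conn.execute(
    "CREATE TABLE contigs (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)"
)


def add_contig(conn: sqlite3.Connection, name: str) -> bool:
    """Insert a contig name; return False if the name is already taken."""
    # Application-level check: gives a clean answer in the common case.
    taken = conn.execute(
        "SELECT 1 FROM contigs WHERE name = ?", (name,)
    ).fetchone()
    if taken:
        return False
    try:
        # "with conn" commits on success and rolls back on error.
        with conn:
            conn.execute("INSERT INTO contigs (name) VALUES (?)", (name,))
    except sqlite3.IntegrityError:
        # Safety net: another writer inserted the same name in between.
        return False
    return True


first = add_contig(conn, "JSantini_GMIN_Biofilm_1")
second = add_contig(conn, "JSantini_GMIN_Biofilm_1")
print(first, second)  # True False
```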

Modify Data Prep & Import

We will need to examine and modify the data preparation protocol outlined in these two posts: Data Preparation Metagenome and Data Preparation Curated Genome, in order to make sure that future data will have unique names and will not conflict with the names of the existing data.

Note: This process still needs to be worked out in detail.