Data Naming Convention

Abstract

All the data in ggKbase - projects, organisms, contigs, and features - must have unique names so that every entity can be queried by its name without the need to provide a search scope in the query.

History

Historically, the only kind of data that ggKbase required to have a unique name was the project, because it sits at the highest organizing level, above all the lower-level data such as organisms, contigs, and features. Note: User permissions are set at the project level.

Diagram showing the hierarchical structure of the data in ggKbase

When lower-level names are not unique, every search of the lower-level data has to be scoped to a project. This complicates the search query and slows the search down significantly. When no scope is provided, as in the site-wide search box, the search returns multiple organisms, contigs, or features with the same name, which is confusing.

Make Data Name Unique

Because of these problems, we have undertaken the following steps to make sure that all the data names are unique.

Naming Scheme

Binning Projects with REGULAR organisms

Project

The project name should be descriptive so that users can understand what the project is about. The project name must be unique, which results in a unique project slug. The project slug is constructed by concatenating the words of the project name with dashes (-). This is facilitated by the FriendlyId Ruby gem (https://github.com/norman/friendly_id).
Scheme: Project Name (Slug: project_name)
Example: JSantini GMIN Biofilm (jsantini_gmin_biofilm)
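
As a rough illustration of how a slug can be derived from a project name, here is a minimal plain-Ruby sketch. It does not use FriendlyId, and the helper name `project_slug` is hypothetical. Because the example slug above is joined with underscores, the sketch assumes an underscore as the default separator and leaves it configurable.

```ruby
# Hypothetical helper (not FriendlyId's API): derive a slug from a project name.
# Assumption: the slug is the lowercased alphanumeric words of the name,
# joined with a separator ("_" by default, as in the example above).
def project_slug(name, separator: "_")
  name.downcase.scan(/[a-z0-9]+/).join(separator)
end

puts project_slug("JSantini GMIN Biofilm")                 # jsantini_gmin_biofilm
puts project_slug("JSantini GMIN Biofilm", separator: "-") # jsantini-gmin-biofilm
```

Two project names that differ only in case or punctuation would map to the same slug under this sketch, which is one reason the project name itself must be unique.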

Organism

The organism name is made up of the project slug, the taxonomy, the GC content, and the coverage.
Scheme: Project_Slug_Taxonomy_gc#_cov#
Example: JSantini_GMIN_Biofilm_Gallionellales_54_217

Contig - The contig name is made up of the project slug, the keyword scaffold*, and a sequence number.
* Note: The word scaffold is not required; it comes from the data assembly pipeline. Another word can be used, or it can be omitted altogether.
Scheme: Project_Slug_scaffold_#
Example: JSantini_GMIN_Biofilm_scaffold_1

Feature - Feature name appends an additional sequence number to the end of its contig name.
Scheme: Project_Slug_scaffold_#_#
Example: JSantini_GMIN_Biofilm_scaffold_1_1
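
The three schemes above can be sketched in plain Ruby as follows. All helper names (`name_prefix`, `organism_name`, `contig_name`, `feature_name`) are hypothetical and not part of ggKbase. The sketch assumes the name prefix is the project name with spaces replaced by underscores, as the examples suggest.

```ruby
# Hypothetical helpers illustrating the naming schemes; not ggKbase code.

# Assumption: the prefix keeps the project name's case and joins its words
# with underscores, e.g. "JSantini GMIN Biofilm" -> "JSantini_GMIN_Biofilm".
def name_prefix(project_name)
  project_name.strip.split(/\s+/).join("_")
end

# Scheme: Project_Slug_Taxonomy_gc#_cov#
def organism_name(project_name, taxonomy, gc, coverage)
  [name_prefix(project_name), taxonomy, gc, coverage].join("_")
end

# Scheme: Project_Slug_scaffold_#
# The keyword is optional: pass another word, or nil to omit it.
def contig_name(project_name, number, keyword: "scaffold")
  [name_prefix(project_name), keyword, number].compact.join("_")
end

# Scheme: Project_Slug_scaffold_#_#
def feature_name(contig, number)
  "#{contig}_#{number}"
end

contig = contig_name("JSantini GMIN Biofilm", 1)
puts organism_name("JSantini GMIN Biofilm", "Gallionellales", 54, 217)
puts contig
puts feature_name(contig, 1)
```

Since every name starts with the unique project prefix, names from different projects cannot collide as long as the sequence numbers are unique within a project.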

Projects with CURATED organisms

Sometimes, organisms binned out by ggKbase go through a curation process that refines the organism content. The curated organism is then reimported into the system and associated with the original organism as a better version of it. To indicate that such an organism is curated, "_curated" is added to the name of the organism and its lower-level data as follows:

Organism - Organism name prefixed with project slug and suffixed with _curated.
Ex: Project_Slug_organism_name_gc#_cov#_curated

Contig - Project slug followed by scaffold‡ and a sequence number, suffixed with _curated.
‡ Note: The word scaffold is not required; it comes from the data assembly pipeline. Another word can be used, or it can be omitted altogether.
Ex: Project_Slug_scaffold_#_curated

Feature - Contig name with an additional sequence number inserted before the _curated suffix.
Ex: Project_Slug_scaffold_#_#_curated
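
A minimal Ruby sketch of the curated variants, under the assumption that the suffix always comes last, as the schemes above show. The helper names `curated_name` and `curated_feature_name` are hypothetical.

```ruby
# Hypothetical helpers for the curated naming schemes; not ggKbase code.
CURATED_SUFFIX = "_curated"

# Append the suffix once; an already curated name is returned unchanged.
def curated_name(name)
  name.end_with?(CURATED_SUFFIX) ? name : name + CURATED_SUFFIX
end

# Scheme: Project_Slug_scaffold_#_#_curated
# The feature number goes before the suffix, so the suffix is removed
# from the curated contig name first and re-added afterwards.
def curated_feature_name(curated_contig, number)
  base = curated_contig.delete_suffix(CURATED_SUFFIX)
  curated_name("#{base}_#{number}")
end

contig = curated_name("JSantini_GMIN_Biofilm_scaffold_1")
puts curated_name("JSantini_GMIN_Biofilm_Gallionellales_54_217")
puts contig
puts curated_feature_name(contig, 1)
```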

Fix the Existing Data

Rename Data with Duplicate Names

We need to start by making all the existing data unique. The first step is to find the duplicate names at every level of the data structure. With the knowledge and permission of the project owners, the affected data will then be renamed. Finally, the search index on the Elasticsearch server will be updated.
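
The duplicate-finding step could look roughly like the following Ruby sketch, which works on in-memory records rather than the real database. The `Record` struct, the `duplicate_names` helper, and the sample data are all hypothetical. The sketch assumes names only need to be unique within each kind of data (organism, contig, or feature).

```ruby
# Hypothetical in-memory stand-in for rows in the ggKbase database.
Record = Struct.new(:project, :kind, :name)

# Group records by kind and name, keep the groups with more than one
# member, and report which projects share each duplicated name.
def duplicate_names(records)
  records.group_by { |r| [r.kind, r.name] }
         .select { |_key, group| group.size > 1 }
         .transform_values { |group| group.map(&:project) }
end

records = [
  Record.new("project_a", :contig, "scaffold_1"),
  Record.new("project_b", :contig, "scaffold_1"),
  Record.new("project_a", :contig, "scaffold_2")
]

duplicate_names(records).each do |(kind, name), projects|
  puts "#{kind} #{name} is used in: #{projects.join(', ')}"
end
```

The resulting list of affected projects is what would be shared with the project owners before anything is renamed.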

Set Data Name Validation

To prevent data with duplicate names from entering the system, we will add name validation both in the application (Ruby on Rails) and in the database.
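
In a Rails application this kind of check is typically expressed as a uniqueness validation on the model, backed by a unique index in the database. The plain-Ruby sketch below only illustrates the behavior, rejecting a name that is already taken. The `NameRegistry` class and `DuplicateNameError` are hypothetical and are not ggKbase or Rails code.

```ruby
require "set"

# Hypothetical error raised when a name is already in use.
class DuplicateNameError < StandardError; end

# Hypothetical in-memory stand-in for a uniqueness check on import.
class NameRegistry
  def initialize
    @names = Set.new
  end

  # Set#add? returns nil when the element is already present,
  # so a second registration of the same name raises.
  def register(name)
    raise DuplicateNameError, "name already taken: #{name}" unless @names.add?(name)
    name
  end
end

registry = NameRegistry.new
registry.register("JSantini_GMIN_Biofilm_scaffold_1")
begin
  registry.register("JSantini_GMIN_Biofilm_scaffold_1")
rescue DuplicateNameError => e
  puts e.message
end
```

Checking in both places matters because an application-level check alone can be bypassed by concurrent imports, while a database constraint rejects the second insert regardless.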

Modify Data Prep & Import

We will need to examine and modify the data preparation protocol outlined in these two posts: Data Preparation Metagenome and Data Preparation Curated Genome, in order to make sure that future data will have unique names and will not conflict with the names of the existing data.

Note: This process still needs to be described in detail.