All the data in ggKbase - projects, organisms, contigs, and features - must have unique names so that every entity can be queried by name without providing a search scope in the query.
Historically, the only kind of data that ggKbase required to have a unique name was the project, because it sits at the highest organizing level above all the lower-level data, such as organisms, contigs, and features. Note: User permissions are set at the project level.
The problem when lower-level names are not unique is that every search of lower-level data must be scoped to a project. This complicates the search query and slows the search down significantly. When no scope is provided, as in the site-wide search box, the search returns multiple organisms, contigs, or features with the same name, which is confusing.
Make Data Name Unique
Because of these problems, we have undertaken the following steps to make sure that all the data names are unique.
Binning Projects with REGULAR organisms
The project name should be descriptive so that users can understand what the project is about. The project name must be unique, which results in a unique project slug. The project slug is constructed by concatenating the words of the project name with dashes (-). This is facilitated by the FriendlyId Ruby gem (https://github.com/norman/friendly_id).
Scheme: Project Name (Slug: project_name)
Example: JSantini GMIN Biofilm (jsantini_gmin_biofilm)
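The slug derivation can be illustrated with a minimal plain-Ruby sketch. The helper name slugify is hypothetical, and this is not how FriendlyId itself generates slugs. The separator is left as a parameter because the text above mentions dashes while the example slug uses underscores.

```ruby
# Hypothetical helper: derive a slug from a project name.
# The default separator follows the example slug above; this is an assumption.
def slugify(name, separator = "_")
  sep = Regexp.escape(separator)
  name.downcase
      .gsub(/[^a-z0-9]+/, separator)              # collapse non-alphanumeric runs into one separator
      .gsub(/\A(?:#{sep})+|(?:#{sep})+\z/, "")    # trim separators from both ends
end

puts slugify("JSantini GMIN Biofilm")        # jsantini_gmin_biofilm
puts slugify("JSantini GMIN Biofilm", "-")   # jsantini-gmin-biofilm
```

Because the slug is derived from the project name, two projects with the same name would produce the same slug, which is why the project name itself has to be unique.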
Organism - Organism name is made up of the project slug, taxonomy, GC content, and coverage.
Contig - Contig name is made up of project slug with the keyword scaffold* and the number sequence.
* Note: The word scaffold is not required but obtained in the data assembly pipeline. It can be other words or omitted altogether.
Feature - Feature name appends an additional sequence number to the end of its contig name.
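The contig and feature schemes above could be sketched as follows. The function names, the underscore separator, and the unpadded numbers are assumptions made for illustration. The organism name is left out because the exact formatting of taxonomy, GC, and coverage is not specified here.

```ruby
# Hypothetical builders; the "_" separator and number format are assumptions.
def contig_name(project_slug, number, keyword = "scaffold")
  # The keyword is optional, so drop it when it is nil or empty.
  [project_slug, keyword, number]
    .reject { |part| part.nil? || part.to_s.empty? }
    .join("_")
end

def feature_name(contig, number)
  # A feature name is its contig name plus one more sequence number.
  "#{contig}_#{number}"
end

contig = contig_name("jsantini_gmin_biofilm", 12)
puts contig                    # jsantini_gmin_biofilm_scaffold_12
puts feature_name(contig, 3)   # jsantini_gmin_biofilm_scaffold_12_3
```

Since every lower-level name starts with the unique project slug, names from different projects cannot collide as long as the numbers are unique within a project.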
Projects with CURATED organisms
Sometimes, organisms binned out by ggKbase go through a curation process that refines the organism content. The curated organism is reimported into the system and associated with the original organism as the better version of it. To indicate that such an organism is curated, "_curated" is added to the name of the organism and its lower-level data, as follows:
Organism - Organism name prefixed with project slug and suffixed with _curated.
Contig - Project slug suffixed by scaffold‡ and the number count.
‡ Note: The word scaffold is not required but obtained in the data assembly pipeline. It can be other words or omitted altogether.
Feature - Contig name appended with an additional number count.
Fix the Existing Data
Rename Data with Duplicate Names
We need to start by making all the existing data names unique. The steps involve finding the duplicate names at all levels of the data structure. With the knowledge and permission of the project owners, the affected data will be renamed. The search index on the Elasticsearch server will then be updated.
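As a rough illustration of the duplicate-finding step, the sketch below groups names by the projects that use them and keeps only the names that occur more than once. The input shape (pairs of project slug and data name) and the sample values are hypothetical. In practice the same grouping would more likely be done with a query against the database.

```ruby
# Hypothetical input: [project_slug, data_name] pairs, e.g. exported from the database.
def duplicate_names(records)
  by_name = Hash.new { |hash, key| hash[key] = [] }
  records.each { |project, name| by_name[name] << project }
  # Keep only the names that occur more than once.
  by_name.select { |_name, projects| projects.size > 1 }
end

records = [
  ["project_a", "scaffold_1"],
  ["project_b", "scaffold_1"],
  ["project_b", "scaffold_2"]
]

duplicate_names(records).each do |name, projects|
  puts "#{name}: #{projects.join(', ')}"   # scaffold_1: project_a, project_b
end
```

The resulting list of projects per duplicate name also shows which project owners need to be contacted before anything is renamed.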
Set Data Name Validation
To prevent data with duplicate names from entering the system, we will set uniqueness validation both in the application (Ruby on Rails) and in the database.
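The two layers can be pictured with a small plain-Ruby stand-in. The class and method names are hypothetical. In a Rails application the equivalents would commonly be a uniqueness validation on the model plus a unique index in the database, though the exact mechanism is not specified here.

```ruby
require "set"

class DuplicateNameError < StandardError; end

# Hypothetical in-memory stand-in for a table with a unique constraint on name.
class NameRegistry
  def initialize
    @names = Set.new
  end

  # Application-level check, so the caller can report a friendly error first.
  def taken?(name)
    @names.include?(name)
  end

  # Storage-level guard: rejects a duplicate even if the caller skipped taken?.
  def insert!(name)
    # Set#add? returns nil when the element is already present.
    raise DuplicateNameError, "name already exists: #{name}" unless @names.add?(name)
    name
  end
end

registry = NameRegistry.new
registry.insert!("jsantini_gmin_biofilm_scaffold_1")
puts registry.taken?("jsantini_gmin_biofilm_scaffold_1")   # true

begin
  registry.insert!("jsantini_gmin_biofilm_scaffold_1")
rescue DuplicateNameError => e
  puts e.message   # name already exists: jsantini_gmin_biofilm_scaffold_1
end
```

Checking at both levels is useful because an application-level check alone can be bypassed by bulk imports or by two writes that arrive at the same time, while the storage-level guard always applies.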
Modify Data Prep & Import
We will need to examine and modify the data preparation protocol outlined in these two posts: Data Preparation Metagenome and Data Preparation Curated Genome, in order to make sure that future data will have unique names and will not conflict with the names of the existing data.
Need to detail out this process!