This post explains the steps required to import an already manually curated genome into ggKbase.
How to curate a genome: https://ggkbase-help.berkeley.edu/how-to/genome-curation/
How to curate many genomes: https://ggkbase-help.berkeley.edu/how-to/bulk-genome-curation/
Step 1: Rename Fasta File
Make your curated genome distinct from the original organism in ggKbase by appending “_curated” or “_genome_final” to the basename. Likewise, rename all of the scaffolds in the curated genome to “>basename_curated” or “>basename_final”. If a contig was unchanged during manual curation, it will simply be a renamed duplicate of what is already in ggKbase.
You can do this using sed, for example:
> sed 's/basename/basename_curated/g' genome.fa > genome_curated.fa
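It is worth sanity-checking the rename before moving on. The following is a self-contained sketch using a hypothetical basename “ACME1” (restricting sed to header lines with `/^>/` avoids touching sequence data):

```shell
# Toy two-scaffold FASTA with the hypothetical basename "ACME1"
printf '>ACME1_scaffold_1\nACGT\n>ACME1_scaffold_2\nGGCC\n' > ACME1.fa

# Append "_curated" to the basename, header lines only
sed '/^>/ s/ACME1/ACME1_curated/' ACME1.fa > ACME1_curated.fa

# Sanity checks: the header count is unchanged, and every header
# now carries the "_curated" suffix
grep -c '^>' ACME1_curated.fa    # 2
grep '^>' ACME1_curated.fa
```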
Step 2: Gene and Small RNA Prediction
For annotating the assembly output, we need to predict genes and small RNAs, and then functionally annotate the predicted open reading frames. We use Prodigal for gene prediction.
Running prodigal:
> prodigal -i genome_curated.fa -o genome_curated.fa.genes -a genome_curated.fa.genes.faa -d genome_curated.fa.genes.fna -m -p single
NOTE: make sure to run prodigal with the single setting for a single genome! (as opposed to meta for metagenome)
The “-m” flag prevents Prodigal from making predictions that span “N”-containing stretches. This command generates three output files: the gene predictions, the corresponding protein sequences, and the corresponding DNA sequences.
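A quick way to sanity-check the prediction step is to confirm that the protein and nucleotide outputs contain the same number of records. The sketch below uses tiny hand-made stand-ins for the Prodigal output; in practice, just run the two grep commands in the directory holding your real files:

```shell
# Toy stand-ins for the prodigal output (hypothetical two-gene genome);
# skip this step when checking real output
printf '>ACME1_curated_1_1\nMKV\n>ACME1_curated_1_2\nMLL\n' > genome_curated.fa.genes.faa
printf '>ACME1_curated_1_1\nATGAAAGTT\n>ACME1_curated_1_2\nATGTTATTA\n' > genome_curated.fa.genes.fna

# The .faa and .fna files carry one record per predicted gene,
# so their header counts should always match
faa_n=$(grep -c '^>' genome_curated.fa.genes.faa)
fna_n=$(grep -c '^>' genome_curated.fa.genes.fna)
echo "faa=$faa_n fna=$fna_n"
```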
Finding the 16S rRNA genes:
We use the 16s.sh shell wrapper script, which runs Chris Brown’s 16SfromHMM.py script with the default settings and the ssu-align-0p1.1.cm database. Run it as follows (you must follow this naming scheme):
> /groups/banfield/software/pipeline/v1.1/scripts/16s.sh genome_curated.fa > genome_curated.fa.16s
Predicting tRNA genes using the tRNAscanSE wrapper script:
> /groups/banfield/software/pipeline/v1.1/scripts/trnascan_pusher.rb -i genome_curated.fa > /dev/null 2>&1
(The redirection to /dev/null is there because the tRNAscanSE Perl script triggers many warnings from the Perl interpreter due to its coding style. These don’t affect the output, but they are obnoxious; redirecting to /dev/null keeps them out of your terminal window.)
Step 3: Annotation
We annotate the predicted proteins by doing similarity searches using usearch, comparing each protein sequence against the following databases:
- KEGG (curated database with excellent metabolic pathways metadata)
- UniRef100 (curated dataset derived from UniProt; provides functional and taxonomic information)
- UniProt (a comprehensive, non-redundant database derived from numerous sources)
Annotation searches must be run on the cluster using the updated script developed by Rohan:
/groups/banfield/users/rohan/ggdb/cluster_usearch_wrev-copy.edit.python.py
You must submit three jobs per genome, one per database. Runtime varies with input size and queue status, and you may only submit 30 jobs at a time.
Run the jobs as follows:
> python3 /groups/banfield/users/rohan/ggdb/cluster_usearch_wrev-copy.edit.python.py -i /FULL/PATH/TO/genome_curated.fa.genes.faa -d kegg -t 48
> python3 /groups/banfield/users/rohan/ggdb/cluster_usearch_wrev-copy.edit.python.py -i /FULL/PATH/TO/genome_curated.fa.genes.faa -d uni -t 48
> python3 /groups/banfield/users/rohan/ggdb/cluster_usearch_wrev-copy.edit.python.py -i /FULL/PATH/TO/genome_curated.fa.genes.faa -d uniprot -t 48
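The three submissions above can also be generated with a loop. As a precaution, the sketch below only echoes the commands to a file for inspection (a dry run); drop the echo and the output file to actually submit the jobs:

```shell
# Dry run: write the three submission commands (one per database)
# to jobs.txt and inspect them before running anything on the cluster
for db in kegg uni uniprot; do
  echo "python3 /groups/banfield/users/rohan/ggdb/cluster_usearch_wrev-copy.edit.python.py -i /FULL/PATH/TO/genome_curated.fa.genes.faa -d $db -t 48"
done > jobs.txt
cat jobs.txt
```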
Make sure the input file corresponds to the prodigal output:
genome_curated.fa.genes.faa
Summary
After you have gone through all of the above steps, you should have a directory that looks similar to the following:
> ls
genome_curated.fa
genome_curated.16S.cmsearch
genome_curated.fa.16s
genome_curated.fa.genes
genome_curated.fa.genes.faa
genome_curated.fa.genes.faa-vs-kegg.b6+
genome_curated.fa.genes.faa-vs-uni.b6+
genome_curated.fa.genes.faa-vs-uniprot.b6+
genome_curated.fa.genes.fna
genome_curated.fa.trnascan
genome_curated.fa.trnascan.fasta
Note again the naming conventions used: anything other than this convention will likely break the next step, the ggKbase import.
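Before requesting the import, it helps to confirm that every expected output file is present. A minimal sketch (the `touch` line just creates empty stand-ins so the demo is self-contained; omit it and run the loop in your real genome directory):

```shell
# Expected pipeline outputs (see the directory listing above)
files="genome_curated.fa genome_curated.fa.16s genome_curated.fa.genes \
genome_curated.fa.genes.faa genome_curated.fa.genes.fna \
genome_curated.fa.genes.faa-vs-kegg.b6+ \
genome_curated.fa.genes.faa-vs-uni.b6+ \
genome_curated.fa.genes.faa-vs-uniprot.b6+ \
genome_curated.fa.trnascan"

touch $files  # demo only: create empty stand-ins; omit for a real check

# Report anything missing
missing=0
for f in $files; do
  [ -e "$f" ] || { echo "MISSING $f"; missing=1; }
done
[ "$missing" -eq 0 ] && echo "all expected files present"
```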
Step 4: Import to ggKbase
Slack Leylen, who will run the import as follows:
RAILS_ENV=production bundle exec thor importer:curated [PATH] -p [ggkbase PROJECT_NAME] -n [NEW_ORG_NAME] -o [OLD_ORG_NAME] -f[NEW_ORG_NAME]
Please include:
- ggKbase project name (for that particular project, not the group project)
- Original genome name
- Name of new genome (add “_curated” or “_final” to make name of new genome distinct from uncurated version)
- Path to files for curated genome on biotite
Info on the web interface for curated genomes on ggKbase: