Data Preparation – Curated Genome

This post explains what needs to take place to import an already manually curated genome into ggKbase.

How to curate a genome: https://ggkbase-help.berkeley.edu/how-to/genome-curation/

How to curate many genomes: https://ggkbase-help.berkeley.edu/how-to/bulk-genome-curation/

Step 1: Rename Fasta File

Make your curated genome distinct from the original organism in ggKbase by adding "_curated" or "_genome_final" to the end of the basename. Likewise, rename all of the scaffolds in the curated genome to ">basename_curated" or ">basename_final". If a contig was unchanged during manual curation, it will simply be a renamed duplicate of what is already in ggKbase.

You can do this using sed, for example:

> sed 's/basename/basename_curated/g' genome.fa > genome_curated.fa
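As a toy illustration (using a hypothetical basename "bin12"; substitute your own), the rename plus a quick header check might look like:

```shell
# Create a toy two-scaffold fasta, rename its headers, then inspect them
printf '>bin12_scaffold_1\nACGTACGT\n>bin12_scaffold_2\nGGCCTTAA\n' > genome.fa
sed 's/bin12/bin12_curated/g' genome.fa > genome_curated.fa
grep '^>' genome_curated.fa
# → >bin12_curated_scaffold_1
# → >bin12_curated_scaffold_2
```

Checking the headers with grep before moving on is a cheap way to catch a basename that didn't match the sed pattern.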

Step 2: Gene and Small RNA Prediction

For annotating the assembly output, we need to predict genes and small RNAs, and then functionally annotate the predicted open reading frames. We use Prodigal for gene prediction.

Running prodigal:

> prodigal -i genome_curated.fa -o genome_curated.fa.genes -a genome_curated.fa.genes.faa -d genome_curated.fa.genes.fna -m -p single

NOTE: make sure to run Prodigal with the single setting for a single genome (as opposed to meta, which is for metagenomes)!

The "-m" flag will prevent Prodigal from making predictions that span "N"-containing stretches. This command will generate 3 output files: the gene coordinates (-o; Prodigal's default GenBank-like format, since no -f option is given), the corresponding protein sequences (-a), and the corresponding DNA sequences (-d).
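A quick sanity check on the protein output is to count FASTA headers, since each ">" header in the .faa file corresponds to one predicted protein. Shown here on a toy .faa file standing in for the real genome_curated.fa.genes.faa:

```shell
# Each '>' header in the .faa is one predicted protein; count them.
# (toy file standing in for genome_curated.fa.genes.faa)
printf '>genome_curated_1_1\nMKTLL\n>genome_curated_1_2\nMALWM\n' > toy.genes.faa
grep -c '^>' toy.genes.faa
# → 2
```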

Finding the 16S rRNA genes:

Use the 16s.sh shell wrapper script, which runs Chris Brown's 16SfromHMM.py script with the default settings and the ssu-align-0p1.1.cm database. This shell script is run as follows (you must follow this naming scheme):

> /groups/banfield/software/pipeline/v1.1/scripts/16s.sh genome_curated.fa > genome_curated.fa.16s

Predicting tRNA genes using the tRNAscanSE wrapper script:

> /groups/banfield/software/pipeline/v1.1/scripts/trnascan_pusher.rb -i genome_curated.fa > /dev/null 2>&1

(The reason for the redirection to /dev/null is that the tRNAscanSE Perl script triggers a ton of warnings from the Perl interpreter due to its coding style. These don't affect the output, but are obnoxious nonetheless; redirecting to /dev/null keeps them out of your terminal window.)
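If the "2>&1" part of that redirection is unfamiliar, this minimal demonstration shows the mechanics: "2>&1" sends stderr to wherever stdout currently points, so "> /dev/null 2>&1" discards both streams.

```shell
# Both the stdout line and the stderr warning are discarded;
# only output written after the redirected group reaches the terminal.
{ echo "stdout noise"; echo "stderr warning" >&2; } > /dev/null 2>&1
echo "terminal stays clean"
# → terminal stays clean
```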

Step 3: Annotation

We annotate the predicted proteins by doing similarity searches using usearch, comparing each protein sequence against the following databases:

  • KEGG (curated database with excellent metabolic pathways metadata)
  • UniRef100 (curated dataset derived from UniProt; provides functional and taxonomic information)
  • UniProt (a comprehensive, non-redundant database derived from numerous sources)

Annotation searches can be done on the cluster using the cluster_usearch_wrev.rb script. You will need to submit 3 jobs per sample to the cluster queue (one for each database). It may take a while to complete, depending on the size of your input file and the status of the queue. You may only submit 30 jobs at a time to the cluster. This is how they should be run:

> sbatch --wrap "/groups/banfield/software/pipeline/v1.1/scripts/cluster_usearch_wrev.rb -i /FULL/PATH/TO/genome_curated.fa.genes.faa -k -d kegg --nocluster"
> sbatch --wrap "/groups/banfield/software/pipeline/v1.1/scripts/cluster_usearch_wrev.rb -i /FULL/PATH/TO/genome_curated.fa.genes.faa -k -d uni --nocluster"
> sbatch --wrap "/groups/banfield/software/pipeline/v1.1/scripts/cluster_usearch_wrev.rb -i /FULL/PATH/TO/genome_curated.fa.genes.faa -k -d uniprot --nocluster"
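Since only the -d argument changes between the three submissions, you can also loop over the database names. This sketch just echoes the commands so you can review them first; drop the echo to actually submit (the /FULL/PATH/TO/ placeholder still needs your real path):

```shell
# Print the three sbatch commands, one per database, for review
for db in kegg uni uniprot; do
  echo "sbatch --wrap \"/groups/banfield/software/pipeline/v1.1/scripts/cluster_usearch_wrev.rb -i /FULL/PATH/TO/genome_curated.fa.genes.faa -k -d ${db} --nocluster\""
done
```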

You will know all your jobs are complete and off the cluster when two things are true:

  1. The cluster queue has no jobs owned by you still running or queued (type squeue to see your jobs on the cluster).
  2. No *.fix intermediate files remain in the output directory.
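A minimal check for the second condition, assuming you are sitting in the output directory:

```shell
# No *.fix files means the intermediate stage has finished and been cleaned up
if ls *.fix >/dev/null 2>&1; then
  echo "jobs still running"
else
  echo "all jobs complete"
fi
```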

After this, gzip your annotation files:

> gzip *.b6

Now each output b6 file is converted into a "b6+" file, which contains the actual annotation text from the database hit:

> /shared/software/bin/annolookup.py genome_curated.fa.genes.faa-vs-kegg.b6.gz kegg > genome_curated.fa.genes.faa-vs-kegg.b6+
> /shared/software/bin/annolookup.py genome_curated.fa.genes.faa-vs-uni.b6.gz uniref > genome_curated.fa.genes.faa-vs-uni.b6+
> /shared/software/bin/annolookup.py genome_curated.fa.genes.faa-vs-uniprot.b6.gz uniprot > genome_curated.fa.genes.faa-vs-uniprot.b6+

Note that annolookup.py reads a gzip'd file, but the output should be uncompressed for import.

Summary

After you have gone through all of the above steps, you should have a directory that looks similar to the following:

> ls
genome_curated.fa
genome_curated.16S.cmsearch
genome_curated.fa.16s
genome_curated.fa.genes
genome_curated.fa.genes.faa
genome_curated.fa.genes.faa-vs-kegg.b6+
genome_curated.fa.genes.faa-vs-kegg.b6.gz
genome_curated.fa.genes.faa-vs-uni.b6+
genome_curated.fa.genes.faa-vs-uni.b6.gz
genome_curated.fa.genes.faa-vs-uniprot.b6+
genome_curated.fa.genes.faa-vs-uniprot.b6.gz
genome_curated.fa.genes.fna
genome_curated.fa.trnascan
genome_curated.fa.trnascan.fasta

Note again the naming conventions used: anything other than this convention will likely prevent the next step, ggKbase import, from working.

Step 4: Import to ggKbase

Slack Allison to run the import.

For a single genome, please include:

  1. ggKbase project name
  2. Original genome name
  3. Name of new genome (add "_curated" or "_final" to make name of new genome distinct from uncurated version)
  4. Path to files on biotite

For many genomes, please provide a CSV formatted as follows:

path, file prefix, project slug (URL code), old name, new name

/path/to/directory/, genome_curated, project_slug, old_name, new_name

Info on the web interface for curated genomes on ggKbase:

Curated Genome Import User Interface