Bulk Genome Curation

For more detailed instructions on each step, see http://ggkbase-help.berkeley.edu/how-to/genome-curation/ and https://ggkbase-help.berkeley.edu/overview/data-preparation-curated-genome/.

Overview

  1. Run ra2 on each genome.
  2. Run prodigal on each genome using the single setting (-p single). See https://ggkbase-help.berkeley.edu/overview/data-preparation-curated-genome/
  3. Concatenate the per-genome files of each type (fa, fa.genes, fa.genes.faa, fa.genes.fna), then run 16s, trnascan, usearch, clean.rb, anno-lookup, etc. on the concatenated files.
  4. Use break_bulk_bin.rb to unshuffle the genomes (see below for specifics).
  5. Submit to ggKbase (see below for specifics).
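
The concatenation in step 3 could be sketched in Python roughly as follows. This is a minimal sketch, not part of the ggKbase tooling: it assumes all per-genome files sit directly in one directory and end in the four suffixes listed above, and the function name concat_by_suffix and its arguments are hypothetical.

```python
from pathlib import Path

# File suffixes named in step 3 of the overview.
SUFFIXES = [".fa", ".fa.genes", ".fa.genes.faa", ".fa.genes.fna"]


def concat_by_suffix(genome_dir, out_basename):
    """Concatenate per-genome files into one combined file per suffix.

    Assumption: every per-genome file lives directly in genome_dir.
    Write the output somewhere else (out_basename should point outside
    genome_dir) so a rerun does not pick up the combined files as input.
    Returns {suffix: number of files concatenated}.
    """
    groups = {suffix: [] for suffix in SUFFIXES}
    for path in sorted(Path(genome_dir).iterdir()):
        if not path.is_file():
            continue
        for suffix in SUFFIXES:
            if path.name.endswith(suffix):
                groups[suffix].append(path)
                break

    counts = {}
    for suffix, paths in groups.items():
        counts[suffix] = len(paths)
        if not paths:
            continue  # nothing of this type, so no combined file is written
        with Path(out_basename + suffix).open("w") as out:
            for path in paths:
                text = path.read_text()
                out.write(text)
                # Keep records from running together if a file lacks
                # a trailing newline.
                if text and not text.endswith("\n"):
                    out.write("\n")
    return counts
```

For example, concat_by_suffix("genomes", "combined/all_bins") would write combined/all_bins.fa, combined/all_bins.fa.genes and so on, provided the combined directory already exists; all_bins is then the basename to pass to break_bulk_bin.rb.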

break_bulk_bin.rb

Requires a scaffold-to-bin lookup file that you create, with one line per scaffold in the form scaff \t bin (scaffold name, a tab, then the bin name).

The lookup file must not contain any whitespace other than the \t separator. Do not include read_count or read_length in this file.
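
One way to build and sanity-check such a lookup file is sketched below in Python. This is an illustrative sketch only: it assumes that read_length and read_count appear as whitespace-separated fields after the scaffold name in the FASTA headers, and the function names and the in-memory bins mapping are hypothetical, not part of break_bulk_bin.rb.

```python
import re


def scaffold_name(fasta_header):
    """Return the scaffold name from a FASTA header line.

    Keeps only the first whitespace-delimited token, so trailing fields
    such as read_length or read_count are dropped (assumed header layout).
    """
    return fasta_header.lstrip(">").split()[0]


def scaff2bin_lines(bins):
    """Build 'scaff<TAB>bin' lines.

    bins maps a bin name to the FASTA header lines of its scaffolds
    (a hypothetical in-memory layout chosen for this sketch).
    """
    return [
        f"{scaffold_name(header)}\t{bin_name}"
        for bin_name, headers in bins.items()
        for header in headers
    ]


def check_scaff2bin_line(line):
    """Return a list of problems found in one lookup line (empty if OK)."""
    problems = []
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        problems.append("expected exactly one tab")
    if any(re.search(r"\s", field) for field in fields):
        problems.append("whitespace other than the tab separator")
    if any(not field for field in fields):
        problems.append("empty field")
    if "read_count" in line or "read_length" in line:
        problems.append("contains read_count or read_length")
    return problems
```

Running check_scaff2bin_line over every line of the file before calling break_bulk_bin.rb catches stray spaces and Windows line endings (the leftover carriage return counts as extra whitespace).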

Documentation on this script:

user@biotite /opt/bin/bio $ break_bulk_bin.rb  -h

Summary:
 break_bulk_bin.rb: Unshuffle combined ggKbase submissions into individual directories.

Synopsis:
 break_bulk_bin.rb --lookup=<scaffold to bin lookup file>
                --outdir=<where will new directories will be created: default=current dir>
                --basename=<basename of the combined files to process>

Options:
 -l, --lookup=<s>      name of scaff2bin file (required string)
 -o, --outdir=<s>      name of top-level output dir (required string) (default: .)
 -b, --basename=<s>    basename of combined input files (required string)
 -v, --version         Print version and exit
 -h, --help            Show this message

For example:

break_bulk_bin.rb -l scaff2bin.tsv -o curated -b basename_of_input_files

 

Submitting to ggKbase

Run break_bulk_bin.rb with the settings shown above and importing will be easy for everyone. Then submit the path to the top-level directory that contains the output directories for each genome.

Make sure the output directories are named like this (i.e., don't change the defaults):

anamox3_Acidobacteria_71_4.curated/
anamox3_BJP_IG2069_Ignavibacteriae_38_11_30_7.curated/
anamox3_Pedosphaera_parvula_66_5.curated/
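
Before submitting, a quick check of the directory names can be sketched in Python as below. This assumes that the .curated suffix shown in the listing above is the expected default; the function name noncurated_dirs is hypothetical, and the sketch does not inspect the files inside each directory.

```python
from pathlib import Path


def noncurated_dirs(outdir, suffix=".curated"):
    """Return names of subdirectories of outdir that lack the suffix."""
    return sorted(
        path.name
        for path in Path(outdir).iterdir()
        if path.is_dir() and not path.name.endswith(suffix)
    )
```

An empty list from noncurated_dirs("curated") means every per-genome directory under the output directory follows the naming shown above.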

When the Tech Team imports your genomes, each genome name will be the basename with _curated appended. If you would like different names, you can either rename the genomes on ggKbase or create a spreadsheet for import with your desired names, as described on this help page:

http://ggkbase-help.berkeley.edu/overview/data-preparation-curated-genome/