Import Public Genomes for Analysis

Public genomes (or genomes from collaborators) for which we don’t have reads and assembly files need a modified import preparation before ggKbase analysis. The genomes are concatenated into a mock metagenome, run through the normal metagenome preparation pipeline, and separated back into the original genomes with a scaff2bin file after import.

1. Put scaffold files for each genome in a folder that Rohan will create

2. Make a scaff2bin file

Make sure the genome files are named with the project name and genome name, with .fasta as the filename extension:

PROJECTNAME_GENOMENAME.fasta

The scaffolds in each file should be named uniquely with the project name and genome name, and each header should contain nothing else (no additional space-separated fields):

>PROJECTNAME_GENOMENAME_scaffold_1
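Before building the scaff2bin file, it can help to confirm that the headers really are unique and carry no extra fields. The following is a minimal sketch of such a check; the function name check_fasta_headers is hypothetical and not part of the pipeline.

```shell
# Hypothetical helper (not part of the pipeline): report header problems
# across genome FASTA files. Prints duplicated scaffold names and headers
# that contain whitespace; returns non-zero if any problem is found.
check_fasta_headers() {
    local status=0 dups extra
    # Scaffold names that occur more than once across all files
    dups=$(grep -h '^>' "$@" | sort | uniq -d)
    if [ -n "$dups" ]; then
        echo "Duplicate scaffold names:" >&2
        echo "$dups" >&2
        status=1
    fi
    # Headers with anything after the scaffold name (space or tab)
    extra=$(grep -H '^>.*[[:space:]]' "$@" || true)
    if [ -n "$extra" ]; then
        echo "Headers with extra fields:" >&2
        echo "$extra" >&2
        status=1
    fi
    return $status
}

# Usage: check_fasta_headers PROJECTNAME_*.fasta
```

If the function prints nothing and returns 0, the files match the naming rules described above.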

Then create the scaff2bin file. You can use this script:

sh /home/asharrar/scripts/fasta2scaff2bin.sh
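If that script is not available, the file can be built by hand. The sketch below assumes the scaff2bin format is one tab-separated "scaffold name, bin name" pair per line and that the bin name is the file name without .fasta. Both are assumptions, and the lab script may format its output differently. The function name make_scaff2bin and the output file name in the usage line are hypothetical.

```shell
# Hypothetical stand-in for illustration; not the lab script above.
# ASSUMPTION: scaff2bin is "scaffold<TAB>bin", one scaffold per line.
# The bin name is taken from the file name
# (PROJECTNAME_GENOMENAME.fasta -> PROJECTNAME_GENOMENAME).
make_scaff2bin() {
    local f bin
    for f in "$@"; do
        bin=$(basename "$f" .fasta)
        # Keep only the first word of each header, minus the leading ">"
        awk -v bin="$bin" '/^>/ { sub(/^>/, "", $1); print $1 "\t" bin }' "$f"
    done
}

# Usage: make_scaff2bin PROJECTNAME_*.fasta > PROJECTNAME.scaff2bin.tsv
```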

3. Concatenate the scaffold files

cat *.fasta > PROJECTNAME_scaffold.fa
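One thing to watch for: if an input file does not end with a newline, cat joins its last sequence line to the next file's header, and that scaffold silently disappears into the previous one. A minimal sketch of a record-count check follows; the function name check_concat is hypothetical.

```shell
# Hypothetical helper: compare the number of FASTA records in the
# concatenated file with the total counted file by file in the inputs.
check_concat() {
    local combined=$1 n_in n_out
    shift
    # Count headers per input file, so a missing trailing newline in one
    # file cannot hide the first header of the next
    n_in=$(grep -h '^>' "$@" | wc -l | tr -d '[:space:]')
    n_out=$(grep -c '^>' "$combined" || true)
    if [ "$n_in" -ne "$n_out" ]; then
        echo "Record count mismatch: inputs=$n_in combined=$n_out" >&2
        return 1
    fi
    echo "$n_out records"
}

# Usage: check_concat PROJECTNAME_scaffold.fa *.fasta
```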

4. Add read length & count to scaffold headers

The scaffold file needs to have headers that include the read length and count for import into ggKbase:

>PROJECTNAME_GENOMENAME_scaffold_32 read_length_150 read_count_13456

**If you want the coverage values to be accurate, and have access to the reads, follow the directions in Step 4 on the Data Preparation – Metagenome page**

If you don’t care about the coverage values being accurate (i.e. you’re not planning on doing manual curation) or you don’t have access to the reads, you can use fake read count values. Add the read length (real if you can find it) and read count (fake) to the file headers:

sed -i '/>/s/$/ read_length_150 read_count_1000/' PROJECTNAME_scaffold.fa
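Because sed -i edits the file in place, running the command twice appends the suffix twice. The sketch below checks that every header carries the suffix exactly once, assuming the scaffold names themselves contain no whitespace. The function name check_read_tags is hypothetical.

```shell
# Hypothetical helper: succeed only if every header ends with exactly one
# " read_length_N read_count_N" suffix.
# ASSUMPTION: scaffold names contain no whitespace.
check_read_tags() {
    local fa=$1 total tagged
    total=$(grep -c '^>' "$fa" || true)
    # A header tagged once has no whitespace before the suffix
    tagged=$(grep -c -E '^>[^[:space:]]+ read_length_[0-9]+ read_count_[0-9]+$' "$fa" || true)
    if [ "$total" -eq 0 ] || [ "$total" -ne "$tagged" ]; then
        echo "$tagged of $total headers are correctly tagged" >&2
        return 1
    fi
}

# Usage: check_read_tags PROJECTNAME_scaffold.fa
```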

5. Predict genes with Prodigal

prodigal -i PROJECTNAME_scaffold.fa -o PROJECTNAME_scaffold.fa.genes -a PROJECTNAME_scaffold.fa.genes.faa -d PROJECTNAME_scaffold.fa.genes.fna -m -p meta

7. Predict 16S rRNA genes

/groups/banfield/software/pipeline/v1.1/scripts/16s.sh PROJECTNAME_scaffold.fa > PROJECTNAME_scaffold.fa.16s

8. Predict tRNA genes

/groups/banfield/software/pipeline/v1.1/scripts/trnascan_pusher.rb -i PROJECTNAME_scaffold.fa > /dev/null 2>&1

9. Annotate genes

Submit annotation searches to the cluster:

sbatch --wrap "/groups/banfield/software/pipeline/v1.1/scripts/cluster_usearch_wrev.rb -i /FULL/PATH/TO/PROJECTNAME_scaffold.fa.genes.faa -k -d kegg --nocluster"

sbatch --wrap "/groups/banfield/software/pipeline/v1.1/scripts/cluster_usearch_wrev.rb -i /FULL/PATH/TO/PROJECTNAME_scaffold.fa.genes.faa -k -d uni --nocluster"

sbatch --wrap "/groups/banfield/software/pipeline/v1.1/scripts/cluster_usearch_wrev.rb -i /FULL/PATH/TO/PROJECTNAME_scaffold.fa.genes.faa -k -d uniprot --nocluster"

When all your jobs are complete and off the cluster, gzip the annotation files:

gzip *.b6

Then convert each annotation file into a “b6+” file:

/shared/software/bin/annolookup.py PROJECTNAME_scaffold.fa.genes.faa-vs-kegg.b6.gz kegg > PROJECTNAME_scaffold.fa.genes.faa-vs-kegg.b6+

/shared/software/bin/annolookup.py PROJECTNAME_scaffold.fa.genes.faa-vs-uni.b6.gz uniref > PROJECTNAME_scaffold.fa.genes.faa-vs-uni.b6+

/shared/software/bin/annolookup.py PROJECTNAME_scaffold.fa.genes.faa-vs-uniprot.b6.gz uniprot > PROJECTNAME_scaffold.fa.genes.faa-vs-uniprot.b6+
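The three conversions can also be run in a loop that stops if an input file is missing, for example because a cluster job has not finished. This is only a sketch: the variable ANNOLOOKUP and the function make_b6plus are names introduced here for illustration, and the database and label pairs are copied from the commands above.

```shell
# Sketch: run the three conversions above in one loop.
# ANNOLOOKUP and make_b6plus are hypothetical names; the script path and
# the database/label pairs are the ones used in the commands above.
ANNOLOOKUP=${ANNOLOOKUP:-/shared/software/bin/annolookup.py}

make_b6plus() {
    local faa=$1 pair db label b6gz
    for pair in kegg:kegg uni:uniref uniprot:uniprot; do
        db=${pair%%:*}      # suffix used in the .b6.gz file name
        label=${pair##*:}   # second argument passed to annolookup.py
        b6gz="$faa-vs-$db.b6.gz"
        if [ ! -s "$b6gz" ]; then
            echo "Missing or empty: $b6gz" >&2
            return 1
        fi
        "$ANNOLOOKUP" "$b6gz" "$label" > "$faa-vs-$db.b6+" || return 1
    done
}

# Usage: make_b6plus PROJECTNAME_scaffold.fa.genes.faa
```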

10. Now you’re ready for Data Import

11. After data import

Use the scaff2bin file to bin the project on ggKbase. On the project page, go to Batch Rebinning > Rebin File. Then click on Add file, choose your file, and Start.

IMPORTANT: Since we want to keep public data out of our BLAST database, ask Lily to set the project.blast_status to project.never_blasted!