Data Import into ggKbase
After the data preparation steps are complete, you will have a directory similar to the one below. All files of interest share the prefix BASE_NAME_scaffold_min1000*:
> ls
begin
bt2
BASE_NAME_scaffold.fa
BASE_NAME_scaffold_min1000.16S.cmsearch
BASE_NAME_scaffold_min1000.fa
BASE_NAME_scaffold_min1000.fa.16s
BASE_NAME_scaffold_min1000.fa.genes
BASE_NAME_scaffold_min1000.fa.genes.faa
BASE_NAME_scaffold_min1000.fa.genes.faa-vs-kegg.b6+
BASE_NAME_scaffold_min1000.fa.genes.faa-vs-kegg.b6.gz
BASE_NAME_scaffold_min1000.fa.genes.faa-vs-uni.b6+
BASE_NAME_scaffold_min1000.fa.genes.faa-vs-uni.b6.gz
BASE_NAME_scaffold_min1000.fa.genes.faa-vs-uniprot.b6+
BASE_NAME_scaffold_min1000.fa.genes.faa-vs-uniprot.b6.gz
BASE_NAME_scaffold_min1000.fa.genes.fna
BASE_NAME_scaffold_min1000.fa.trnascan
BASE_NAME_scaffold_min1000.fa.trnascan.fasta
contig.fa
end
log
mapped.log
>> Ask Rohan to move the final import files to assembly.d in the project's directory within /group/banfield/sequences/2020/. This is where they should be imported from.
All the BASE_NAME_scaffold_min1000* files, with their properly formatted names, are used during the ggKbase import steps...
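Before requesting the move to assembly.d, it can help to confirm that the prefixed files are all present. The following is a minimal sketch, not part of the ggKbase tooling: the function name missing_import_files is hypothetical, the suffix list is copied from the example listing above, and treating every listed suffix as required for import is an assumption.

```python
from pathlib import Path

# Suffixes copied from the example directory listing; assuming that all of
# them are required by the import is a guess, not a documented rule.
EXPECTED_SUFFIXES = [
    ".16S.cmsearch",
    ".fa",
    ".fa.16s",
    ".fa.genes",
    ".fa.genes.faa",
    ".fa.genes.faa-vs-kegg.b6+",
    ".fa.genes.faa-vs-kegg.b6.gz",
    ".fa.genes.faa-vs-uni.b6+",
    ".fa.genes.faa-vs-uni.b6.gz",
    ".fa.genes.faa-vs-uniprot.b6+",
    ".fa.genes.faa-vs-uniprot.b6.gz",
    ".fa.genes.fna",
    ".fa.trnascan",
    ".fa.trnascan.fasta",
]


def missing_import_files(directory, base_name):
    """Return the expected file names that are absent from `directory`."""
    prefix = f"{base_name}_scaffold_min1000"
    return [
        prefix + suffix
        for suffix in EXPECTED_SUFFIXES
        if not (Path(directory) / (prefix + suffix)).is_file()
    ]
```

Calling missing_import_files(".", "BASE_NAME") from inside the assembly directory returns an empty list when every file from the listing is present.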
Importance of Sample Metadata
We are trying to pay more attention to metadata associated with each sample. Instead of keeping these details in a spreadsheet, away from the data in ggKbase, we're incorporating it into the system. This is an ongoing discussion - if you think you have some metadata that belongs in ggKbase, submit a ticket.
Project Creation
If you have a small number of projects to create and are a member of the Banfield research network, you can create new projects at http://ggkbase.berkeley.edu/projects/new
Follow the instructions there for filling in the metadata.
If you have many (e.g. more than 3) new projects to create, download the spreadsheet template below and fill it out for your samples. Submit the file to Lily via Slack for import.
NOTE: Exporting as a .tsv from Excel breaks the formatting, so please use Google Sheets to fill in your information.
(The template is shared with UC Berkeley accounts only, so use your .berkeley.edu account.)
To find total basepairs assembled:
On biotite, use
sum-bp *min1000.fa
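If sum-bp is not available (for example, when working off biotite), the total can be computed by summing the sequence line lengths in the FASTA file. The sketch below is not the sum-bp implementation; the function name total_basepairs is hypothetical, and it assumes a plain, uncompressed FASTA file in which every header line starts with ">".

```python
def total_basepairs(fasta_path):
    """Sum the lengths of all sequence lines in a FASTA file."""
    total = 0
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines and header lines; count everything else.
            if line and not line.startswith(">"):
                total += len(line)
    return total
```

For example, total_basepairs("BASE_NAME_scaffold_min1000.fa") returns the number of assembled basepairs in that file.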
To find total read basepairs:
Find the number of reads in the second line of the idba_ud 'log' file. Multiply this number by the read length.
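The calculation is a single multiplication, sketched below with made-up numbers. The function name total_read_basepairs and the example values are hypothetical, and the result is only exact if all reads have the same length. Reading the count out of the log is left as a manual step because the log format is not described here.

```python
def total_read_basepairs(num_reads, read_length):
    """Estimate total read basepairs as number of reads x read length."""
    if num_reads < 0 or read_length <= 0:
        raise ValueError("num_reads must be >= 0 and read_length must be > 0")
    return num_reads * read_length


# Hypothetical values: 20 million reads of 150 bp each.
print(total_read_basepairs(20_000_000, 150))  # 3000000000
```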