Data Import into ggKbase
After the data preparation steps are completed, you will be left with a directory that looks similar to this (note that all files of interest have the common prefix BASE_NAME_scaffold_min1000* ):
> ls begin bt2 BASE_NAME_scaffold.fa BASE_NAME_scaffold_min1000.16S.cmsearch BASE_NAME_scaffold_min1000.fa BASE_NAME_scaffold_min1000.fa.16s BASE_NAME_scaffold_min1000.fa.genes BASE_NAME_scaffold_min1000.fa.genes.faa BASE_NAME_scaffold_min1000.fa.genes.faa-vs-kegg.b6+ BASE_NAME_scaffold_min1000.fa.genes.faa-vs-kegg.b6.gz BASE_NAME_scaffold_min1000.fa.genes.faa-vs-uni.b6+ BASE_NAME_scaffold_min1000.fa.genes.faa-vs-uni.b6.gz BASE_NAME_scaffold_min1000.fa.genes.faa-vs-uniprot.b6+ BASE_NAME_scaffold_min1000.fa.genes.faa-vs-uniprot.b6.gz BASE_NAME_scaffold_min1000.fa.genes.fna BASE_NAME_scaffold_min1000.fa.trnascan BASE_NAME_scaffold_min1000.fa.trnascan.fasta contig.fa end log mapped.log
>> Ask Rohan to move the final import files to assembly.d in the project’s directory within /group/banfield/sequences/2020/. This is where they should be imported from.
All the BASE_NAME_scaffold_min1000* files, with their properly formatted names, are used during the ggKbase import steps…
Importance of Sample Metadata
We are trying to pay more attention to metadata associated with each sample. Instead of keeping these details in a spreadsheet, away from the data in ggKbase, we’re incorporating it into the system. This is an ongoing discussion – if you think you have some metadata that belongs in ggKbase, submit a ticket.
Project Creation
If you have a small number of projects to create and are a member of the Banfield research network, you can create new projects by going here: http://ggkbase.berkeley.edu/projects/new.
Follow the instructions there for filling in the metadata.
If you have many (e.g. >3) new projects to create, please indicate how many and when they will be ready for import on the following form: https://forms.gle/9YesrUYFMGsLrqTv8
When you are ready to import your data, download the spreadsheet template below and fill it out for your samples. Fill out the following form and slack Kaden: https://forms.gle/PUQ6H7Eq4hHwDZoo8
**NOTE: Exporting as a .tsv from excel messes up the formatting, please use Google sheets to fill in your info.
(UCBerkeley shared only, so use your .berkeley.edu account.)
ggkbase_bulk_upload_template_FIXED
To find total basepairs assembled:
On biotite, use
sum-bp *min1000.fa
To find total read basepairs:
Find the number of reads in the second line of the idba_ud ‘log’ file. Multiply this number by the read length.