Overview (6 articles)

This section provides an overview of the organizational structure and key components of ggKbase.

Data Preparation – Metagenome

This post explains what needs to take place before data gets imported into ggKbase.
Step 1: Read File Organization
Once the sample has been sequenced, you need to download and rename the read files for processing & QC.
Download: check with Rohan about where to download the files. A specific directory within /groups/banfield/sequences/2020 must be created for…

Data Preparation – Curated Genome

This post explains what needs to take place to import an already manually curated genome into ggKbase.
How to curate a genome: https://ggkbase-help.berkeley.edu/how-to/genome-curation/
How to curate many genomes: https://ggkbase-help.berkeley.edu/how-to/bulk-genome-curation/
Step 1: Rename Fasta File
Make your curated genome distinct from the original organism in ggKbase by adding “_curated” or “_genome_final” to the end of the basename. Likewise,…
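The renaming rule in Step 1 can be sketched in Python. This is a minimal illustration under stated assumptions, not part of the ggKbase tooling: the helper name `curated_name` and the example file name are hypothetical, and only the two suffixes named in the text are accepted.

```python
from pathlib import Path

# Suffixes named in the text above; anything else is rejected.
ALLOWED_SUFFIXES = ("_curated", "_genome_final")


def curated_name(fasta_path, suffix="_curated"):
    """Return the path with `suffix` appended to the basename, before the extension.

    Hypothetical helper for illustration; it only computes a name and does
    not rename anything on disk.
    """
    if suffix not in ALLOWED_SUFFIXES:
        raise ValueError(f"suffix must be one of {ALLOWED_SUFFIXES}")
    path = Path(fasta_path)
    if path.stem.endswith(ALLOWED_SUFFIXES):
        return path  # already marked as curated; leave unchanged
    return path.with_name(path.stem + suffix + path.suffix)


if __name__ == "__main__":
    # "example_bin_01.fa" is a made-up file name.
    print(curated_name("example_bin_01.fa"))
```

Computing the new name separately from the rename itself makes it easy to review the result before any file is changed.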

Data Structure

The genomic data in ggKbase is organized by the following data types and hierarchy: ProjectGroup, Project, Organism, Contig, Sequence, Feature, Annotation, dbXref.
Project Navigation
You can go to the Projects landing page by clicking on the Projects link on the top menu anywhere on the website. The dropdown menu (shown as a list image) next to…

Data Import

Data Import into ggKbase
After the data preparation steps are completed, you will be left with a directory that looks similar to this (note that all files of interest have the common prefix BASE_NAME_scaffold_min1000*):
> ls
begin
bt2
BASE_NAME_scaffold.fa
BASE_NAME_scaffold_min1000.16S.cmsearch
BASE_NAME_scaffold_min1000.fa
BASE_NAME_scaffold_min1000.fa.16s
BASE_NAME_scaffold_min1000.fa.genes
BASE_NAME_scaffold_min1000.fa.genes.faa
BASE_NAME_scaffold_min1000.fa.genes.faa-vs-kegg.b6+
BASE_NAME_scaffold_min1000.fa.genes.faa-vs-kegg.b6.gz
BASE_NAME_scaffold_min1000.fa.genes.faa-vs-uni.b6+
BASE_NAME_scaffold_min1000.fa.genes.faa-vs-uni.b6.gz
BASE_NAME_scaffold_min1000.fa.genes.faa-vs-uniprot.b6+
BASE_NAME_scaffold_min1000.fa.genes.faa-vs-uniprot.b6.gz
BASE_NAME_scaffold_min1000.fa.genes.fna
BASE_NAME_scaffold_min1000.fa.trnascan
BASE_NAME_scaffold_min1000.fa.trnascan.fasta
contig.fa
end…
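A quick sanity check for such a directory can be sketched in Python. The suffix list is copied from the example listing above. The function name `missing_import_files` is hypothetical, and whether the importer actually requires every one of these files is an assumption, so treat the result as a hint rather than a rule.

```python
from pathlib import Path

# Suffixes copied from the example listing; each is appended to
# "<BASE_NAME>_scaffold_min1000". That all of them are required for
# import is an assumption made for this sketch.
EXPECTED_SUFFIXES = [
    ".16S.cmsearch",
    ".fa",
    ".fa.16s",
    ".fa.genes",
    ".fa.genes.faa",
    ".fa.genes.faa-vs-kegg.b6+",
    ".fa.genes.faa-vs-kegg.b6.gz",
    ".fa.genes.faa-vs-uni.b6+",
    ".fa.genes.faa-vs-uni.b6.gz",
    ".fa.genes.faa-vs-uniprot.b6+",
    ".fa.genes.faa-vs-uniprot.b6.gz",
    ".fa.genes.fna",
    ".fa.trnascan",
    ".fa.trnascan.fasta",
]


def missing_import_files(directory, base_name):
    """Return the expected file names that are absent from `directory`."""
    directory = Path(directory)
    prefix = f"{base_name}_scaffold_min1000"
    return [
        prefix + suffix
        for suffix in EXPECTED_SUFFIXES
        if not (directory / (prefix + suffix)).exists()
    ]


if __name__ == "__main__":
    # "BASE_NAME" is the placeholder used in the listing above.
    for name in missing_import_files(".", "BASE_NAME"):
        print("missing:", name)
```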

Data Storage

Keep track of your disk storage footprint on biotite; usage must be monitored. Storage space is limited: important files should be compressed and intermediate files should be removed. Do not store primary data in your home directory. This storage volume is heavily used and has the ability to impact…
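The two habits described here, measuring your footprint and compressing files you keep, can be sketched with the Python standard library. This is an illustration only; the function names are hypothetical and it does not reflect any site-specific tooling on biotite.

```python
import gzip
import os
import shutil


def directory_size_bytes(root):
    """Sum the apparent sizes of regular files under `root`, skipping symlinks.

    This is the sum of file lengths, which can differ from the number of
    disk blocks actually allocated.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def gzip_in_place(path):
    """Write `path` + ".gz", then delete the uncompressed original."""
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(path)  # only reached if compression finished without error
    return gz_path


if __name__ == "__main__":
    size_gib = directory_size_bytes(".") / 1024**3
    print(f"current directory: {size_gib:.2f} GiB")
```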

Data Naming Convention

Abstract
All the data in ggKbase – projects, organisms, contigs, and features – must have unique names so that every entity can be queried by its name without the need to provide a search scope in the query.
History
Historically, the only kind of data that ggKbase requires to have a unique name is the…
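The uniqueness requirement in the Abstract can be checked before import with a short Python sketch. It assumes that names must be unique within each entity type (for example, no two contigs share a name); the function name `find_duplicate_names` and the sample names are hypothetical.

```python
from collections import Counter


def find_duplicate_names(names_by_type):
    """Map each entity type to the sorted names that occur more than once.

    `names_by_type` is a dict such as {"contig": [...], "organism": [...]}.
    Types without duplicates are left out of the result.
    """
    duplicates = {}
    for entity_type, names in names_by_type.items():
        counts = Counter(names)
        repeated = sorted(name for name, n in counts.items() if n > 1)
        if repeated:
            duplicates[entity_type] = repeated
    return duplicates


if __name__ == "__main__":
    # Made-up names for illustration.
    sample = {
        "organism": ["sampleA_bin_1", "sampleA_bin_2"],
        "contig": ["sampleA_scaffold_1", "sampleA_scaffold_1"],
    }
    print(find_duplicate_names(sample))
```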