Data Download

You may want to download data and visualizations from ggKbase for the following purposes:

  • Store data locally on your computer
  • Analyze data using other analysis programs
  • Share data with colleagues

There are a number of ways that you can download data from ggKbase.

**If you want to download data for many organisms or projects at once, message Lily on Slack to have it done on the backend instead of clicking through all those pages.**

How to Download

To trigger a download, first update the data file by clicking the red exclamation mark, which indicates that the file needs to be generated. The symbol turns into a yellow exclamation mark while the file is being generated. Once the file is generated or updated, the symbol turns into a green check circle, which indicates that the file is ready to download.

Places to Download Data File

Project Information Page

The central location for downloading ggKbase data in its various file types is the project landing (home) page, e.g. http://ggkbase.berkeley.edu/[project-slug]. Instead of typing the project link manually, you can follow the link in the breadcrumb while navigating the project hierarchy:

breadcrumb-project-link

Below is a screenshot of the project page that shows you the instructions for downloading files. The download links are grouped by project level and organism level data files.

Provides links to download project and organism files.

There are two other locations from which you can download data files.

Organisms List Page

The Organisms list page provides a dropdown menu for downloading the project-level data files. Look for the Download dropdown menu, which looks like this:

download-dropdown-project

When downloaded, these files are named as follows:

Contigs(fasta)    14_0903_02_30cm.contigs.fa.gz
Genes(fasta)    14_0903_02_30cm.genes.fna.gz
Proteins(fasta)    14_0903_02_30cm.proteins.faa.gz
16S rRNA (fasta)    14_0903_02_30cm.16S.fna
Organisms (table)   14_0903_02_30cm.organism_info.tsv
Scaffolds to bin (table)   14_0903_02_30cm.scaffolds_to_bin.tsv
Contig taxonomy (table)   14_0903_02_30cm.contig-taxonomy.tsv
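
In the list above, the project-level contig, gene, and protein FASTA files carry a .gz extension, while the 16S and table files do not. The following is a minimal Python sketch, assuming the files were saved under names like those shown; open_download and count_fasta_records are hypothetical helper names, not part of ggKbase:

```python
import gzip
from pathlib import Path


def open_download(path):
    """Open a downloaded file as text, decompressing on the fly if it ends in .gz."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_fasta_records(path):
    """Count sequences by counting the header lines that start with '>'."""
    with open_download(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

For example, count_fasta_records("14_0903_02_30cm.contigs.fa.gz") would report how many contigs the project download contains, without unpacking the file first.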

Examples of the first few lines in each file:

Contigs(fasta)
>14_0903_02_30cm_scaffold_22000 id=35997450 bin="14_0903_02_30cm_UNK"
TCATGCAGCTGACGACAACGCACCCGGCGCGCTGCGCCCAGCGCATCGCCGCGCTCGACC
TGGTCAGCAACGGGCGCGTCGAGTTCGCCACCGGCGAATCTGCCAGCATCACCAATTGAG
>14_0903_02_30cm_scaffold_22010 id=35997460 bin="14_0903_02_30cm_UNK"
GAGCCGGCGCCAGGCGCGTCAGCGCGGCCAGGGTCATTTCCGTTTCGCCGAAGCGCACCG
GTCAGCGGCAGCTCGATGTCGAGAATCAGGTCCAGGTTCGCCGGCGTGTTGGCGGCAGCG

Genes(fasta)
>14_0903_02_30cm_scaffold_22000_1 Reverse transcriptase-RNase H-integrase n=1 Tax=Rhodotorula glutinis (strain ATCC 204091 / IIP 30 / MTCC 1151) RepID=G0SUI9_RHOG2 id=147996927 bin="14_0903_02_30cm_UNK" species=Rhodosporidium toruloides genus=Rhodosporidium taxon_order=Sporidiobolales taxon_class=Microbotryomycetes phylum=Basidiomycota organism_tax=unknown
ATGGCTAAACCAATGAGACCCAGCGAATATGATGGAAAAACTCGTGACGCTCGAACTGTC
GAAGCATGGCTTATTAGAATGACCACGTATTTGACGCTTACTAACACTGCGGACAATCGA
>14_0903_02_30cm_scaffold_22010_4 Cobyrinic acid ac-diamide synthase; K04562 flagellar biosynthesis protein FlhG id=12556372 bin=CNBR_ACIDO species=Holophaga foetida genus=Holophaga taxon_order=Holophagales taxon_class=Holophagae phylum=Acidobacteria tax=CNBR_ACIDO organism_group=Acidobacteria organism_desc=why is coverage listed as 1? id=147997008 bin="14_0903_02_30cm_UNK" species=BJP_IG2102_Syntrophobacterales_60_12 genus=unknown taxon_order=Syntrophobacterales taxon_class=Deltaproteobacteria phylum=Proteobacteria organism_tax=unknown
ATGAGTCCCACCCCCACGTCCCCGCGCCGCCCGATCAGCATCGCCGTCACGAGCGGCAAG
GGGGGCGTTGGCAAGACCAGCGTCGCCGTGAACCTGGCGGTCGCGTTGGCGCGGCTGCGC

Proteins(fasta)
>14_0903_02_30cm_scaffold_22000_1 Reverse transcriptase-RNase H-integrase n=1 Tax=Rhodotorula glutinis (strain ATCC 204091 / IIP 30 / MTCC 1151) RepID=G0SUI9_RHOG2 id=147996927 bin="14_0903_02_30cm_UNK" species=Rhodosporidium toruloides genus=Rhodosporidium taxon_order=Sporidiobolales taxon_class=Microbotryomycetes phylum=Basidiomycota organism_tax=unknown
MAKPMRPSEYDGKTRDARTVEAWLIRMTTYLTLTNTADNRKVELASSYLAGDAFEWYIDN
QTVLLVGTFDGFKTALRDRFVPQNHKSITYSQYKGLTQGNLSISEYSIKFKALADQIPDL
>14_0903_02_30cm_scaffold_22010_3 Tax=RIFCSPLOWO2_12_FULL_Acidobacteria_67_14b_curated id=147997007 bin="14_0903_02_30cm_UNK" species=RIFCSPLOWO2_12_FULL_Acidobacteria_67_14b_curated genus=unknown taxon_order=unknown taxon_class=unknown phylum=Acidobacteria organism_tax=unknown
MRMTPREIDHTERDRLVVAHIGLVKALAHRLAQRLPPQVEIPDLISIGVLGLMDAASRYR
ASLGVPFDAFARRRVQGAMLDALRELDWAPRSLRKLRREPTEEEIAAELNMTPAAYGRSL

16S rRNA (fasta)
>14_0903_02_30cm_scaffold_381542_16S_1 16S ribosomal RNA (16S rRNA) id=149423351 bin="14_0903_02_30cm_UNK" organism_tax=unknown
GAGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGAGGTTGGGTTAAGTCCC
GCAACGAGCGCAACCCTTGCCTTTAGTTGCCATCATTCAGTTGGGCACTCTAAAGGGACT
>14_0903_02_30cm_scaffold_383665_16S_1 16S ribosomal RNA (16S rRNA) id=149423352 bin="14_0903_02_30cm_UNK" organism_tax=unknown
AGGCCCCTAAGGAGTGACTGGTGACTGGGGTGAAGTCGTAACAAGGTAGCCGTAGGGGAA
AAGGCCCCTAAGGAGTGACTGGTGACTGGGGTGAAGTCGTAACAAGGTAGCCGTAGGGGA
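
The FASTA headers above carry the ggKbase id and bin as key=value tokens after the sequence name. The sketch below shows one way to pull them out in Python. It assumes, based only on the sample headers, that the bin is always double-quoted and that when annotation text itself contains id= or bin= tokens, the last occurrence is the one that belongs to the record. parse_header is a hypothetical helper name.

```python
import re

ID_RE = re.compile(r'\bid=(\d+)')
BIN_RE = re.compile(r'\bbin="([^"]*)"')


def parse_header(line):
    """Split a ggKbase-style FASTA header into name, id, and bin.

    Annotation text can itself contain id=/bin= tokens, so the last
    match of each pattern is used (an assumption based on the samples).
    """
    text = line[1:].strip() if line.startswith(">") else line.strip()
    name = text.split(None, 1)[0]
    ids = ID_RE.findall(text)
    bins = BIN_RE.findall(text)
    return {
        "name": name,
        "id": int(ids[-1]) if ids else None,
        "bin": bins[-1] if bins else None,
    }


header = '>14_0903_02_30cm_scaffold_22000 id=35997450 bin="14_0903_02_30cm_UNK"'
print(parse_header(header))
```

The other header fields (species=, genus=, and so on) hold unquoted values that may contain spaces, so this sketch deliberately leaves them unparsed.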

Organisms (table)
name code taxonomy description bin length GC% coverage # contigs # features longest contig RP Inventory (total: 55) RP multiple BSCG Inventory (total: 51) BSCG multiple ASCG Inventory (total: 38) ASCG multiple curation status completion status
14_0903_02_30cm_UNK 14_0903_02_30cm_UNK unknown 832801812 62.53 5.11 458951 1171925 98379 55 55 51 51 35 32 Uncurated genome megabin
14_0903_02_30cm_Sphingomonadales_156_68_15 14_0903_02_30cm_Sphingomonadales_68_15 Sphingomonadales, Alphaproteobacteria, Proteobacteria, Bacteria 3028871 67.56 14.51 3 3046 2650155 52 1 51 0 12 0 Uncurated genome near complete

Scaffolds to bin (table)
scaffold_name   bin     organism taxonomy
14_0903_02_30cm_scaffold_22000  14_0903_02_30cm_UNK     unknown
14_0903_02_30cm_scaffold_2884 14_0903_02_30cm_Solirubrobacterales_J_71_8 Actinobacteria, Actinobacteria, Bacteria
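
A common use of the scaffolds-to-bin table is to look up which scaffolds belong to each bin. The following Python sketch assumes the file is tab-delimited with the header names shown above; scaffolds_by_bin is a hypothetical helper name, and the inline sample simply mirrors the two rows above.

```python
import csv
import io
from collections import defaultdict


def scaffolds_by_bin(handle):
    """Map each bin name to the list of scaffolds assigned to it."""
    reader = csv.DictReader(handle, delimiter="\t")
    bins = defaultdict(list)
    for row in reader:
        bins[row["bin"]].append(row["scaffold_name"])
    return dict(bins)


# Inline sample mirroring the rows above; a real run would pass an open file.
sample = io.StringIO(
    "scaffold_name\tbin\torganism taxonomy\n"
    "14_0903_02_30cm_scaffold_22000\t14_0903_02_30cm_UNK\tunknown\n"
    "14_0903_02_30cm_scaffold_2884\t14_0903_02_30cm_Solirubrobacterales_J_71_8\t"
    "Actinobacteria, Actinobacteria, Bacteria\n"
)
print(scaffolds_by_bin(sample))
```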

Contig taxonomy (table)
Contig name     Size (bp)       Coverage        GC %    Taxonomy winner Winner %        Species winner  Species winner %        Genus winner    Genus winner %  Order winner    Order winner %  Class winner    Class winner %  Phylum winner   Phylum winner % Domain winner   Domain winner %
14_0903_02_30cm_scaffold_22000  5297    18.84   46.65   Rhodosporidium toruloides       1.0     Rhodosporidium toruloides       1.0     Rhodosporidium  1.0     Sporidiobolales 1.0     Microbotryomycetes      1.0     Basidiomycota   1.0     Fungi   1.0
14_0903_02_30cm_scaffold_22001  10189   7.2     68.81   Bacteria        1.0     RIFCSPLOWO2_12_FULL_RIF_CHLX_71_12_curated      0.33    unknown 0.75    unknown 0.83    unknown 0.75    Chloroflexi     0.33    Bacteria        1.0
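
In the contig taxonomy table, each rank's "winner" column is followed by a "winner %" column. The sketch below filters contigs by that value in Python. It assumes the file is tab-delimited and that the "%" columns hold fractions between 0 and 1, as in the sample rows; confident_contigs and its default threshold are hypothetical, and the inline sample is trimmed to three of the columns for brevity.

```python
import csv
import io


def confident_contigs(handle, column="Phylum winner", min_fraction=0.5):
    """Yield (contig, winner) pairs whose winner fraction meets the threshold.

    Assumes each "<rank> winner" column is paired with a "<rank> winner %"
    column holding a fraction between 0 and 1, as in the sample rows.
    """
    for row in csv.DictReader(handle, delimiter="\t"):
        if float(row[column + " %"]) >= min_fraction:
            yield row["Contig name"], row[column]


# Trimmed sample: only three of the table's columns are included here.
sample = io.StringIO(
    "Contig name\tPhylum winner\tPhylum winner %\n"
    "14_0903_02_30cm_scaffold_22000\tBasidiomycota\t1.0\n"
    "14_0903_02_30cm_scaffold_22001\tChloroflexi\t0.33\n"
)
for contig, phylum in confident_contigs(sample):
    print(contig, phylum)
```

Note that the first pair of columns ("Taxonomy winner" and "Winner %") does not follow the same naming pattern, so it would need its own lookup.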

Individual Organism Page

Likewise, the individual Organism landing page provides a dropdown list of download links, in this case for that organism’s files only.

download-dropdown-organism

When downloaded, the files are named as follows:

Contigs(fasta)   14_0903_02_30cm_Euryarchaeota_215_64_24.contigs.fa
Genes(fasta)   14_0903_02_30cm_Euryarchaeota_215_64_24.genes.fna
Proteins(fasta)   14_0903_02_30cm_Euryarchaeota_215_64_24.proteins.faa
Contig taxonomy (table)   14_0903_02_30cm_Euryarchaeota_215_64_24.contig-taxonomy.tsv
Features (table)   14_0903_02_30cm_Euryarchaeota_215_64_24.ql
Genbank   14_0903_02_30cm_Euryarchaeota_215_64_24.gbk

Examples of the first few lines in each file:

Contigs(fasta)
>14_0903_02_30cm_scaffold_22369 id=35997819 bin="14_0903_02_30cm_Euryarchaeota_215_64_24"
CCCCTTATGTGAATGACTACGCCTTCCTCTGGGAGAGCGACCGAGCGACAGGGATTTCCG
CCGGGCTCCCGGCGGGATCGTGCCACGTCGTGCGCCATCCGTTCGACCCCTTTTTCCTCG
>14_0903_02_30cm_scaffold_4098 id=35998548 bin="14_0903_02_30cm_Euryarchaeota_215_64_24"
ATCCACGTGCTCGCTCACGTCGCCTTGGACACCGATCACGCCGATTCGCATCGGTCGACT
TAGGGCGGCAGCCGATTAAAACGATTTGGGGACGCCCGGTCGTTCAGCTGTTCTCGCGGT

Genes(fasta)
>14_0903_02_30cm_scaffold_22369_1 hypothetical protein n=1 Tax=Rhodocyclaceae bacterium RZ94 RepID=UPI00037C713E id=148003626 bin="14_0903_02_30cm_Euryarchaeota_215_64_24" organism_tax=Euryarchaeota, Archaea
CCTTATGTGAATGACTACGCCTTCCTCTGGGAGAGCGACCGAGCGACAGGGATTTCCGCC
GGGCTCCCGGCGGGATCGTGCCACGTCGTGCGCCATCCGTTCGACCCCTTTTTCCTCGAT
>14_0903_02_30cm_scaffold_22369_2 hypothetical protein id=148003627 bin="14_0903_02_30cm_Euryarchaeota_215_64_24" organism_tax=Euryarchaeota, Archaea
ATGGTTCCCTCCGAACGACCTCCCGCGGCCGGTGCGACCTTCCCGTACATCGGCCTCGCG
GTGGCCGTCCTCGCACTGTATGCGATCCTCGCGGTCACGATGCCTCTGAATCCCTATCGG

Proteins(fasta)
>14_0903_02_30cm_scaffold_22369_1 hypothetical protein n=1 Tax=Rhodocyclaceae bacterium RZ94 RepID=UPI00037C713E id=148003626 bin="14_0903_02_30cm_Euryarchaeota_215_64_24" organism_tax=Euryarchaeota, Archaea
PYVNDYAFLWESDRATGISAGLPAGSCHVVRHPFDPFFLDRRGAGLGSRLSGAFGPVPRR
VIFSGPPEATRGSRDVVQLPSVLPRDPPTQVVLLLRDGRFPAPVVKRKRIGVHEVLAVHG
>14_0903_02_30cm_scaffold_22369_2 hypothetical protein id=148003627 bin="14_0903_02_30cm_Euryarchaeota_215_64_24" organism_tax=Euryarchaeota, Archaea
MVPSERPPAAGATFPYIGLAVAVLALYAILAVTMPLNPYRAAVALVAFFAMGYCTLGLVA
GGRIPMSVAEILAFTVGLTILITALSALAVSIVGIPITEFAVVIVGLPLAVIAFLLRRPA

Contig taxonomy (table)
Contig name     Size (bp)       Coverage        GC %    Taxonomy winner Winner %        Species winner  Species winner %        Genus winner    Genus winner %  Order winner    Order winner %  Class winner    Class winner %  Phylum winner   Phylum winner % Domain winner   Domain winner %
14_0903_02_30cm_scaffold_4098   12145   17.44   60.35   Archaea 0.91    RBG_19FT_COMBO_Euryarchaeota_69_17_curated 0.27    unknown 0.55    unknown 0.55    unknown 0.55    Euryarchaeota   0.45    Archaea 0.91
14_0903_02_30cm_scaffold_4370   11798   30.33   64.66   Euryarchaeota   0.94    RBG_19FT_COMBO_Euryarchaeota_69_17_curated      0.47    unknown 1.0     unknown 1.0     unknown 1.0     Euryarchaeota   0.94    Archaea 0.94



Features (table)
14_0903_02_30cm_scaffold_22369_1        14_0903_02_30cm_scaffold_22369  1       5248    63.99   21.49   11 3       704     u       5.2e-09 69      hypothetical protein n=1 Tax=Rhodocyclaceae bacterium RZ94 RepID=UPI00037C713E  unknown smd:Smed_5686;UniRef100_UPI00037C713E
14_0903_02_30cm_scaffold_22369_4        14_0903_02_30cm_scaffold_22369  4       5248    63.99   21.49   11 3662    4219    c       6.6e-37 162     Tax=RBG_16_Euryarchaeota_68_13_curated  RBG_16_Euryarchaeota_68_13_curated, unknown, unknown, unknown, Euryarchaeota, Archaea   86939202

Genbank
LOCUS       14_0903_02_30cm_scaffold_22369       5248 bp    DNA     linear   BCT MAR-29-2017
DEFINITION  14_0903_02_30cm_scaffold_22369, contig.
ACCESSION   14_0903_02_30cm_scaffold_22369
VERSION     14_0903_02_30cm_scaffold_22369.1  GI:
KEYWORDS    .
SOURCE      14_0903_02_30cm_Euryarchaeota_215_64_24
    ORGANISM  14_0903_02_30cm_Euryarchaeota_215_64_24
         
COMMENT     Data sourced from ggkbase. For additional details see
            http://ggkbase.berkeley.edu/organisms/47637
FEATURES             Location/Qualifiers
     source          1..5248
                     /organism="14_0903_02_30cm_Euryarchaeota_215_64_24"
                     /mol_type="genomic DNA"
     gene            <3..704
                     /locus_tag="14_0903_02_30cm_scaffold_22369_1"
     CDS             <3..704
                     /product="hypothetical protein n=1 Tax=Rhodocyclaceae
                     bacterium RZ94 RepID=UPI00037C713E"
                     /codon_start=3
                     /transl_table=11
                     /translation="PYVNDYAFLWESDRATGISAGLPAGSCHVVRHPFDPFFLDRRGA
                     GLGSRLSGAFGPVPRRVIFSGPPEATRGSRDVVQLPSVLPRDPPTQVVLLLRDGRFPA
                     PVVKRKRIGVHEVLAVHGLITREELRAIYRTSHVAVFPYRFVRTGLPLVVLEAVAAGL
                     PVVTTRIHPIRELEGRTGLVFARPRDPPDIARAIESAFDDAQRAAVVRKNDEWIRTTP
                     DWSTVAKNFVSFVRR*"
                     /db_xref="smd:Smed_5686"
                     /db_xref="UniRef100_UPI00037C713E"
     gene            770..1669
                     /locus_tag="14_0903_02_30cm_scaffold_22369_2"
     CDS             770..1669
                     /product="hypothetical protein"
                     /codon_start=1
                     /transl_table=11
                     /translation="MVPSERPPAAGATFPYIGLAVAVLALYAILAVTMPLNPYRAAVA
                     LVAFFAMGYCTLGLVAGGRI
                     /db_xref="86935684"
     gene            complement(4247..4693)
                     /locus_tag="14_0903_02_30cm_scaffold_22369_5"
     CDS             complement(4247..4693)
                     /product="Tax=RBG_16_Euryarchaeota_68_12_curated"
                     /codon_start=1
                     /transl_table=11
                     /translation="MGRESSDALEQAALSFLSGIDPDLGLDAALFVRRTGLVLASWMR
                     EGIRLDVVSVMAATMLASVDTIIESVGGPTPEVISVDTDAHQILATKVNSRAFLVVIA
                     PKKVSRTVVRKTMRGLNARLAAAASKSTHLHVEETEKQRVNVRPPR*"
                     /db_xref="86935684"
ORIGIN
        1 CCCCTTATGT GAATGACTAC GCCTTCCTCT GGGAGAGCGA CCGAGCGACA GGGATTTCCG
       61 CCGGGCTCCC GGCGGGATCG TGCCACGTCG TGCGCCATCC GTTCGACCCC TTTTTCCTCG
      121 ATCGCAGGGG CGCGGGCCTC GGATCCCGAC TGTCCGGGGC CTTCGGACCC GTTCCTCGGC
      181 GCGTCATCTT CTCGGGGCCT CCCGAGGCCA CTCGGGGGAG CCGCGACGTG GTCCAGCTTC
      241 CGAGCGTCCT TCCCCGGGAC CCTCCGACGC AGGTCGTCCT GCTCCTGCGT GACGGGCGAT
      301 TTCCCGCCCC GGTCGTCAAG CGGAAGCGGA TCGGCGTCCA CGAGGTCCTC GCCGTCCACG
      361 GCCTGATCAC GCGGGAGGAG TTGCGCGCGA TCTACCGCAC ATCCC

Note: the Features table for a given organism contains the following columns:

feature_ID, contig_ID, feature_number, contig_length, GC%, coverage, codon table, winning_taxonomy_level, begin_position, end_position, complement (uncomplemented or complemented), E-value, bit_score, value_of_annotation

For example, to download the 16S sequence for a specific organism, click “rRNA” for that organism, then click the feature, which can then be downloaded.

Why are some files *.fa and others *.fna or *.fasta?

On biotite and ggKbase:

.fa or .fasta – contig or genome sequences

.fna – gene sequences

.faa – amino acid (protein) sequences
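
The extension convention above can be captured in a small lookup. This is a sketch in Python, not anything ggKbase provides: sequence_kind is a hypothetical helper name, and stripping a trailing .gz is an assumption based on the project-level file names shown earlier.

```python
from pathlib import Path

SEQUENCE_KINDS = {
    ".fa": "contig or genome sequences",
    ".fasta": "contig or genome sequences",
    ".fna": "gene sequences",
    ".faa": "protein sequences",
}


def sequence_kind(filename):
    """Return the content type implied by a FASTA extension, or None."""
    suffixes = Path(filename).suffixes
    # Ignore a trailing .gz so compressed downloads classify the same way.
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]
    return SEQUENCE_KINDS.get(suffixes[-1]) if suffixes else None


print(sequence_kind("14_0903_02_30cm.contigs.fa.gz"))
print(sequence_kind("14_0903_02_30cm.proteins.faa.gz"))
```

Files that are not FASTA, such as the .tsv tables or the .gbk Genbank file, return None.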