Data Preparation – Metagenome

This post explains what needs to take place before data gets imported into ggKbase.

Step 1: Sample/Project Creation

A project in ggKbase is directly related to a sample taken in the field etc.  Projects are further organized into Project Groups, which roughly represent a study or site. For example, you have many sediment samples from a location and these samples can be different conditions, and/or they can be technical replicates etc.  During project creation on the ggkbase website (, you will be asked to select an existing Project Group or to define a new one.

*Project creation is available*  You will need the following information to complete a new project.

  1. Project name: corresponds to a sample name; can be free form but should be kept short
  2. Project slug: part of the URL; used to find your project on using a web link;  should be short and similar to your project name, but only containing A-Za-z0-9_- (i.e. no spaces or other non-web characters)
  3. Project type: can be either "binning" or "analysis"
  4. Project Group: represents the overall group this project belongs in; can either exist or can be created at this time; if this is a new project group, you'll need to define the ecosystem
  5. Sequencing Facility: short description of sequencing center or facility where the sample was sequenced
  6. Location: where was the sample taken; use existing location, or enter the closest city to where sample was taken
  7. Date of sample; when was this sample taken?
  8. Read file(s): location on our server where the read or reads are located
  9. Read processing: how were the reads processed; short description
  10. Read length: length of one read, pre-trimming
  11. Total read bp:  number of bp of reads going into the assembly
  12. Total assembled sequence: bp of sequence that resulted after assembly
  13. Assembly type: what did you use to assemble the reads; e.g. "IDBA_UD"
  14. Path to assembly: important field with the location on our server of the output assembly and annotation files, as well as the file basename of all the generated output
  15. Members: other members of ggkbase that should be added to the project on import

Step 2: Read File Processing

Once the sample has been sequenced, you need to download and process the read files.  If the data comes from JGI, it will usually already have been shuffled so forward/reverse reads are present in the same file. If not from JGI, the data are usually still separate, and in fact, they are usually still in non-concatenated chunks.

For the downloading step, be sure to check with BCT about where to download the files: a specific directory (or directories) must be created for the projects.  Reads are downloaded into a directory called "raw.d" and you should follow this model.  Similarly, there will be a directory for the assembled data called "assembly.d" but we'll discuss that later. In raw.d, create subdirectories for each sample.

Processing reads are from JGI

First, uncompress and unshuffle the reads file using -h
splits a paired-end fastq file into \1 and \2 reads
Usage: paired_reads.fastq output_prefix
paired_reads.fastq may be any list of files to be sent to cat, if contained in quotes and separated by spaces
(input can be gzip compressed - filename must end in .gz

For example, if your reads file is called "CZBC.6181.4.39404.fastq.gz", you would do this...

> gunzip CZBC.6181.4.39404.fastq.gz
> CZBC.6181.4.39404.fastq CZBC.6181.4.39404

This will create two new files...

> ls

At this point, the original read file can be removed.  Now proceed with the remaining steps below.

Processing reads from other facilities

By default, the Illumina platform will generate "chunks" of reads with 4 million reads per file. Your files may look something like this...

> ls
JBBB016D_TGACCA_L001_R1_001.fastq.gz  JBBB016D_TGACCA_L001_R2_001.fastq.gz
JBBB016D_TGACCA_L001_R1_002.fastq.gz  JBBB016D_TGACCA_L001_R2_002.fastq.gz
JBBB016D_TGACCA_L001_R1_003.fastq.gz  JBBB016D_TGACCA_L001_R2_003.fastq.gz
JBBB016D_TGACCA_L001_R1_004.fastq.gz  JBBB016D_TGACCA_L001_R2_004.fastq.gz

To process these files, they need to be uncompressed (for later) and combined. This can be done using the zcat command:

# process forward (R1) reads
> cat JBBB016D_TGACCA_L001_R1_001.fastq.gz JBBB016D_TGACCA_L001_R1_002.fastq.gz JBBB016D_TGACCA_L001_R1_003.fastq.gz JBBB016D_TGACCA_L001_R1_004.fastq.gz > JBBB016D_TGACCA_L001_R1.fastq.gz

# now process reverse (R2) reads
> cat JBBB016D_TGACCA_L001_R2_001.fastq.gz JBBB016D_TGACCA_L001_R2_002.fastq.gz JBBB016D_TGACCA_L001_R2_003.fastq.gz JBBB016D_TGACCA_L001_R2_004.fastq.gz > JBBB016D_TGACCA_L001_R2.fastq.gz

Now you need to link the read file information to your ggkbase project (link) for each sample/project. Upon doing this, you're ready to QC your reads.

Rename your files so that they look like this (matching basename with .1. and .2. for forward and reverse reads):



Processing reads containing human DNA "contaminants"

If your reads are generated from human microbiome samples, they need to be screened to remove any human DNA fragments.  This can be done with process_reads_bbmap.rb using the -m and -d options.

Assessing the Quality of Your Reads

Use the fastqc command in biotite and visualize the html file to assess the quality of your reads both before and after process_reads_bbmap.rb. For example:

/opt/bin/bio/FastQC/fastqc -o output_dir -f fastq seqfile.1.fastq.gz seqfile.2.fastq.gz

Step 3: Read QC

Read quality control involves 3 steps: removing Illumina adapters (BBTools), removing PhiX and other Illumina trace contaminants (BBTools), and sequence quality trimming (Sickle).  After this, a fasta-formatted file containing the remaining reads is needed for the assembly step.

All of these steps can be done using the process_reads_bbmap.rb script on

process_reads_bbmap.rb -h
  process_reads_bbmap.rb: Processes JGI reads by trimming adapters and
                  filtering for PhiX/Illumina contaminants and
                  ends with sickle quality trimming the reads;
                  can also create a fasta-formatted, shuffled
                  output ready for assembly

  process_reads_bbmap.rb --basename=<reads base file name; can be shuffled or separate>

  -b, --basename=<s>                   read file(s) base name
  -p, --processors=<i>                 number of CPU cores to use (default: 6)
  -r, --run-sickle, --no-run-sickle    run sickle quality trimmer (default: true)
  -k, --sickle-flag=<s>                if running sickle, set type flag (default: sanger)
  -c, --create-fa                      create fasta file for assembly
  -m, --mask-human                     remove reads mapping to humans (only for microbiomes)
  -d, --humandb=<s>                    database to use for human masking (default: )
  -s, --shuffled                       input is already shuffled
  -z, --zipped                         input is zipped
  -d, --dry-run                        Don't actually run anything, just report
  -v, --verbose                        show extra progress information
  -e, --version                        Print version and exit
  -h, --help                           Show this message

Usually default settings are fine, but be aware that BBTools and Sickle have lots of customizable parameters.  An example processing JGI read files is below. Note that if you have BOTH the unshuffled JGI downloaded file, and the output from step above, process_reads_bbmap.rb will find the unshuffled file and process it. You should therefore delete the original file downloaded from JGI after you have it un-shuffled...

> ls
> process_reads_bbmap.rb --basename=8871.5.114481.CGTACG -c

Note that this script will determine many of the options for you (e.g. zipped or not, shuffled or not, etc).  The dry-run option will show you exactly what would be run, without actually running the programs.  After running this program, your directory should look like this...

> ls
8871.5.114481.CGTACG.1.fastq.gz		    8871.5.114481.CGTACG_trim_clean.2.fastq.gz
8871.5.114481.CGTACG.2.fastq.gz		    8871.5.114481.CGTACG_trim_clean.PE.1.fastq.gz
8871.5.114481.CGTACG_trim.1.fastq.gz	    8871.5.114481.CGTACG_trim_clean.PE.2.fastq.gz
8871.5.114481.CGTACG_trim.2.fastq.gz	    8871.5.114481.CGTACG_trim_clean.PE.fa
8871.5.114481.CGTACG_trim_clean.1.fastq.gz  8871.5.114481.CGTACG_trim_clean.SR.fastq.gz

Note that the original file is gone (replaced by the unshuffled versions), the trimmed, cleaned, and shuffled fasta file is built (8871.5.114481.CGTACG_trim_clean.PE.fa) and read for the assembly stage, and all read files have been compressed with gzip.  This process can take more than an hour to complete.

Step 4: Sequence Assembly

You are now ready for assembly.  It is common to use idba_ud for metagenome assemblies, but this is not a requirement - there are other assemblers out there.  Idba_ud is the only assembler that includes a scaffolding step by default.

If an assembly is destined for import into ggKbase (and they all should be), there are certain requirements regarding the format of the assembler output:

  • Decide on a minimum length of contigs/scaffolds to be imported, e.g. 500bp, 1000bp, or 2000bp.  Smaller than 1000bp is not recommended because the gene prediction quality goes down.
  • Make sure your contigs are reasonably named. Use only alphanumeric characters and underscores. In order to ensure unique names within ggKbase, contigs/scaffolds need a project specific prefix, e.g. ProjectX_scaffold_1.
  • Every contig needs a read count or coverage value.  This can be done by read mapping the reads that were used in the assembly back to the output of the assembler.
    • Determine per-contig/scaffold read count using bowtie2 mapping (see Step 5)
    • Add the read count value to each contig/scaffold header line

IDBA_UD Assembly:
Default settings in idba_ud are normally fine for metagenome data sets. It is a good idea to use the "--pre_correction" option to normalize highly-represented kmers in the kmer graph. If you're assembling on the Banfield cluster, remember that each node only has 256Gb of system memory. This will limit the size of the assembly you can attempt, since idba_ud has a substantial memory requirement.  (For assemblies with significant numbers of reads (>100 million), contact BCT about running the assembly on a bigger server.)  Note that the number of CPU cores to use during the run is limited to 6 on biotite. On other machines, you should adjust accordingly.

A typical idab_ud run, using all the reads for a sample, might look like the following...

> idba_ud --pre_correction -r 8871.5.114481.CGTACG_trim_clean.PE.fa -o idba_ud_full

If you are running a subassembly (i.e. using only a subset of the reads for a sample), name the output directory accordingly...

> idba_ud --pre_correction -r 8871.5.114481.CGTACG_trim_clean.PE.sub10.fa -o idba_ud_sub10

After idba_ud has finished, your output will look something like this...

> ls
align-100  begin          contig-80.fa  graph-40.fa         local-contig-40.fa
align-20   contig-100.fa  contig.fa     graph-60.fa         local-contig-60.fa
align-40   contig-20.fa   end           graph-80.fa         local-contig-80.fa
align-60   contig-40.fa   graph-100.fa  kmer                log
align-80   contig-60.fa   graph-20.fa   local-contig-20.fa  scaffold.fa

Most of this output can (and should) be removed. Remove the unnecessary files with this command:

rm kmer contig-* align-* graph-* local-*

   Data Storage Tip: This will free up a lot of disk space.

Now you need to customize the headers on the scaffold.fa output from idba_ud. By default it will have scaffold names like "scaffold_1" etc.  We need to change this to something sample/project specific. For example, "RBG42_scaffold", or "Gut1_time4_scaffold".  A fast way to change every header in a fasta file is by using the unix sed command...

sed 's/scaffold/RBG42_scaffold/' scaffold.fa > RBG42_scaffold.fa

Note how this customizes each header line, but also create an appropriately-named output file too. All further steps will rely on this naming scheme.

Step 5: Read Mapping for Coverage Calculation

Each contig/scaffold output from the assembly needs a read count appended to the header line as well as a read length.  The normal convention for ggKbase import is to add this information on the header line of each contig/scaffold from the assembly...

>CODE_scaffold_32 read_length_150 read_count_13456

To calculate the read count, you can use bowtie2 with the default settings.  This will map every read to its location on a contig (if it was used in the assembly).  The output of bowtie2 will be a SAM file. This file shows the location where each read maps.  You should generate this file for every assembly - it can be useful later as well.

Note that each SAM file will list every read... EVEN IF IT DOESN'T MAP. That means if you have 100 million reads that went into your assembly, you will have a SAM file with at least 100 million lines. You should run all bowtie2-created SAM files through the shrinksam command to remove the unmapping reads and generate a SAM file with only the reads that map to a contig/scaffold in the assembly output. This will save lots of disk space!

Here is how you should run your read mapping:

> mkdir bt2
> bowtie2-build RBG42_scaffold.fa bt2/RBG42_scaffold.fa
> bowtie2 -x bt2/RBG42_scaffold.fa -1 /path/to/RBG42_trim_clean.PE.1.fastq.gz -2 /path/to/RBG42_trim_clean.PE.2.fastq.gz 2> RBG42_scaffold_mapped.log | shrinksam -v > RBG42_scaffold_mapped.sam
> add_read_count.rb RBG42_scaffold_mapped.sam RBG42_scaffold.fa 150 > RBG42_scaffold.fa.counted
> mv RBG42_scaffold.fa.counted RBG42_scaffold.fa

What this does is

  1. create a bt2 dir for storing the index
  2. create the bt2 index
  3. run bowtie2 mapping using the forward and reverse reads for the sample
  4. add_read_count.rb creates a new fasta file with modified headers and this example assumes 150bp read length - output is redirected into a temporary file
  5. replace the original scaffold file with the header-modified temporary file.

When your task is complete, you can generate some useful statistics on your assembly output using the script on biotite. It will tell you the assembly details, like N50, max contig length etc. and give you a nice size breakdown of your contigs/scaffolds.  You can run it like so...

> -i RBG42_scaffold.fa

It will generate a file called "RBG42_scaffold.fa.summary.txt". At this point, you're ready for annotating your assembly.

(Note: on biotite will run the assembly, rename the output scaffolds, run the bowtie2 mapping and add the read count header information. You can use this script instead of steps 4 and 5 above.)

Step 4/5 alternate path using

The wrapper script, can be used to complete steps 4 and 5 with one command.

> -h
usage: [-h] -b BASENAME [-l READ_LENGTH] [-m MINLENGTH]
                    [-p PROCESSORS] [-n] [-q] [-v]
                    reads_file fwd rev

Wraps running IDBA_UD and pre-, post-processing steps

positional arguments:
  reads_file            shuffle reads in fasta
  fwd                   forward reads for mapping in fastq
  rev                   reverse reads for mapping in fastq

optional arguments:
  -h, --help            show this help message and exit
  -b BASENAME, --basename BASENAME
                        name to use as file base
  -l READ_LENGTH, --read_length READ_LENGTH
                        read length (default: 150)
  -m MINLENGTH, --minlength MINLENGTH
                        contig minimum length cutoff (default: 1000)
  -p PROCESSORS, --processors PROCESSORS
                        number of processors to use (default: 6)
  -n, --dry-run         Don't actually do anything
  -q, --queue           Submit job to queue (default: False)
  -v, --verbose

To run this script, you must give four key parameters:

  1. a base name for the project; this will be your project name most likely
  2. the fasta file containing the shuffled reads
  3. a fastq file containing just the forward reads (can be gzip'd)
  4. a fastq file containing just the reverse reads (can be gzip'd)

Additionally, if you are running this assembly locally on biotite (e.g. if you need the extra memory present on biotite), your assembly will be limited to 6 cores. If you give the "-q" argument, you will have access to more cores (48 max), but you will have less memory (each  node has 256Gb of system memory).  Therefore, if you have a lot of reads, e.g. greater than about 100 million,  you should run on biotite and contact BCT for advice.  If you just want to see the commands that this script will run, but not actually run them, use the "--dry-run" parameter.  An example run would look as follows...

> 31_003.PE.fa 31_003.PE.1.fastq.gz 31_003.PE.2.fastq.gz -b 31test -p 48 -q

When the process is complete, the output directory will look like this...

> ls 31test_idba_ud/
31test_mapped.sam           begin                       end                         scaffold.fa
31test_scaffold.fa          bt2/                        log                         
31test_scaffold_min1000.fa  contig.fa                   mapped.log

You're now ready to continue on with gene prediction.

   Data Storage Tip: Keep begin, end, log/mapped.log, and min1000.fa files while removing others.

Step 6: Gene and Small RNA Prediction

NOTE: Steps 6 and 7 will eventually be contained in master script.  For now they must be run manually

For annotating the assembly output, we need to predict genes, small RNAs, and then functionally annotate the predicted open reading frames.  We use the Prodigal for gene prediction (and we are going to be adding more soon).  Predicting small RNAs uses an in-house system developed by Chris Brown to find 16s rRNAs and we use tRNAscan-SE for prediction of tRNA genes.

Before we get started with predicting genes, remember that we need to size filter our contigs/scaffolds. By default, we use a minimum size cutoff of 1000bp.  Filtering your contigs/scaffolds is easy using pullseq:

> pullseq -i RBG42_scaffold.fa -m 1000 > RBG42_scaffold_min1000.fa

You should follow this naming scheme (i.e. "*_min1000.fa" for everything larger than 1000bp).

NOTE: if you used above to create  your assembly, then it will have already created the 1000bp minimum file for you.

Running prodigal uses the size-filtered contigs/scaffolds and is done as follows:

> prodigal -i RBG42_scaffold_min1000.fa -o RBG42_scaffold_min1000.fa.genes -a RBG42_scaffold_min1000.fa.genes.faa -d RBG42_scaffold_min1000.fa.genes.fna -m -p meta

This will prevent prodigal from making predictions that span "N"-containing stretches and will generate 3 output files: the gene predictions in GFF format; the corresponding protein sequences; the corresponding DNA sequences.

Finding the 16S rRNA genes on the contigs/scaffolds is done using the shell wrapper script which runs Chris Brown's script using the default settings and the database.  This shell script is run as follows:

> RBG42_scaffold_min1000.fa > RBG42_scaffold_min1000.fa.16s

You should follow this naming scheme.

tRNA genes can be predicted using the tRNAscanSE wrapper script:

> trnascan_pusher.rb -i RBG42_scaffold_min1000.fa > /dev/null 2>&1

(The reason for the redirection to /dev/null is that the tRNAscanSE perl script causes a ton of warnings from the perl interpreter due to coding style.  These don't affect the output, but are obnoxious nonetheless. Redirecting to /dev/null will hide them from your terminal window.)

Step 7: Annotation

We annotate the predicted proteins by doing similarity searches using usearch, comparing each protein sequence against the following databases:

  • KEGG (curated database with excellent metabolic pathways metadata)
  • UniRef100 (curated dataset derived from UniProt; provides functional and taxonomic information)
  • UniProt (a comprehensive, non-redundant database derived from numerous sources)

For KEGG and UniRef100, we do a forward we occasionally do a reverse search to identify reciprocal best hits.  UniProt searches are only done in the forward direction (due to the size of the database).  Note reverse searches are discouraged due to the large resource requirement.

For some datasets, we also run Motif searches using iprscan. It is not part of the standard annotation pipeline because of long runtime.

Annotation searches can be done on the cluster using the cluster_usearch_wrev.rb script on biotite. This will create a usearch database of the proteins (if doing a reverse search), and submit 3 jobs per sample to the cluster queue (one for each database).  It may take a while to complete, depending on the size of you input database and the status of the queue.  You can monitor if you have any jobs still running by typing qstat. Below is a typical set of commands to run the one-way search against the KEGG, UniRef100 databases, and UniProt:

> cluster_usearch_wrev.rb -i RBG42_scaffold_min1000.fa.genes.faa -k -d kegg
> cluster_usearch_wrev.rb -i RBG42_scaffold_min1000.fa.genes.faa -k -d uni
> cluster_usearch_wrev.rb -i RBG42_scaffold_min1000.fa.genes.faa -k -d uniprot

(Note we no longer require reverse blast data for import)

When all your jobs are complete and off the cluster, you will know for two reasons: 1) the cluster queue doesn't have any jobs owned by you still running or queued (type qstat to see your jobs on the cluster); and 2) in the output directory, there are no *.fix intermediate files present. After this, you should run the following command to gzip your annotation files and combine them if you have used the split setting in the previous step:

> clean.rb RBG42_scaffold_min1000

After cleaning, each output b6 file should be converted into a "b6+" file. A "b6+" file contains the actual annotation text from the database hit.  This can be created using the following command:

> demo.fa.genes.faa-vs-kegg.b6.gz kegg > demo.fa.genes.faa-vs-kegg.b6+
> demo.fa.genes.faa-vs-uni.b6.gz uniref > demo.fa.genes.faa-vs-uni.b6+
> demo.fa.genes.faa-vs-uniprot.b6.gz uniprot > demo.fa.genes.faa-vs-uniprot.b6+

Note that will read a gzip file (like demo.fa.genes.faa-vs-uni.b6.gz), but the output will be uncompressed. (You can always re-compress it.)

In summary, after you have gone through all of the above steps, you should have a directory that looks similar to the following:

> ls

Note again the naming conventions used: using anything else but this convention will likely prevent the next step, ggKbase import.

***clean.rb didn't work for you? Did you follow the initial instructions to keep each sample in its own subdirectory within raw.d?

Future Updates:

  • better gene prediction
  • better small RNA prediction
  • cluster integration for full workflow
  • tied to ggKbase web interface

now on to...

Data Import