This post explains what needs to take place before data gets imported into ggKbase.
Step 1: Sample/Project Creation
A project in ggKbase corresponds directly to a single sample, e.g. one taken in the field. Projects are further organized into Project Groups, which roughly represent a study or site. For example, you may have many sediment samples from one location; these samples can represent different conditions and/or technical replicates. During project creation on the ggkbase website (http://ggkbase.berkeley.edu/projects/new), you will be asked to select an existing Project Group or to define a new one.
You will need the following information to create a new project:
- Project name: corresponds to a sample name; can be free form but should be kept short
- Project slug: part of the URL; used to find your project using a web link; should be short and similar to your project name, but only containing A-Za-z0-9_- (i.e. no spaces or other non-web characters)
- Project type: can be either "binning" or "analysis"
- Project Group: represents the overall group this project belongs in; can either exist or can be created at this time; if this is a new project group, you'll need to define the ecosystem
- Sequencing Facility: short description of sequencing center or facility where the sample was sequenced
- Location: where was the sample taken; use existing location, or enter the closest city to where sample was taken
- Date of sample: when was this sample taken
- Read file(s): location on our server where the read or reads are located
- Read processing: how were the reads processed; short description
- Read length: length of one read, pre-trimming
- Total read bp: number of bp of reads going into the assembly
- Total assembled sequence: bp of sequence that resulted after assembly
- Assembly type: what did you use to assemble the reads; e.g. "IDBA_UD"
- Path to assembly: important field with the location on our server of the output assembly and annotation files, as well as the file basename of all the generated output
- Members: other members of ggkbase that should be added to the project on import
Step 2: Read File Processing
Once the sample has been sequenced, you need to download and process the read files. If the data comes from JGI, it will usually already have been shuffled so forward/reverse reads are present in the same file. If not from JGI, the forward and reverse reads are usually in separate files, and in fact, they are usually still in non-concatenated chunks.
For the downloading step, be sure to check with BCT about where to download the files: a specific directory (or directories) must be created for the projects. Reads are downloaded into a directory called "raw.d" and you should follow this model. Similarly, there will be a directory for the assembled data called "assembly.d" but we'll discuss that later. In raw.d, create subdirectories for each sample.
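As an illustration, the layout described above might be set up like this (the project and sample names here are hypothetical; check with BCT for the actual directory to use):

```shell
# standard project layout: raw reads under raw.d, one subdirectory per sample
mkdir -p myproject/raw.d/sample_A myproject/raw.d/sample_B
# assembled data will go under assembly.d (discussed later)
mkdir -p myproject/assembly.d

# reads for each sample are then downloaded into their own subdirectory
ls myproject/raw.d
```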
Processing reads from JGI
First, uncompress and unshuffle the reads file using fastq_peel.sh...
> fastq_peel.sh -h
splits a paired-end fastq file into \1 and \2 reads
Usage: fastq_peel.sh paired_reads.fastq output_prefix
paired_reads.fastq may be any list of files to be sent to cat, if contained in quotes and separated by spaces
(input can be gzip compressed - filename must end in .gz)
For example, if your reads file is called "CZBC.6181.4.39404.fastq.gz", you would do this...
> gunzip CZBC.6181.4.39404.fastq.gz
> fastq_peel.sh CZBC.6181.4.39404.fastq CZBC.6181.4.39404
This will create two new files...
> ls
CZBC.6181.4.39404.1.fastq  CZBC.6181.4.39404.2.fastq
At this point, the original read file can be removed. Now proceed with the remaining steps below.
Processing reads from other facilities
By default, the Illumina platform will generate "chunks" of reads with 4 million reads per file. Your files may look something like this...
> ls
JBBB016D_TGACCA_L001_R1_001.fastq.gz  JBBB016D_TGACCA_L001_R2_001.fastq.gz
JBBB016D_TGACCA_L001_R1_002.fastq.gz  JBBB016D_TGACCA_L001_R2_002.fastq.gz
JBBB016D_TGACCA_L001_R1_003.fastq.gz  JBBB016D_TGACCA_L001_R2_003.fastq.gz
JBBB016D_TGACCA_L001_R1_004.fastq.gz  JBBB016D_TGACCA_L001_R2_004.fastq.gz
To process these files, they need to be combined. Since concatenated gzip streams are themselves valid gzip, this can be done with plain cat:
# process forward (R1) reads
> cat JBBB016D_TGACCA_L001_R1_001.fastq.gz JBBB016D_TGACCA_L001_R1_002.fastq.gz JBBB016D_TGACCA_L001_R1_003.fastq.gz JBBB016D_TGACCA_L001_R1_004.fastq.gz > JBBB016D_TGACCA_L001_R1.fastq.gz
# now process reverse (R2) reads
> cat JBBB016D_TGACCA_L001_R2_001.fastq.gz JBBB016D_TGACCA_L001_R2_002.fastq.gz JBBB016D_TGACCA_L001_R2_003.fastq.gz JBBB016D_TGACCA_L001_R2_004.fastq.gz > JBBB016D_TGACCA_L001_R2.fastq.gz
Now link the read file information to your ggkbase project for each sample/project. Before moving on to read QC, rename your files so that they share a matching basename, with .1. and .2. marking the forward and reverse reads.
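For example, renaming the combined files from above into the basename.1/.2 convention might look like this (the filenames are the hypothetical ones from the listing; touch stands in for the real files in this sketch):

```shell
# stand-ins for the combined read files from the previous step
touch JBBB016D_TGACCA_L001_R1.fastq.gz JBBB016D_TGACCA_L001_R2.fastq.gz

# rename to a shared basename with .1. (forward) and .2. (reverse)
mv JBBB016D_TGACCA_L001_R1.fastq.gz JBBB016D.1.fastq.gz
mv JBBB016D_TGACCA_L001_R2.fastq.gz JBBB016D.2.fastq.gz
```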
Processing reads containing human DNA "contaminants"
If your reads are generated from human microbiome samples, they need to be screened to remove any human DNA fragments. This can be done with process_reads_bbmap.rb using the -m and -d options.
Assessing the Quality of Your Reads
Use the fastqc command in biotite and visualize the html file to assess the quality of your reads both before and after process_reads_bbmap.rb. For example:
> /opt/bin/bio/FastQC/fastqc -o output_dir -f fastq seqfile.1.fastq.gz seqfile.2.fastq.gz
Step 3: Read QC
Read quality control involves 3 steps: removing Illumina adapters (BBTools), removing PhiX and other Illumina trace contaminants (BBTools), and sequence quality trimming (Sickle). After this, a fasta-formatted file containing the remaining reads is needed for the assembly step.
All of these steps can be done using the process_reads_bbmap.rb script on biotite.berkeley.edu:
> process_reads_bbmap.rb -h
Summary:
process_reads_bbmap.rb: Processes JGI reads by trimming adapters and filtering for PhiX/Illumina contaminants and ends with sickle quality trimming the reads; can also create a fasta-formatted, shuffled output ready for assembly

Synopsis:
process_reads_bbmap.rb --basename=<reads base file name; can be shuffled or separate>

Options:
  -b, --basename=<s>                 read file(s) base name
  -p, --processors=<i>               number of CPU cores to use (default: 6)
  -r, --run-sickle, --no-run-sickle  run sickle quality trimmer (default: true)
  -k, --sickle-flag=<s>              if running sickle, set type flag (default: sanger)
  -c, --create-fa                    create fasta file for assembly
  -m, --mask-human                   remove reads mapping to humans (only for microbiomes)
  -d, --humandb=<s>                  database to use for human masking (default: )
  -s, --shuffled                     input is already shuffled
  -z, --zipped                       input is zipped
  -d, --dry-run                      Don't actually run anything, just report
  -v, --verbose                      show extra progress information
  -e, --version                      Print version and exit
  -h, --help                         Show this message
Usually the default settings are fine, but be aware that BBTools and Sickle have many customizable parameters. An example of processing JGI read files is below. Note that if you have BOTH the original shuffled file downloaded from JGI and the output from the fastq_peel.sh step above, process_reads_bbmap.rb will find the shuffled file and process it. You should therefore delete the original file downloaded from JGI after you have un-shuffled it...
> ls
8871.5.114481.CGTACG.1.fastq.gz  8871.5.114481.CGTACG.2.fastq.gz
> process_reads_bbmap.rb --basename=8871.5.114481.CGTACG -c
Note that this script will determine many of the options for you (e.g. zipped or not, shuffled or not, etc). The dry-run option will show you exactly what would be run, without actually running the programs. After running this program, your directory should look like this...
> ls
8871.5.114481.CGTACG.1.fastq.gz             8871.5.114481.CGTACG_trim_clean.2.fastq.gz
8871.5.114481.CGTACG.2.fastq.gz             8871.5.114481.CGTACG_trim_clean.PE.1.fastq.gz
8871.5.114481.CGTACG_trim.1.fastq.gz        8871.5.114481.CGTACG_trim_clean.PE.2.fastq.gz
8871.5.114481.CGTACG_trim.2.fastq.gz        8871.5.114481.CGTACG_trim_clean.PE.fa
8871.5.114481.CGTACG_trim_clean.1.fastq.gz  8871.5.114481.CGTACG_trim_clean.SR.fastq.gz
Note that the original file is gone (replaced by the unshuffled versions), the trimmed, cleaned, and shuffled fasta file has been built (8871.5.114481.CGTACG_trim_clean.PE.fa) and is ready for the assembly stage, and all read files have been compressed with gzip. This process can take more than an hour to complete.
Step 4: Sequence Assembly
You are now ready for assembly. It is common to use idba_ud for metagenome assemblies, but this is not a requirement - there are other assemblers out there. idba_ud is, however, the only assembler that includes a scaffolding step by default.
If an assembly is destined for import into ggKbase (and they all should be), there are certain requirements regarding the format of the assembler output:
- Decide on a minimum length of contigs/scaffolds to be imported, e.g. 500bp, 1000bp, or 2000bp. Smaller than 1000bp is not recommended because the gene prediction quality goes down.
- Make sure your contigs are reasonably named. Use only alphanumeric characters and underscores. In order to ensure unique names within ggKbase, contigs/scaffolds need a project specific prefix, e.g. ProjectX_scaffold_1.
- Every contig needs a read count or coverage value. This can be done by read mapping the reads that were used in the assembly back to the output of the assembler.
- Determine per-contig/scaffold read count using bowtie2 mapping (see Step 5)
- Add the read count value to each contig/scaffold header line
Default settings in idba_ud are normally fine for metagenome data sets. It is a good idea to use the "--pre_correction" option to normalize highly-represented kmers in the kmer graph. If you're assembling on the Banfield cluster, remember that each node only has 256Gb of system memory. This will limit the size of the assembly you can attempt, since idba_ud has a substantial memory requirement. (For assemblies with significant numbers of reads (>100 million), contact BCT about running the assembly on a bigger server.) Note that the number of CPU cores to use during the run is limited to 6 on biotite. On other machines, you should adjust accordingly.
A typical idba_ud run, using all the reads for a sample, might look like the following...
> idba_ud --pre_correction -r 8871.5.114481.CGTACG_trim_clean.PE.fa -o idba_ud_full
If you are running a subassembly (i.e. using only a subset of the reads for a sample), name the output directory accordingly...
> idba_ud --pre_correction -r 8871.5.114481.CGTACG_trim_clean.PE.sub10.fa -o idba_ud_sub10
After idba_ud has finished, your output will look something like this...
> ls
align-100  begin          contig-80.fa  graph-40.fa         local-contig-40.fa
align-20   contig-100.fa  contig.fa     graph-60.fa         local-contig-60.fa
align-40   contig-20.fa   end           graph-80.fa         local-contig-80.fa
align-60   contig-40.fa   graph-100.fa  kmer                log
align-80   contig-60.fa   graph-20.fa   local-contig-20.fa  scaffold.fa
Most of this output can (and should) be removed. Remove the unnecessary files with this command:
> rm kmer contig-* align-* graph-* local-*
Now you need to customize the headers on the scaffold.fa output from idba_ud. By default it will have scaffold names like "scaffold_1" etc. We need to change this to something sample/project specific. For example, "RBG42_scaffold", or "Gut1_time4_scaffold". A fast way to change every header in a fasta file is by using the unix sed command...
> sed 's/scaffold/RBG42_scaffold/' scaffold.fa > RBG42_scaffold.fa
Note how this customizes each header line and also creates an appropriately-named output file. All further steps will rely on this naming scheme.
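A quick sanity check with plain grep can confirm that every header got the prefix. This is a sketch on a tiny stand-in file (toy_renamed.fa here represents the renamed scaffold file; RBG42 is the example prefix from above):

```shell
# toy stand-in for a renamed scaffold file after the sed step
printf '>RBG42_scaffold_1\nACGT\n>RBG42_scaffold_2\nGGCC\n' > toy_renamed.fa

# the two counts should be equal: every header carries the project prefix
grep -c '^>' toy_renamed.fa
grep -c '^>RBG42_scaffold' toy_renamed.fa
```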
Step 5: Read Mapping for Coverage Calculation
Each contig/scaffold output from the assembly needs a read count appended to the header line as well as a read length. The normal convention for ggKbase import is to add this information on the header line of each contig/scaffold from the assembly...
>CODE_scaffold_32 read_length_150 read_count_13456
To calculate the read count, you can use bowtie2 with the default settings. This will map every read to its location on a contig (if it was used in the assembly). The output of bowtie2 will be a SAM file. This file shows the location where each read maps. You should generate this file for every assembly - it can be useful later as well.
Note that each SAM file will list every read... EVEN IF IT DOESN'T MAP. That means if you have 100 million reads that went into your assembly, you will have a SAM file with at least 100 million lines. You should run all bowtie2-created SAM files through the shrinksam command to remove the unmapped reads and generate a SAM file with only the reads that map to a contig/scaffold in the assembly output. This will save a lot of disk space!
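If shrinksam is unavailable, the same filtering can be approximated with awk: keep header lines, plus any alignment whose FLAG field does not have the 0x4 (unmapped) bit set. This is a generic sketch on a toy SAM file, not the pipeline's own tool:

```shell
# toy SAM: one header line, one mapped read (flag 0), one unmapped read (flag 4)
printf '@SQ\tSN:scaf1\tLN:1000\nr1\t0\tscaf1\t10\t42\t4M\t*\t0\t0\tACGT\tIIII\nr2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n' > toy.sam

# keep headers (@...) and reads where the unmapped bit (0x4) is unset
awk '/^@/ || int($2/4) % 2 == 0' toy.sam > toy.mapped.sam
```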
Here is how you should run your read mapping:
> mkdir bt2
> bowtie2-build RBG42_scaffold.fa bt2/RBG42_scaffold.fa
> bowtie2 -x bt2/RBG42_scaffold.fa -1 /path/to/RBG42_trim_clean.PE.1.fastq.gz -2 /path/to/RBG42_trim_clean.PE.2.fastq.gz 2> RBG42_scaffold_mapped.log | shrinksam -v > RBG42_scaffold_mapped.sam
> add_read_count.rb RBG42_scaffold_mapped.sam RBG42_scaffold.fa 150 > RBG42_scaffold.fa.counted
> mv RBG42_scaffold.fa.counted RBG42_scaffold.fa
These commands do the following:
- create a bt2 dir for storing the index
- create the bt2 index
- run bowtie2 mapping using the forward and reverse reads for the sample
- add_read_count.rb creates a new fasta file with modified headers and this example assumes 150bp read length - output is redirected into a temporary file
- replace the original scaffold file with the header-modified temporary file.
Once mapping is complete, you can generate some useful statistics on your assembly output using the contig_stats.pl script on biotite. It will report assembly details, like N50, maximum contig length etc., and give you a nice size breakdown of your contigs/scaffolds. You can run it like so...
> contig_stats.pl -i RBG42_scaffold.fa
It will generate a file called "RBG42_scaffold.fa.summary.txt". At this point, you're ready for annotating your assembly.
(Note: myIDBA_UD.py on biotite will run the assembly, rename the output scaffolds, run the bowtie2 mapping and add the read count header information. You can use this script instead of steps 4 and 5 above.)
Step 4/5 alternate path using myIDBA_UD.py
The wrapper script, myIDBA_UD.py can be used to complete steps 4 and 5 with one command.
> myIDBA_UD.py -h
usage: myIDBA_UD.py [-h] -b BASENAME [-l READ_LENGTH] [-m MINLENGTH] [-p PROCESSORS] [-n] [-q] [-v] reads_file fwd rev

Wraps running IDBA_UD and pre-, post-processing steps

positional arguments:
  reads_file            shuffled reads in fasta
  fwd                   forward reads for mapping in fastq
  rev                   reverse reads for mapping in fastq

optional arguments:
  -h, --help            show this help message and exit
  -b BASENAME, --basename BASENAME
                        name to use as file base
  -l READ_LENGTH, --read_length READ_LENGTH
                        read length (default: 150)
  -m MINLENGTH, --minlength MINLENGTH
                        contig minimum length cutoff (default: 1000)
  -p PROCESSORS, --processors PROCESSORS
                        number of processors to use (default: 6)
  -n, --dry-run         Don't actually do anything
  -q, --queue           Submit job to queue (default: False)
  -v, --verbose
To run this script, you must give four key parameters:
- a base name for the project; this will be your project name most likely
- the fasta file containing the shuffled reads
- a fastq file containing just the forward reads (can be gzip'd)
- a fastq file containing just the reverse reads (can be gzip'd)
Additionally, if you are running this assembly locally on biotite (e.g. if you need the extra memory present on biotite), your assembly will be limited to 6 cores. If you give myIDBA_UD.py the "-q" argument, you will have access to more cores (48 max), but you will have less memory (each node has 256Gb of system memory). Therefore, if you have a lot of reads, e.g. greater than about 100 million, you should run on biotite and contact BCT for advice. If you just want to see the commands that this script will run, but not actually run them, use the "--dry-run" parameter. An example run would look as follows...
> myIDBA_UD.py 31_003.PE.fa 31_003.PE.1.fastq.gz 31_003.PE.2.fastq.gz -b 31test -p 48 -q
When the process is complete, the output directory will look like this...
> ls 31test_idba_ud/
31test_mapped.sam           begin      end         scaffold.fa
31test_scaffold.fa          bt2/       log
31test_scaffold_min1000.fa  contig.fa  mapped.log
You're now ready to continue on with gene prediction.
Step 6: Gene and Small RNA Prediction
NOTE: Steps 6 and 7 will eventually be contained in a master script. For now, they must be run manually.
For annotating the assembly output, we need to predict genes and small RNAs, and then functionally annotate the predicted open reading frames. We use Prodigal for gene prediction (and we are going to be adding more tools soon). Predicting small RNAs uses an in-house system developed by Chris Brown to find 16S rRNAs, and we use tRNAscan-SE to predict tRNA genes.
Before we get started with predicting genes, remember that we need to size filter our contigs/scaffolds. By default, we use a minimum size cutoff of 1000bp. Filtering your contigs/scaffolds is easy using pullseq:
> pullseq -i RBG42_scaffold.fa -m 1000 > RBG42_scaffold_min1000.fa
You should follow this naming scheme (i.e. "*_min1000.fa" for everything larger than 1000bp).
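If pullseq is not installed, the same length filter can be sketched with awk. This is a generic stand-in, shown here on a toy fasta file with a 10 bp cutoff rather than the real 1000 bp one:

```shell
# toy fasta: one 12 bp record and one 4 bp record
printf '>long_1\nACGTACGTACGT\n>short_1\nACGT\n' > toy_scaffold.fa

# keep records of at least MIN bp (awk alternative to pullseq -m; sketch only)
awk -v MIN=10 'BEGIN{RS=">"; ORS=""} NR>1 {seq=$0; sub(/^[^\n]*\n/, "", seq); gsub(/\n/, "", seq); if (length(seq) >= MIN) print ">" $0}' toy_scaffold.fa > toy_scaffold_min10.fa
```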
NOTE: if you used myIDBA_UD.py above to create your assembly, then it will have already created the 1000bp minimum file for you.
Running prodigal uses the size-filtered contigs/scaffolds and is done as follows:
> prodigal -i RBG42_scaffold_min1000.fa -o RBG42_scaffold_min1000.fa.genes -a RBG42_scaffold_min1000.fa.genes.faa -d RBG42_scaffold_min1000.fa.genes.fna -m -p meta
The -m flag prevents prodigal from making predictions that span "N"-containing stretches, and -p meta selects the metagenome mode. The run generates 3 output files: the gene predictions (-o), the corresponding protein sequences (-a), and the corresponding DNA sequences (-d).
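A quick way to see how many genes were predicted is to count the fasta headers in the protein output. This sketch uses a tiny stand-in file in place of real prodigal output:

```shell
# toy stand-in for prodigal's protein output (.faa)
printf '>RBG42_scaffold_1_1\nMKV\n>RBG42_scaffold_1_2\nMLA\n' > toy_genes.faa

# number of predicted proteins = number of fasta headers
grep -c '^>' toy_genes.faa
```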
Finding the 16S rRNA genes on the contigs/scaffolds is done using the 16s.sh shell wrapper script which runs Chris Brown's 16SfromHMM.py script using the default settings and the ssu-align-0p1.1.cm database. This shell script is run as follows:
> 16s.sh RBG42_scaffold_min1000.fa > RBG42_scaffold_min1000.fa.16s
You should follow this naming scheme.
tRNA genes can be predicted using the tRNAscanSE wrapper script:
> trnascan_pusher.rb -i RBG42_scaffold_min1000.fa > /dev/null 2>&1
(The reason for the redirection to /dev/null is that the tRNAscanSE perl script causes a ton of warnings from the perl interpreter due to coding style. These don't affect the output, but are obnoxious nonetheless. Redirecting to /dev/null will hide them from your terminal window.)
Step 7: Annotation
We annotate the predicted proteins by doing similarity searches using usearch, comparing each protein sequence against the following databases:
- KEGG (curated database with excellent metabolic pathways metadata)
- UniRef100 (curated dataset derived from UniProt; provides functional and taxonomic information)
- UniProt (a comprehensive, non-redundant database derived from numerous sources)
For KEGG and UniRef100, we do a forward search; occasionally we also do a reverse search to identify reciprocal best hits. UniProt searches are only done in the forward direction (due to the size of the database). Note that reverse searches are discouraged due to their large resource requirements.
For some datasets, we also run Motif searches using iprscan. It is not part of the standard annotation pipeline because of long runtime.
Annotation searches can be done on the cluster using the cluster_usearch_wrev.rb script on biotite. This will create a usearch database of the proteins (if doing a reverse search) and submit 3 jobs per sample to the cluster queue (one for each database). It may take a while to complete, depending on the size of your input database and the status of the queue. You can check whether you have any jobs still running by typing qstat. Below is a typical set of commands to run the one-way searches against the KEGG, UniRef100, and UniProt databases:
> cluster_usearch_wrev.rb -i RBG42_scaffold_min1000.fa.genes.faa -k -d kegg
> cluster_usearch_wrev.rb -i RBG42_scaffold_min1000.fa.genes.faa -k -d uni
> cluster_usearch_wrev.rb -i RBG42_scaffold_min1000.fa.genes.faa -k -d uniprot
(Note we no longer require reverse blast data for import)
You will know when all your jobs are complete and off the cluster for two reasons: 1) the cluster queue doesn't have any jobs owned by you still running or queued (type qstat to see your jobs on the cluster); and 2) in the output directory, there are no *.fix intermediate files present. After this, you should run the following command to gzip your annotation files and combine them if you used the split setting in the previous step:
> clean.rb RBG42_scaffold_min1000
After cleaning, each output b6 file should be converted into a "b6+" file. A "b6+" file contains the actual annotation text from the database hit. This can be created using the following command:
> annolookup.py demo.fa.genes.faa-vs-kegg.b6.gz kegg > demo.fa.genes.faa-vs-kegg.b6+
> annolookup.py demo.fa.genes.faa-vs-uni.b6.gz uniref > demo.fa.genes.faa-vs-uni.b6+
> annolookup.py demo.fa.genes.faa-vs-uniprot.b6.gz uniprot > demo.fa.genes.faa-vs-uniprot.b6+
Note that annolookup.py will read a gzip file (like demo.fa.genes.faa-vs-uni.b6.gz), but the output will be uncompressed. (You can always re-compress it.)
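Re-compressing a b6+ file afterwards is just gzip. A sketch, using a stand-in for the demo file named above:

```shell
# toy stand-in for an annolookup.py b6+ output file
printf 'gene_1\tK00001\tannotation text\n' > demo.fa.genes.faa-vs-kegg.b6+

# re-compress; produces demo.fa.genes.faa-vs-kegg.b6+.gz
gzip demo.fa.genes.faa-vs-kegg.b6+
```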
In summary, after you have gone through all of the above steps, you should have a directory that looks similar to the following:
> ls
begin
bt2/
CG23_sub10_150_scaffold.fa
CG23_sub10_150_scaffold_mapped.sam
CG23_sub10_150_scaffold_min1000.16S.cmsearch
CG23_sub10_150_scaffold_min1000.fa
CG23_sub10_150_scaffold_min1000.fa.16s
CG23_sub10_150_scaffold_min1000.fa.genes
CG23_sub10_150_scaffold_min1000.fa.genes.faa
CG23_sub10_150_scaffold_min1000.fa.genes.faa-vs-kegg.b6+
CG23_sub10_150_scaffold_min1000.fa.genes.faa-vs-kegg.b6.gz
CG23_sub10_150_scaffold_min1000.fa.genes.faa-vs-uni.b6+
CG23_sub10_150_scaffold_min1000.fa.genes.faa-vs-uni.b6.gz
CG23_sub10_150_scaffold_min1000.fa.genes.faa-vs-uniprot.b6+
CG23_sub10_150_scaffold_min1000.fa.genes.faa-vs-uniprot.b6.gz
CG23_sub10_150_scaffold_min1000.fa.genes.fna
CG23_sub10_150_scaffold_min1000.fa.trnascan
CG23_sub10_150_scaffold_min1000.fa.trnascan.fasta
contig.fa
end
log
mapped.log
scaffold.fa
Note again the naming conventions used: anything other than this convention will likely prevent the next step, ggKbase import.
clean.rb didn't work for you? Check that you followed the initial instructions to keep each sample in its own subdirectory within raw.d.
Planned improvements to this pipeline include:
- better gene prediction
- better small RNA prediction
- cluster integration for the full workflow
- tighter integration with the ggKbase web interface
now on to...