Data Preparation – Metagenome

This post explains what needs to take place before data gets imported into ggKbase.

Step 1: Read File Organization

Once the sample has been sequenced, you need to download and rename the read files for processing & QC.

Download: check with Rohan about where to download the files. A specific directory within /groups/banfield/sequences/2020 must be created for each sample using the unique project name you wish to use in ggKbase. Reads are downloaded into a subdirectory called "raw.d".  Similarly, there will be a directory for the assembled data called "assembly.d" but we'll discuss that later.

Renaming reads from JGI:

If the data comes from JGI, it will usually already have been shuffled so forward/reverse reads are present in the same file. Rename the read file using the unique ggKbase project name as the BASE_NAME:

> mv CZBC.6181.4.39404.fastq.gz BASE_NAME.fastq.gz

Renaming reads from other facilities:

By default, the Illumina platform will generate "chunks" of reads with 4 million reads per file. Your files may look something like this...

> ls
JBBB016D_TGACCA_L001_R1_001.fastq.gz  JBBB016D_TGACCA_L001_R2_001.fastq.gz
JBBB016D_TGACCA_L001_R1_002.fastq.gz  JBBB016D_TGACCA_L001_R2_002.fastq.gz
JBBB016D_TGACCA_L001_R1_003.fastq.gz  JBBB016D_TGACCA_L001_R2_003.fastq.gz
JBBB016D_TGACCA_L001_R1_004.fastq.gz  JBBB016D_TGACCA_L001_R2_004.fastq.gz

To rename these files, use the rename command:

> rename 's/JBBB016D_TGACCA_L001/BASE_NAME/' *

Step 2: Read Processing & QC

Assessing the quality of your reads:

Use the fastqc command and visualize the html file to assess the quality of your reads both before and after processing. For example:

> fastqc -o output_dir -f fastq BASE_NAME.fastq.gz 

Read quality control involves 3 steps: removing Illumina adapters (BBTools), removing PhiX and other Illumina trace contaminants (BBTools), and sequence quality trimming (Sickle).  After this, a fasta-formatted file containing the remaining reads is needed for the assembly step.

For now, all read processing and QC will be run by Rohan. Eventually, a snakemake workflow will handle this and assembly all in one step.

Step 3: Sequence Assembly

Steps 3 & 4 can be combined, see Step 3/4 alternate path below

For now, assembly and following steps can be run out of your directory in /groups/banfield/users/ and final files for import will be moved by Rohan to assembly.d in the project's directory within /group/banfield/sequences/2020/ 

It is common to use idba_ud for metagenome assemblies, but this is not a requirement - there are other assemblers out there.  Idba_ud is the only assembler that includes a scaffolding step by default.

Regardless of which assembler is used, the final scaffold file for ggKbase import must meet these requirements:

  • Contain only scaffolds 1000bp or longer
  • Scaffolds should be named numerically using the following format, where BASE_NAME is a unique and descriptive project identifier: BASE_NAME_scaffold_1.
  • Every scaffold header needs a read length and read count value (see Step 4)

IDBA_UD Assembly:
Default settings in idba_ud are normally fine for metagenome data sets. It is a good idea to use the "--pre_correction" option to normalize highly-represented kmers in the kmer graph. For assemblies with significant numbers of reads (>100 million), contact Rohan about running the assembly on the high memory cluster node. Use 48 threads on the cluster and up to 16 threads on biotite.

A typical idba_ud run:

> idba_ud --pre_correction -r BASE_NAME_trim_clean.PE.fa -o BASE_NAME_idba_ud

After idba_ud has finished, your output will look something like this...

> ls
align-100  begin          contig-80.fa  graph-40.fa         local-contig-40.fa
align-20   contig-100.fa  contig.fa     graph-60.fa         local-contig-60.fa
align-40   contig-20.fa   end           graph-80.fa         local-contig-80.fa
align-60   contig-40.fa   graph-100.fa  kmer                log
align-80   contig-60.fa   graph-20.fa   local-contig-20.fa  scaffold.fa

Most of this output can (and should) be removed. Remove the unnecessary files with this command:

> rm kmer contig-* align-* graph-* local-*

Now you need to customize the headers on the scaffold.fa output from idba_ud. By default it will have scaffold names like "scaffold_1" etc.  We need to change this to something sample/project specific: BASE_NAME_scaffold_1.  A fast way to change every header in a fasta file is by using sed:

> sed 's/scaffold/BASE_NAME_scaffold/' scaffold.fa > BASE_NAME_scaffold.fa

Note how this customizes each header line, but also create an appropriately-named output file too. All further steps will rely on this naming scheme.

Step 4: Read Mapping for Coverage Calculation

Each scaffold output from the assembly needs a read length and read count appended to the header line.  For example:

>BASE_NAME_scaffold_32 read_length_150 read_count_13456

To calculate the read count, you can use bowtie2 with the default settings. This will map every read to its location on a scaffold.  The output of bowtie2 will be a SAM file. This file shows the location where each read maps.  You should generate this file for every assembly - it can be useful later as well.

Note that each SAM file will list every read... EVEN IF IT DOESN'T MAP. That means if you have 100 million reads that went into your assembly, you will have a SAM file with at least 100 million lines. You should run all bowtie2-created SAM files through the shrinksam command to remove the unmapping reads and generate a SAM file with only the reads that map to a scaffold in the assembly output. Here is how you should run your read mapping:

> mkdir bt2
> bowtie2-build BASE_NAME_scaffold.fa bt2/BASE_NAME_scaffold.fa
> bowtie2 -x bt2/BASE_NAME_scaffold.fa -1 /PATH/TO/BASE_NAME_trim_clean.PE.1.fastq.gz -2 /PATH/TO/BASE_NAME_trim_clean.PE.2.fastq.gz 2> BASE_NAME_scaffold_mapped.log | shrinksam -v > BASE_NAME_scaffold_mapped.sam
> /groups/banfield/software/pipeline/v1.1/scripts/add_read_count.rb BASE_NAME_scaffold_mapped.sam BASE_NAME_scaffold.fa 150 > BASE_NAME_scaffold.fa.counted
> mv BASE_NAME_scaffold.fa.counted BASE_NAME_scaffold.fa

What this does is:

  1. create a bt2 dir for storing the index
  2. create the bt2 index
  3. run bowtie2 mapping using the forward and reverse reads for the sample
  4. add_read_count.rb creates a new fasta file with modified headers and this example assumes 150bp read length - output is redirected into a temporary file. add_read_count.rb can only read SAM files (not BAM).
  5. replace the original scaffold file with the header-modified temporary file.

**Either delete the SAM file once this process is complete or if you anticipate using it for something else in the near future (e.g. automated binning), convert to a BAM file to save disc space:

> cat BASE_NAME_scaffold_mapped.sam | sambam > BASE_NAME_scaffold_mapped.bam
> rm BASE_NAME_scaffold_mapped.sam

You can generate some useful statistics on your assembly output using the contig_stats.pl script. It will tell you the assembly details, like N50, max contig length etc. and give you a nice size breakdown of your contigs/scaffolds.  You can run it like so:

> contig_stats.pl -i BASE_NAME_scaffold.fa

Step 3/4 alternate path using myIDBA_UD.py

For now, assembly and following steps can be run out of your directory in /groups/banfield/users/ and final files for import will be moved by Rohan to assembly.d in the project's directory within /group/banfield/sequences/2020/ 

The wrapper script, myIDBA_UD.py can be used to complete steps 3 & 4 with one command.

> /groups/banfield/software/pipeline/v1.1/scripts/myIDBA_UD.py -h
usage: myIDBA_UD.py [-h] -b BASENAME [-l READ_LENGTH] [-m MINLENGTH]
                    [-p PROCESSORS] [-n] [-q] [-v]
                    reads_file fwd rev

Wraps running IDBA_UD and pre-, post-processing steps

positional arguments:
  reads_file            shuffle reads in fasta
  fwd                   forward reads for mapping in fastq
  rev                   reverse reads for mapping in fastq

optional arguments:
  -h, --help            show this help message and exit
  -b BASENAME, --basename BASENAME
                        name to use as file base
  -l READ_LENGTH, --read_length READ_LENGTH
                        read length (default: 150)
  -m MINLENGTH, --minlength MINLENGTH
                        contig minimum length cutoff (default: 1000)
  -p PROCESSORS, --processors PROCESSORS
                        number of processors to use (default: 6)
  -n, --dry-run         Don't actually do anything
  -q, --queue           Submit job to queue (default: False)
  -v, --verbose

To run this script, you must give four key parameters:

  1. The BASE_NAME for the project
  2. The fasta file containing the shuffled reads
  3. A fastq file containing just the forward reads (can be gzip'd)
  4. A fastq file containing just the reverse reads (can be gzip'd)

To run on the cluster:

> /groups/banfield/software/pipeline/v1.1/scripts/myIDBA_UD.py BASE_NAME.PE.fa BASE_NAME.PE.1.fastq.gz BASE_NAME.PE.2.fastq.gz -b BASE_NAME -p 48 -q

If you want to run on biotite, remove the "-q" flag and change "-p" (processors) to a max of 16. If you have >100 million reads, talk to Rohan about assembling on the high memory cluster node. If you just want to see the commands that this script will run, but not actually run them, use the "--dry-run" parameter.

When the process is complete, the output directory will look like this...

> ls BASE_NAME_idba_ud/
BASE_NAME_mapped.sam           begin                       end                         scaffold.fa
BASE_NAME_scaffold.fa          bt2/                        log                         
BASE_NAME_scaffold_min1000.fa  contig.fa                   mapped.log

**Either delete the SAM file once this process is complete or if you anticipate using it for something else in the near future (e.g. automated binning), convert to a BAM file to save disc space:

> cat BASE_NAME_scaffold_mapped.sam | sambam > BASE_NAME_scaffold_mapped.bam
> rm BASE_NAME_scaffold_mapped.sam

Step 5: Gene and Small RNA Prediction

For annotating the assembly output, we need to predict genes, small RNAs, and then functionally annotate the predicted open reading frames.  We use  Prodigal for gene prediction.

Before we get started with predicting genes, remember that we need to size filter our scaffolds. By default, we use a minimum size cutoff of 1000bp. If you used myIDBA_UD.py above to create  your assembly, then it will have already created the 1000bp minimum file for you.

Filtering your scaffolds is easy using pullseq. You must follow this naming scheme (i.e. "*_min1000.fa"):

> pullseq -i BASE_NAME_scaffold.fa -m 1000 > BASE_NAME_scaffold_min1000.fa

Running prodigal using the size-filtered scaffolds:

> prodigal -i BASE_NAME_scaffold_min1000.fa -o BASE_NAME_scaffold_min1000.fa.genes -a BASE_NAME_scaffold_min1000.fa.genes.faa -d BASE_NAME_scaffold_min1000.fa.genes.fna -m -p meta

The "-m" flag will prevent prodigal from making predictions that span "N"-containing stretches. This command will generate 3 output files: the gene predictions in GFF format; the corresponding protein sequences; the corresponding DNA sequences.

Finding the 16S rRNA genes:

Using the 16s.sh shell wrapper script which runs Chris Brown's 16SfromHMM.py script with the default settings and the ssu-align-0p1.1.cm database.  This shell script is run as follows (you must follow this naming scheme:

> /groups/banfield/software/pipeline/v1.1/scripts/16s.sh BASE_NAME_scaffold_min1000.fa > BASE_NAME_scaffold_min1000.fa.16s

Predicting tRNA genes using the tRNAscanSE wrapper script:

> /groups/banfield/software/pipeline/v1.1/scripts/trnascan_pusher.rb -i BASE_NAME_scaffold_min1000.fa > /dev/null 2>&1

(The reason for the redirection to /dev/null is that the tRNAscanSE perl script causes a ton of warnings from the perl interpreter due to coding style.  These don't affect the output, but are obnoxious nonetheless. Redirecting to /dev/null will hide them from your terminal window.)

Step 6: Annotation

We annotate the predicted proteins by doing similarity searches using usearch, comparing each protein sequence against the following databases:

  • KEGG (curated database with excellent metabolic pathways metadata)
  • UniRef100 (curated dataset derived from UniProt; provides functional and taxonomic information)
  • UniProt (a comprehensive, non-redundant database derived from numerous sources)

Annotation searches can be done on the cluster using the cluster_usearch_wrev.rb script. You will need to submit 3 jobs per sample to the cluster queue (one for each database).  It may take a while to complete, depending on the size of you input database and the status of the queue. You may only submit 30 jobs at a time to the cluster. This is how they should be run:

> sbatch --wrap "/groups/banfield/software/pipeline/v1.1/scripts/cluster_usearch_wrev.rb -i /FULL/PATH/TO/BASE_NAME_scaffold_min1000.fa.genes.faa -k -d kegg --nocluster"
> sbatch --wrap "/groups/banfield/software/pipeline/v1.1/scripts/cluster_usearch_wrev.rb -i /FULL/PATH/TO/BASE_NAME_scaffold_min1000.fa.genes.faa -k -d uni --nocluster"
> sbatch --wrap "/groups/banfield/software/pipeline/v1.1/scripts/cluster_usearch_wrev.rb -i /FULL/PATH/TO/BASE_NAME_scaffold_min1000.fa.genes.faa -k -d uniprot --nocluster"

When all your jobs are complete and off the cluster, you will know for two reasons:

  1. The cluster queue doesn't have any jobs owned by you still running or queued (type squeue to see your jobs on the cluster)
  2. In the output directory, there are no *.fix intermediate files present.

After this, gzip your annotation files:

> gzip *.b6

Now each output b6 file is converted into a "b6+" file, which contains the actual annotation text from the database hit:

> /shared/software/bin/annolookup.py BASE_NAME_scaffold_min1000.fa.genes.faa-vs-kegg.b6.gz kegg > BASE_NAME_scaffold_min1000.fa.genes.faa-vs-kegg.b6+
> /shared/software/bin/annolookup.py BASE_NAME_scaffold_min1000.fa.genes.faa-vs-uni.b6.gz uniref > BASE_NAME_scaffold_min1000.fa.genes.faa-vs-uni.b6+
> /shared/software/bin/annolookup.py BASE_NAME_scaffold_min1000.fa.genes.faa-vs-uniprot.b6.gz uniprot > BASE_NAME_scaffold_min1000.fa.genes.faa-vs-uniprot.b6+

Note that annolookup.py reads a gzip'd file, but the output should be uncompressed for import.

Summary

After you have gone through all of the above steps, you should have a directory that looks similar to the following:

> ls
begin
bt2/
BASE_NAME_scaffold.fa
BASE_NAME_scaffold_min1000.16S.cmsearch
BASE_NAME_scaffold_min1000.fa
BASE_NAME_scaffold_min1000.fa.16s
BASE_NAME_scaffold_min1000.fa.genes
BASE_NAME_scaffold_min1000.fa.genes.faa
BASE_NAME_scaffold_min1000.fa.genes.faa-vs-kegg.b6+
BASE_NAME_scaffold_min1000.fa.genes.faa-vs-kegg.b6.gz
BASE_NAME_scaffold_min1000.fa.genes.faa-vs-uni.b6+
BASE_NAME_scaffold_min1000.fa.genes.faa-vs-uni.b6.gz
BASE_NAME_scaffold_min1000.fa.genes.faa-vs-uniprot.b6+
BASE_NAME_scaffold_min1000.fa.genes.faa-vs-uniprot.b6.gz
BASE_NAME_scaffold_min1000.fa.genes.fna
BASE_NAME_scaffold_min1000.fa.trnascan
BASE_NAME_scaffold_min1000.fa.trnascan.fasta
contig.fa
end
log
mapped.log
scaffold.fa

**Note again the naming conventions used: using anything else but this convention will likely prevent the next step, ggKbase import...

Data Import