Data Preparation – Metagenome | ggKbase Knowledge Base

This post explains what needs to take place before data gets imported into ggKbase.

Step 1: Read File Organization

Once the sample has been sequenced, you need to slack Rohan and Leylen as a group for download and guidance to rename the read files for processing & QC.

Download:

It’ll be done in a directory with the following path: /groups/banfield/sequences/processing/namelastname_MMDDYYYY/initial_download
The person who downloaded the data and will process it (Leylen for now) will send you a .txt file with the names of the reads. You’ll use that to fill in the forward and reverse reads in the following TSV table (ideally, edit it in the Mac Numbers app and export it as TSV). In the table, you’ll also include the name of each project as it will be uploaded to ggKbase, following the Data Naming Convention guide.
Once processed, the projects will be moved to /groups/banfield/sequences/2025

Reads are downloaded into a subdirectory called “raw.d”. Similarly, there will be a directory for the assembled data called “assembly.d” but we’ll discuss that later.

Step 2: Read Processing & QC

Assessing the quality of your reads:

Use the fastqc command and visualize the html file to assess the quality of your reads both before and after processing. For example:

> fastqc -o output_dir -f fastq BASE_NAME.fastq.gz

Read quality control involves 3 steps: removing Illumina adapters (BBTools), removing PhiX and other Illumina trace contaminants (BBTools), and sequence quality trimming (Sickle). After this, a fasta-formatted file containing the remaining reads is needed for the assembly step.

For now, all read processing and QC will be run by Leylen.

Renaming reads from JGI:

If the data comes from JGI, it will usually already have been shuffled so forward/reverse reads are present in the same file. Rename the read file using the unique ggKbase project name as the BASE_NAME:

> mv CZBC.6181.4.39404.fastq.gz BASE_NAME.fastq.gz

Renaming reads from other facilities:

By default, the Illumina platform will generate “chunks” of reads with 4 million reads per file. Your files may look something like this…

> ls
JBBB016D_TGACCA_L001_R1_001.fastq.gz  JBBB016D_TGACCA_L001_R2_001.fastq.gz
JBBB016D_TGACCA_L001_R1_002.fastq.gz  JBBB016D_TGACCA_L001_R2_002.fastq.gz
JBBB016D_TGACCA_L001_R1_003.fastq.gz  JBBB016D_TGACCA_L001_R2_003.fastq.gz
JBBB016D_TGACCA_L001_R1_004.fastq.gz  JBBB016D_TGACCA_L001_R2_004.fastq.gz

To rename these files, use the rename command:

> rename 's/JBBB016D_TGACCA_L001/BASE_NAME/' *

Step 3: Sequence Assembly

Steps 3 & 4 can be combined, see Step 3/4 alternate path below

Assembly will be done in a project folder under /groups/banfield/projects – ask Rohan to create this folder

It is common to use idba_ud for metagenome assemblies, but this is not a requirement – there are other assemblers out there. Idba_ud is the only assembler that includes a scaffolding step by default.

Regardless of which assembler is used, the final scaffold file for ggKbase import must meet these requirements:

Contain only scaffolds 1000bp or longer
Scaffolds should be named numerically using the following format, where BASE_NAME is a unique and descriptive project identifier: BASE_NAME_1.
Every scaffold header needs a read length and read count value (see Step 4)

IDBA_UD Assembly:
Default settings in idba_ud are normally fine for metagenome data sets. It is a good idea to use the “–pre_correction” option to normalize highly-represented kmers in the kmer graph. For assemblies with significant numbers of reads (>100 million), use the high memory cluster queue. Use 64 threads on the cluster and up to 16 threads on biotite.

A typical idba_ud run:

> idba_ud --pre_correction -r BASE_NAME_trim_clean.PE.fa -o BASE_NAME_idba_ud

After idba_ud has finished, your output will look something like this…

> ls
align-100  begin          contig-80.fa  graph-40.fa         local-contig-40.fa
align-20   contig-100.fa  contig.fa     graph-60.fa         local-contig-60.fa
align-40   contig-20.fa   end           graph-80.fa         local-contig-80.fa
align-60   contig-40.fa   graph-100.fa  kmer                log
align-80   contig-60.fa   graph-20.fa   local-contig-20.fa  scaffold.fa

Most of this output can (and should) be removed. Remove the unnecessary files with this command:

> rm kmer contig-* align-* graph-* local-*

Renaming Headers of scaffold.fa files (VERY IMPORTANT!!!)

Now you need to customize the headers on the scaffold.fa output from idba_ud. By default it will have scaffold names like “scaffold_1” etc. We need to change this to something sample/project specific: BASE_NAME_1. A fast way to change every header in a fasta file is by using sed:

> sed 's/scaffold/BASE_NAME/' scaffold.fa > BASE_NAME_scaffold.fa

Note how this customizes each header line, but also create an appropriately-named output file too. All further steps will rely on this naming scheme.

For non-idba assemblers make sure you are renaming your scaffold names to match the ggkbase standard format for scaffold names. Which is sample/basename proceeded by a a contig number.

For spades and megahit you can use this script:

python3 /home/rohan/dev/pipeline/workflow/scripts/gg_rename_assembly.py -i scaffolds.fasta -o BASE_NAME_scaffold.fa -s BASE_NAME

Step 4: Read Mapping for Coverage Calculation

Each scaffold output from the assembly needs a read length and read count appended to the header line. For example:

>BASE_NAME_32 read_length_150 read_count_13456

To calculate the read count, you can use bowtie2 with the default settings. This will map every read to its location on a scaffold. The output of bowtie2 will be a SAM file. This file shows the location where each read maps. You should generate this file for every assembly – it can be useful later as well.

Note that each SAM file will list every read… EVEN IF IT DOESN’T MAP. That means if you have 100 million reads that went into your assembly, you will have a SAM file with at least 100 million lines. You should run all bowtie2-created SAM files through the shrinksam command to remove the unmapping reads and generate a SAM file with only the reads that map to a scaffold in the assembly output. Run read mapping in a unique directory for each sample. Here is how you should run your read mapping:

> mkdir bt2
> bowtie2-build BASE_NAME_scaffold.fa bt2/BASE_NAME_scaffold.fa
> bowtie2 -x bt2/BASE_NAME_scaffold.fa -1 /PATH/TO/BASE_NAME_trim_clean.PE.1.fastq.gz -2 /PATH/TO/BASE_NAME_trim_clean.PE.2.fastq.gz 2> BASE_NAME_scaffold_mapped.log | shrinksam -v > BASE_NAME_scaffold_mapped.sam
> /groups/banfield/software/pipeline/v1.1/scripts/add_read_count.rb BASE_NAME_scaffold_mapped.sam BASE_NAME_scaffold.fa 150 > BASE_NAME_scaffold.fa.counted
> mv BASE_NAME_scaffold.fa.counted BASE_NAME_scaffold.fa

What this does is:

create a bt2 dir for storing the index
create the bt2 index
run bowtie2 mapping using the forward and reverse reads for the sample
add_read_count.rb creates a new fasta file with modified headers and this example assumes 150bp read length – output is redirected into a temporary file. add_read_count.rb can only read SAM files (not BAM).
replace the original scaffold file with the header-modified temporary file.

**Either delete the SAM file once this process is complete or if you anticipate using it for something else in the near future (e.g. automated binning), convert to a BAM file to save disc space:

> cat BASE_NAME_scaffold_mapped.sam | sambam > BASE_NAME_scaffold_mapped.bam
> rm BASE_NAME_scaffold_mapped.sam

You can generate some useful statistics on your assembly output using the contig_stats.pl script. It will tell you the assembly details, like N50, max contig length etc. and give you a nice size breakdown of your contigs/scaffolds. You can run it like so:

> contig_stats.pl -i BASE_NAME_scaffold.fa

Step 3/4 alternate path using myIDBA_UD.py

For now, assembly and following steps can be run out of your directory in /groups/banfield/users/ and final files for import will be moved by Rohan to assembly.d in the project’s directory within /group/banfield/sequences/2020/

The wrapper script, myIDBA_UD.py can be used to complete steps 3 & 4 with one command.

> /groups/banfield/software/pipeline/v1.1/scripts/myIDBA_UD.py -h
usage: myIDBA_UD.py [-h] -b BASENAME [-l READ_LENGTH] [-m MINLENGTH]
                    [-p PROCESSORS] [-n] [-q] [-v]
                    reads_file fwd rev

Wraps running IDBA_UD and pre-, post-processing steps

positional arguments:
  reads_file            shuffle reads in fasta
  fwd                   forward reads for mapping in fastq
  rev                   reverse reads for mapping in fastq

optional arguments:
  -h, --help            show this help message and exit
  -b BASENAME, --basename BASENAME
                        name to use as file base
  -l READ_LENGTH, --read_length READ_LENGTH
                        read length (default: 150)
  -m MINLENGTH, --minlength MINLENGTH
                        contig minimum length cutoff (default: 1000)
  -p PROCESSORS, --processors PROCESSORS
                        number of processors to use (default: 6)
  -n, --dry-run         Don't actually do anything
  -q, --queue           Submit job to queue (default: False)
  -v, --verbose

To run this script, you must give four key parameters:

The BASE_NAME for the project
The fasta file containing the shuffled reads
A fastq file containing just the forward reads (can be gzip’d)
A fastq file containing just the reverse reads (can be gzip’d)

To run on the cluster:

> /groups/banfield/software/pipeline/v1.1/scripts/myIDBA_UD.py BASE_NAME.PE.fa BASE_NAME.PE.1.fastq.gz BASE_NAME.PE.2.fastq.gz -b BASE_NAME -p 64 -q

If you want to run on biotite, remove the “-q” flag and change “-p” (processors) to a max of 16. If you have >100 million reads, talk to Rohan about assembling on the high memory cluster node. If you just want to see the commands that this script will run, but not actually run them, use the “–dry-run” parameter.

When the process is complete, the output directory will look like this…

> ls BASE_NAME_idba_ud/
BASE_NAME_mapped.sam           begin                       end                         scaffold.fa
BASE_NAME_scaffold.fa          bt2/                        log                         
BASE_NAME_scaffold_min1000.fa  contig.fa                   mapped.log

**Either delete the SAM file once this process is complete or if you anticipate using it for something else in the near future (e.g. automated binning), convert to a BAM file to save disc space:

> cat BASE_NAME_scaffold_mapped.sam | sambam > BASE_NAME_scaffold_mapped.bam
> rm BASE_NAME_scaffold_mapped.sam

Step 5: Gene and Small RNA Prediction

For annotating the assembly output, we need to predict genes, small RNAs, and then functionally annotate the predicted open reading frames. We use Prodigal for gene prediction.

Before we get started with predicting genes, remember that we need to size filter our scaffolds. By default, we use a minimum size cutoff of 1000bp. If you used myIDBA_UD.py above to create your assembly, then it will have already created the 1000bp minimum file for you.

Filtering your scaffolds is easy using pullseq. You must follow this naming scheme (i.e. “*_min1000.fa”):

> pullseq -i BASE_NAME_scaffold.fa -m 1000 > BASE_NAME_scaffold_min1000.fa

Running prodigal using the size-filtered scaffolds:

> prodigal -i BASE_NAME_scaffold_min1000.fa -o BASE_NAME_scaffold_min1000.fa.genes -a BASE_NAME_scaffold_min1000.fa.genes.faa -d BASE_NAME_scaffold_min1000.fa.genes.fna -m -p meta

The “-m” flag will prevent prodigal from making predictions that span “N”-containing stretches. This command will generate 3 output files: the gene predictions in GFF format; the corresponding protein sequences; the corresponding DNA sequences.

Finding the 16S rRNA genes:

Using the 16s.sh shell wrapper script which runs Chris Brown’s 16SfromHMM.py script with the default settings and the ssu-align-0p1.1.cm database. This shell script is run as follows (you must follow this naming scheme:

> /groups/banfield/software/pipeline/v1.1/scripts/16s.sh BASE_NAME_scaffold_min1000.fa > BASE_NAME_scaffold_min1000.fa.16s

Predicting tRNA genes using the tRNAscanSE wrapper script:

> /groups/banfield/software/pipeline/v1.1/scripts/trnascan_pusher.rb -i BASE_NAME_scaffold_min1000.fa > /dev/null 2>&1

(The reason for the redirection to /dev/null is that the tRNAscanSE perl script causes a ton of warnings from the perl interpreter due to coding style. These don’t affect the output, but are obnoxious nonetheless. Redirecting to /dev/null will hide them from your terminal window.)

Step 6: Annotation

We annotate the predicted proteins by doing similarity searches using usearch, comparing each protein sequence against the following databases:

KEGG (curated database with excellent metabolic pathways metadata)
UniRef100 (curated dataset derived from UniProt; provides functional and taxonomic information)
UniProt (a comprehensive, non-redundant database derived from numerous sources including Borg genes)

Annotation searches can be done on the cluster using the updated cluster_usearch_wrev-copy.edit.python.py script developed by Rohan. You will need to submit 3 jobs per sample to the cluster queue (one for each database). It may take a while to complete, depending on the size of your input file and the status of the queue. You may only submit 30 jobs at a time to the cluster. This is how they should be run:

> python3 /groups/banfield/users/rohan/ggdb/cluster_usearch_wrev-copy.edit.python.py -i /FULL/PATH/TO/BASE_NAME_scaffold_min1000.fa.genes.faa -d kegg -t 48

> python3 /groups/banfield/users/rohan/ggdb/cluster_usearch_wrev-copy.edit.python.py -i /FULL/PATH/TO/BASE_NAME_scaffold_min1000.fa.genes.faa -d uni -t 48

> python3 /groups/banfield/users/rohan/ggdb/cluster_usearch_wrev-copy.edit.python.py -i /FULL/PATH/TO/BASE_NAME_scaffold_min1000.fa.genes.faa -d uniprot -t 48

-i = input .faa file with predicted proteins
-d = database to search (kegg, uni, uniprot)
-t = number of threads (optional; default is usually ~48 on the cluster)

These commands automatically submit an sbatch job for each search using diamond blastp. You can modify or omit the -t 48 option depending on your resource needs.

When all your jobs are complete and off the cluster, you will know for two reasons:

The cluster queue doesn’t have any jobs owned by you still running or queued (type squeue to see your jobs on the cluster)
In the output directory, all expected .b6+ files are present and no jobs are actively writing to them (check file timestamps if unsure)

Output Files

This new script directly produces .b6+ files that include annotation metadata. You do not need to run annolookup.py, and .b6.gz intermediate files are no longer generated.

Example output files per sample:

BASE_NAME_scaffold_min1000.fa.genes.faa-vs-kegg.b6+
BASE_NAME_scaffold_min1000.fa.genes.faa-vs-uni.b6+
BASE_NAME_scaffold_min1000.fa.genes.faa-vs-uniprot.b6+

Summary

After you have gone through all of the above steps, you should have a directory that looks similar to the following:

> ls
begin
bt2/
BASE_NAME_scaffold.fa
BASE_NAME_scaffold_min1000.16S.cmsearch
BASE_NAME_scaffold_min1000.fa
BASE_NAME_scaffold_min1000.fa.16s
BASE_NAME_scaffold_min1000.fa.genes
BASE_NAME_scaffold_min1000.fa.genes.faa
BASE_NAME_scaffold_min1000.fa.genes.faa-vs-kegg.b6+
BASE_NAME_scaffold_min1000.fa.genes.faa-vs-kegg.b6.gz
BASE_NAME_scaffold_min1000.fa.genes.faa-vs-uni.b6+
BASE_NAME_scaffold_min1000.fa.genes.faa-vs-uni.b6.gz
BASE_NAME_scaffold_min1000.fa.genes.faa-vs-uniprot.b6+
BASE_NAME_scaffold_min1000.fa.genes.faa-vs-uniprot.b6.gz
BASE_NAME_scaffold_min1000.fa.genes.fna
BASE_NAME_scaffold_min1000.fa.trnascan
BASE_NAME_scaffold_min1000.fa.trnascan.fasta
contig.fa
end
log
mapped.log
scaffold.fa

**Note again the naming conventions used: using anything else but this convention will likely prevent the next step, ggKbase import…