A guide to genome curation outside of ggkbase.
Do you have MANY genomes to curate? See https://ggkbase-help.berkeley.edu/how-to/bulk-genome-curation/
Wondering how to get your already curated fa-file of a genome back into ggkbase? See https://ggkbase-help.berkeley.edu/overview/data-preparation-curated-genome/
Fix scaffolding errors with ra2.py
First fix scaffolding errors in the genome (*.fa) downloaded from ggkbase using ra2.
So what does ra2 do? (from CTB’s script and documentation on slack)
- Errors are identified based on regions of zero coverage by stringently
mapped reads (stringent mapping for both reads in a pair).
- Errors are re-assembled using reads stringently mapped to
error-containing region (stringent mapping for one read in a pair).
- Errors are replaced with the re-assembled sequence, if successful.
- Errors that could not be fixed are either left alone or replaced with Ns (option).
- Scaffolds will be split only If an error is not corrected AND paired-read insert coverage is zero.
$ ra2.py -i <genome.fa> -1 <forward_reads.fq> -2 <reverse_reads.fq> --add-Ns
* If doing additional curation afterwards, you may want to use –extend to extend the ends of scaffolds. However, since this can introduce redundancy, it is best not to use this unless doing additional curation to combine scaffolds with overlapping ends.
* Reads may be gzipped.
* See `$ ra2.py -h` for more info.
output will be in <genome>.curated directory:
re_assembled.fa is the curated genome
re_assembled.report.txt is the report (e = extended, n = error that was not fixed, f = error that was fixed, b = scaffold broken at error)
clean up ra2.py output:
ra2.py produces fasta files with headers that are not compatible with some downstream applications.
clean the fasta file:
$ ~cbrown/shared_scripts/fix_fasta.py <genome.curated/re_assembled.fa> | ~cbrown/shared_scripts/nr_fasta.py rename - > <genome.curated.fa>
fix_fasta.py removes characters which should not be in headers
nr_fasta.py removes duplicate entries by duplicate header names
“ra2.py saves mapped reads to memory and therefore can use a lot of memory when curating very large assemblies. I have had no problem running this on a single or even several genomes at one time. However, it should not be used on a metagenome assembly.” – Chris
Re-map reads to the curated genome
After running ra2.py you should re-map the reads to the genome sequence for visual inspection.
Option 1: Subset fastq file, map with bowtie
Mapping all of the reads to reassembled.fa is both slow and may be hard to visualize in genious. You can use subset_reads.rb to create a subset of reads. Say you have a genome with coverage of 140x, but to visualize in genious you want it to be 20x, then you need to create a subset of 1/7 of the reads. This is indicated in the -p7 flag of subset_reads.rb. Input read files must be unzipped for this script.
For example:
subset_reads.rb -f <trim_clean.PE.1.fastq> -r <trim_clean.PE.2.fastq> -p 7
After creating a subset, map reads using bowtie. re_assembled.fa is renamed to the basename of a given project. For example Ig5772.fa. **Make sure to use stringent flag: –rfg 200,300 to prevent gapping.**
mv re_assembled.fa Ig5772.fa mkdir bt2 bowtie2-build Ig5772.fa bt2/Ig5772.fa /opt/bin/bio/bowtie2 --rfg 200,300 -x bt2/Ig5772.fa -1 /path/to/read_file/Ig5772_ATTCCT_L1_L2_trim_clean.PE.1.fastq.sub7 -2 /path/to/read_file/Ig5772_ATTCCT_L1_L2_trim_clean.PE.2.fastq.sub7 2> Ig5772_mapped.log | shrinksam -v > Ig5772_mapped.sam
Option 2: Subset sam file
As an alternative to subsetting the fastq file and rerunning bowtie, if you still have the sam file, you could just randomly subset that instead with subset_sam.py.
~cbrown/shared_scripts/subset_sam.py -h usage: subset_sam.py [-h] -s S -p P [--sort] [-b B] # randomly subset sam file optional arguments: -h, --help show this help message and exit -s S path to sorted sam file (- for stdin) -p P percent of reads to report, e.g. 50 (approximate) --sort sort the sam file -b B buffer size (GB) to use when sorting sam file (default = 100)
Option 3: Filter the reads
Often it is helpful to remove poorly mapped reads from the mapping file. Chris recommends using mapped.py to do this:
You can filter mapping with same criteria used by ra2.py:
$ ~cbrown/shared_scripts/mapped.py -s <mapping.sam> -m 1 -p both -o
* This command allows for one mismatch in the read mapping. The “-p” option has to do with whether or not the mismatch requirement applies to one or both reads in a pair. In this case the requirement apples to both reads, unless one of the reads did not map.
* See `~cbrown/shared_scripts/mapped.py -h` for more info.
Filter without stringent criteria:
Also useful is to collect the mapped reads themselves, but without stringent criteria. These tend to be essential for manually fixing genome errors.
$ /home/cbrown/shared_scripts/mapped.py -m False -p one -s -r > 2>
Note: If you have already run `shrinksam` on the SAM file it may be easier to just convert the entire file to FASTQ. For example, using `~cbrown/shared_scripts/sam2fastq.py`.
Add read counts to curated fa file
If you skip this step, your genome will lack read counts for a given gene on ggkbase.
$ add_read_count.rb test.sam curated.fa > curated.fa.counted $ mv curated.fa.counted curated.fa