A guide to genome curation outside of ggkbase.
Do you have MANY genomes to curate? See https://ggkbase-help.berkeley.edu/how-to/bulk-genome-curation/
Wondering how to get your already curated fa-file of a genome back into ggkbase? See https://ggkbase-help.berkeley.edu/overview/data-preparation-curated-genome/
Fix scaffolding errors with ra2.py
First fix scaffolding errors in the genome (*.fa) downloaded from ggkbase using ra2.
So what does ra2 do? (from CTB’s script and documentation on slack)
- Errors are identified based on regions of zero coverage by stringently
mapped reads (stringent mapping for both reads in a pair).
- Errors are re-assembled using reads stringently mapped to
error-containing region (stringent mapping for one read in a pair).
- Errors are replaced with the re-assembled sequence, if successful.
- Errors that could not be fixed are either left alone or replaced with Ns (option).
- Scaffolds will be split only If an error is not corrected AND paired-read insert coverage is zero.
$ ra2.py -i <genome.fa> -1 <forward_reads.fq> -2 <reverse_reads.fq> --add-Ns
* If doing additional curation afterwards, you may want to use --extend to extend the ends of scaffolds. However, since this can introduce redundancy, it is best not to use this unless doing additional curation to combine scaffolds with overlapping ends.
* Reads may be gzipped.
* See `$ ra2.py -h` for more info.
output will be in <genome>.curated directory:
re_assembled.fa is the curated genome
re_assembled.report.txt is the report (e = extended, n = error that was not fixed, f = error that was fixed, b = scaffold broken at error)
clean up ra2.py output:
ra2.py produces fasta files with headers that are not compatible with some downstream applications.
clean the fasta file:
$ ~cbrown/shared_scripts/fix_fasta.py <genome.curated/re_assembled.fa> | ~cbrown/shared_scripts/nr_fasta.py rename - > <genome.curated.fa>
fix_fasta.py removes characters which should not be in headers
nr_fasta.py removes duplicate entries by duplicate header names
"ra2.py saves mapped reads to memory and therefore can use a lot of memory when curating very large assemblies. I have had no problem running this on a single or even several genomes at one time. However, it should not be used on a metagenome assembly." - Chris
Re-map reads to the curated genome
After running ra2.py you should re-map the reads to the genome sequence for visual inspection.
Option 1: Subset fastq file, map with bowtie
Mapping all of the reads to reassembled.fa is both slow and may be hard to visualize in genious. You can use subset_reads.rb to create a subset of reads. Say you have a genome with coverage of 140x, but to visualize in genious you want it to be 20x, then you need to create a subset of 1/7 of the reads. This is indicated in the -p7 flag of subset_reads.rb. Input read files must be unzipped for this script.
For example:
subset_reads.rb -f <trim_clean.PE.1.fastq> -r <trim_clean.PE.2.fastq> -p 7
After creating a subset, map reads using bowtie. re_assembled.fa is renamed to the basename of a given project. For example Ig5772.fa. **Make sure to use stringent flag: --rfg 200,300 to prevent gapping.**
mv re_assembled.fa Ig5772.fa mkdir bt2 bowtie2-build Ig5772.fa bt2/Ig5772.fa /opt/bin/bio/bowtie2 --rfg 200,300 -x bt2/Ig5772.fa -1 /path/to/read_file/Ig5772_ATTCCT_L1_L2_trim_clean.PE.1.fastq.sub7 -2 /path/to/read_file/Ig5772_ATTCCT_L1_L2_trim_clean.PE.2.fastq.sub7 2> Ig5772_mapped.log | shrinksam -v > Ig5772_mapped.sam
Option 2: Subset sam file
As an alternative to subsetting the fastq file and rerunning bowtie, if you still have the sam file, you could just randomly subset that instead with subset_sam.py.
~cbrown/shared_scripts/subset_sam.py -h usage: subset_sam.py [-h] -s S -p P [--sort] [-b B] # randomly subset sam file optional arguments: -h, --help show this help message and exit -s S path to sorted sam file (- for stdin) -p P percent of reads to report, e.g. 50 (approximate) --sort sort the sam file -b B buffer size (GB) to use when sorting sam file (default = 100)
Option 3: Filter the reads
Often it is helpful to remove poorly mapped reads from the mapping file. Chris recommends using mapped.py to do this:
You can filter mapping with same criteria used by ra2.py:
$ ~cbrown/shared_scripts/mapped.py -s <mapping.sam> -m 1 -p both -o
* This command allows for one mismatch in the read mapping. The “-p” option has to do with whether or not the mismatch requirement applies to one or both reads in a pair. In this case the requirement apples to both reads, unless one of the reads did not map.
* See `~cbrown/shared_scripts/mapped.py -h` for more info.
Filter without stringent criteria:
Also useful is to collect the mapped reads themselves, but without stringent criteria. These tend to be essential for manually fixing genome errors.
$ /home/cbrown/shared_scripts/mapped.py -m False -p one -s -r > 2>
Note: If you have already run `shrinksam` on the SAM file it may be easier to just convert the entire file to FASTQ. For example, using `~cbrown/shared_scripts/sam2fastq.py`.
Add read counts to curated fa file
If you skip this step, your genome will lack read counts for a given gene on ggkbase.
$ add_read_count.rb test.sam curated.fa > curated.fa.counted $ mv curated.fa.counted curated.fa