Gene prediction
1.     Step 1: Pre-processing
1.     Step 2: Genome scan
1.     Step 3: Transcript prediction
1.     Step 4: Quality control
1.     Quality indices of predicted transcripts
    Orthology assignment
2.     Step1: Orthology assignment between species pairs
2.     Step 2: Clustering
2.     Step 3: Multiple alignment
2.     Step 4: Phylogenetic tree topologies
2.     Step 5: Orthology assignment
2.     Step 6: Evolutionary rate estimation
2.     Step 7: Definition of simple ortholog sets
Genes are predicted by homology using a pipeline centered around Guy Slater's exonerate tool. The pipeline predicts transcripts in a target genome using a set of peptide sequences from a reference genome. These constitute the set of reference transcripts.

The principal problem in prediction by homology is the selection of a template sequence. Alternative transcripts and paralogous genes in both the template genome and or in the target species result in cases in which a single reference transcript can be aligned to several candidate regions in the target genome and, vice versa, a candidate region might be matched to several transcripts. The pipeline solves these problems by (1) selecting a representative transcript from each reference gene and (2) pre-selecting likely pairings of orthologous transcripts and candidate regions and processing these first. Paralogous transcripts and variant transcripts are then predicted in subsequent steps.

The pipeline runs on GNU/Linux and is controlled by a set of python scripts. Results are stored in a relation database served by postgres.

The transcript prediction pipeline falls into four steps corresponding to the section subheadings below:
  1. Pre-processing of template sequences,
  2. identification of transcript containing loci,
  3. prediction of transcripts in such loci, and
  4. quality control.

We cluster into genes those template transcripts that share at least one exon with identical exon boundaries. For each gene, the longest unspliced transcript is chosen as its representative; all others are designated as variants. The set of representative transcripts is required for steps 2 and 3.

The second step produces candidate transcript-containing regions from within the eleven newly sequenced genome assemblies. The amino acid sequences encoded in the representative template transcripts are masked for low complexity regions using SEG (Wootton 1994, default options). These masked peptide sequences are then aligned against the unmasked genomes using Exonerate in heuristic mode with sensitive search options.

Exonerate options for genome scan:
	  model=p2g
	  forcegtag=TRUE
	  bestn=200
	  maxintron=50000
	  proteinwordthreshold=3
	  proteinhspdropoff=5
	  proteinwordlen=5 
	  score cutoff=50

These settings produce a multitude of matches between template amino acid sequences and genomic regions. The level of sensitivity for identifying matches was found to be comparable to that of TBLASN. Adjacent matches within 50kB associated with the same template are merged and their scores added. Matches of aggregate scores of less than 80 are discarded as representing likely spurious alignments. The result of this step is a list of pairs of transcripts and matching genomic regions.

The third step produces transcript predictions by aligning the template amino acid sequence to its paired genomic region using the Genewise mode of Exonerate. When the genomic region to be searched is large, this is computationally expensive. At the same time, paralogous template sequences in the input set and in the target genome will give rise to a multitude of redundant predictions. Consequently, the pipeline tries to minimize the numbers of alignments that are performed by first processing pairs between orthologous pairs. Only then it will process the remaining paralogous pairs to properly account for lineage specific duplications. Finally, variant transcripts as opposed to representative transcripts are aligned to genomic segments, where its representative transcript matched previously.

Pairs of matching template transcripts and genomic regions are predicted in three passes. Each pass selects a subset of pairs of transcripts and genomic regions to predict with exonerate (see next section).

The first pass processes likely ortholog matches by discarding all low scoring matches. For a given template these are defined as having an alignment coverage, score or percent identity to the query that is less than half of the best coverage, score or percent identity obtained for any predictions from this particular template sequence.

The second pass excludes all pairs of of templates and genomic regions that overlap with the genomic locations of transcripts predicted in the first pass. The purpose of this step is to predict paralogous sequences.

The third pass predicts transcripts which are variants to those predicted in passes one and two. For each prediction in pass one and two, variant as opposed to representative template transcripts (see step 1) are aligned to the same genomic region as the representative template transcript. This pass does not make use of priority lists.

Prediction of transcripts from a list of pairs of aligned templates with genomic regions

The resulting collection of pairs between templates and various genomic regions is transformed into a collection of priority lists. The priority lists are used to solve the problem of paralogous template queries that will align to the same region containing a homologous gene in the target genome. The priority list ensures, that the likely ortholog sequence is selected for transcript prediction before attempting prediction with paralogous sequences.

Priority lists are built from matches encompassed within the same genomic region. All templates in each priority list are first grouped into clusters on the basis of sequence identity (at least 60%) and alignment coverage (at least 50%). Each cluster thus groups homologous template transcripts that align to a particular genomic region. Each cluster is sorted by decreasing alignment score between the template and genomic prediction. The templates are thus arranged in a priority list, such that more similar templates are used before less similar ones to prediction transcript. As soon as a satisfactory prediction is returned, less similar template sequences are skipped. At most, five templates are in each priority list.

Templates are aligned in order of their priority. A prediction is defined as being satisfactory, if its exon structure corresponds to the exon structure of the template. To this end, exon boundaries of the prediction are mapped onto the template sequence. The exon structure of multi-exon genes is considered to be conserved if at most one exon is missing and/or at most one exon boundary has shifted. Single exon predictions are c onsidered to be conserved if the template is a single exon protein and the prediction aligns to at least 80% of this template.

In order to reduce computation time, the size of the genomic region presented to Exonerate is determined first by another search with Exonerate in heuristic mode, but using more sensitive settings than used in the initial genomic scan (options: proteinwordthreshold=5, proteinhspdropoff=5, proteinwordlen=3). The region is extended with 50kb at either end, unless the template.s ends are aligned, in which case only 0.3kb are added. Transcripts are predicted from template peptide sequences in the final alignment using Exonerate's genewise mode

	    --exhaustive
	    subopt=FALSE
	    forcegtag=TRUE
	    subopt=FALSE

Even though Exonerate permits the detection of suboptimal matches, this capability was switched off as it led to excessive running time. Instead, if a match is found, adjacent regions of 100kB on either side are checked for local duplications. The local search for paralogs is repeated, until no more matches are found.

The pipeline attempts to predict transcripts in a genomic region until it identifies a satisfactory prediction (as defined using a set of heuristics, see below). In multi-gene families several equally suitable templates result in identical or highly similar predictions. The resultant set of predictions thus needed to be filtered to reduce redundancy. Similarly, inappropriate templates might have been used for prediction. Such spurious predictions have to be removed since they are most often invalidated by a better matching, orthologous prediction. To this end, an index is assigned to each prediction that is intended to reflect whether it is orthologous, for all exons, to the template.

This quality index aggregates the following attributes: alignment coverage of the template, presence of frameshifts and/or stop-codons, and conservation of exon structure when compared to its template. Indices are ranked to present an approximate indicator of the prediction quality. Briefly, full length predictions are preferred over incomplete predictions (namely, less than 80% alignment coverage of the template); predictions without in-frame stop-codons and frameshifts are preferred over those containing frame disruptions; and finally, predictions with matching exon boundaries between template and prediction are preferred over those with shifted exon boundaries or those with inserted or deleted introns.

Redundant and spurious predictions are removed by applying a set of heuristic rules. Briefly, predictions are sorted by quality index, length and alignment score. Once a prediction is accepted, other overlapping predictions satisfying one or more of the following rules are removed: (1) the predicted peptide sequence is identical to the template sequence or a part thereof; (2) the prediction is of a lower quality index and the sequences are essentially identical (better than 98% sequence identity, fewer than 20 gap positions); (3) the prediction is a fragment and its sequence is more than 80% identical to the accepted prediction; and (4) the prediction has a lower quality index and spans the better ranked prediction on either strand. These rules permit alternative transcripts per locus but disallow truncated transcripts. An additional filter removes two other types of predictions, those sharing exons between adjacent duplications, and those predictions without conserved gene structure that span other predictions with conserved gene structure, but do not contain overlapping exons.

These rules remove dubious predictions that are invalidated by a better prediction. Nevertheless, those predictions of low confidence but showing no clear favorite are all retained, which reflects the difficulty of predicting that particular transcript. A particular problem arises from genes that are duplicated locally. These can result in spurious transcripts containing exons from adjacent genes. In most cases, exon boundaries between duplicates are conserved and thus all transcripts will possess the same quality index. We found no heuristic to reliably winnow out these spurious transcripts while keeping true alternative transcripts. Instead, we defer identification of these transcripts until after the orthology assignment. Transcript predictions are combined into genes if they share at least one identical exon. Partial overlap is allowed for single exon genes and terminal exons, although at least 80% nucleotide overlap is required. The quality index of a gene is assigned as the best quality index of its transcripts.

Code Description Frag Pseudo Cons PCons NInt QNInt Comment
Genes
CGconserved geneNNYY>0?likely ortholog assignment
SGsingle exon geneNN??00single exon prediction for single exon query
PGpartially conserved geneNNNY>0?paralog or alternative transcript?
RGretrotransposed geneNN??0>1retrotransposed gene (identical copy, all introns lost)
'UGunconserved geneNN????other full length predictions
Pseudogenes
CPduplicated pseudogeneNYY?>0?pseudogene with identical gene structure as query
SPsingle exon pseudogeneNY??00single exon pseudogene, derived from a single exon query
PPpartially conserved pseudoNYNY>0?partially conserved pseudogene
RPprocessed pseudogeneNY??0>1retrotransposed pseudogene (identical copy, all introns lost)
UPunconserved pseudogeneNY????other full length pseudogenes
Fragments
SFsingle exon fragmentYN??00fragment of single exon from single exon gene
CFconserved fragmentYNY???conserved fragment
PFpartially conserved fragmentYNNY??partially conserved fragment
UFunconserved fragmentYN????other fragments
BFpseudogenic fragmentYY????pseudogenic fragment
Legend
Fragis a fragment (less than 80% coverage of query).
Pseudois a pseudogene (frameshifts/in frame stop codons).
Conshas conserved exon structure.
PConshas partially conserved exon structure.
NIntnumber of introns in prediction.
QNIntnumber of introns in query.
In our terminology, we follow the definitions of (Fitch 1970, Remm et al. 2001): orthologs are related through speciation, while paralogs are related through an intra-genome duplication event. In-paralogs duplicated after the speciation of the two species being compared, whereas out-paralogs duplicated before speciation. Simple orthologous pairs are those with only a single copy in each genome, while degenerate orthologous pairs contain in-paralogs for at least one species. Our orthology prediction pipeline follows the following steps:
  1. Orthology assignment between species pairs
    1. All-on-all BLASTP
    2. PhyOp pairwise tree-based orthology assignment (Goodstadt and Ponting 2006)
  2. Clustering of pairwise orthology assignments
  3. Multiple alignment of sequences within clusters
  4. Estimation of phylogenetic tree topologies
  5. Derivation of orthologous groups
  6. Evolutionary rate estimation
  7. Definition of simple orthologous sets

For each pair of genomes, amino acid sequences of full length predictions (>80% coverage of template) are collated. Each amino acid sequence is aligned against every other sequence using BLASTP (Altschul et al. 1997). Alignments with an E-value of more than 10-5 or covering less than 75% of the smaller sequence are removed. The remaining alignments are weighted according to a normalized bit score

Briefly, the PhyOP pipeline then constructs an initial transcript phylogeny based on the UPGMA method and then optimizes branches in the tree using KITSCH (30 iterations, down weighting large distances by a power of 3) modified from the PHYLIP package (Felsenstein 1989). Orthologs are assigned to members of the smallest subtree containing leaves of both species. Nodes closer to the root than a node joining branches of orthologs are considered to be orphans. Depending on the tree topology, several predictions in one species (in-paralogs) can be assigned to the same ortholog in another species. Distances based on amino acid substitutions can be quickly derived from BLASTP alignments, but in some circumstances they can be an unreliable guide for orthology assignment (Goodstadt and Ponting 2006). We thus verified pairwise orthology relationships using the numbers of synonymous substitutions per synonymous site dS calculated with CodeML from the PAML package (Yang 1997) based on the pairwise BLASTP alignments calculated previously. Only orthologous and in-paralogous pairs were considered, thereby reducing computational time. The resultant set of alignments was submitted to another iteration of PhyOP orthology assignment.

Pairwise orthologs are grouped into clusters across all genomes in a clade using a graph-based clustering method. Vertices in the graph are genes. Vertices are connected by edges if they have been predicted to be in a pairwise orthology relationship by PhyOp. The clusters are then given by connected components within the graph. Genes within a cluster are aligned as peptide sequences using muscle. To this end, all alternative transcripts in a gene were combined into a single virtual gene by concatening all exons in genomic order accounting for frameshifts. After alignment, nucleotide sequences were threaded back onto the aligned peptide sequence and resolved into invidual transcripts. This resulted in four alignments per cluster:
  • aligned virtual genes as amino acids,
  • aligned virtual genes as codons,
  • aligned transcripts as amino acids, and
  • aligned transcritps as codons
All of these alignments are available in the server and for download. Further downstream analyses are based on aligned virtual genes, as these capture information of all alternative transcripts of a gene. Phylogenetic trees were computed from codon based multiple alignments of virtual genes using TreeBeST with the following command line options:
-best
This option computes phylogenetic trees based on four distances and combines them according to bootstrap support. TreeBeST demands a species phylogeny as input. We provide standard phylogenies from the literature. The species tree was reconciled with the gene tree using the RIO method (Zmasek and Eddy 2002), which assigns speciation and duplication events to nodes in the gene tree. Based on the tree topology, clusters are split into orthologous groups if they contain out-paralogs according to the established phylogeny in the clade. Thus, an orthologous group contains no genes that have duplicated before the common ancestor of the clade. Branch lenghts were estimated using codonml from the PAML package using the basic Goldman & Yang (1994) model. The following listing presents a typical control file:
* Sequence type is codons
   seqtype = 1
* Codon: one omega per branch - AA: poisson
   model = 0
* default initial or fixed kappa value.
   kappa = 2
* detailed output
   verbose = 1
* Codon frequencies given by F3X4
   CodonFreq = 2
* no clock
   clock = 0
* default initial of fixed omega value
   omega = 0.4
* user tree
   runmode = 0
* inifinity: fixed alpha
   alpha = 0
* Genetic code: universal
   icode = 0
* no site-specific variation of omega
   NSsites = 0
* no correlation between sites
   rho = 0
* alpha is fixed
   fix_alpha = 1
* kappa is to be estimated
   fix_kappa = 0
* default value for small difference
   Small_Diff = .5e-6
* omega is to be estimated
   fix_omega = 0
* rho is fixed
   fix_rho = 1
* how much rubbish on screen
   noisy = 9
Simple ortholog sets are derived for each orthologous group and possible phylogenetic patterns from the tree.