|1.||1||Step 1: Pre-processing|
|1.||2||Step 2: Genome scan|
|1.||3||Step 3: Transcript prediction|
|1.||4||Step 4: Quality control|
|1.||5||Quality indices of predicted transcripts|
|2.||1||Step1: Orthology assignment between species pairs|
|2.||2||Step 2: Clustering|
|2.||3||Step 3: Multiple alignment|
|2.||4||Step 4: Phylogenetic tree topologies|
|2.||5||Step 5: Orthology assignment|
|2.||6||Step 6: Evolutionary rate estimation|
|2.||7||Step 7: Definition of simple ortholog sets|
The principal problem in prediction by homology is the selection of a template sequence. Alternative transcripts and paralogous genes in both the template genome and or in the target species result in cases in which a single reference transcript can be aligned to several candidate regions in the target genome and, vice versa, a candidate region might be matched to several transcripts. The pipeline solves these problems by (1) selecting a representative transcript from each reference gene and (2) pre-selecting likely pairings of orthologous transcripts and candidate regions and processing these first. Paralogous transcripts and variant transcripts are then predicted in subsequent steps.
We cluster into genes those template transcripts that share at least one exon with identical exon boundaries. For each gene, the longest unspliced transcript is chosen as its representative; all others are designated as variants. The set of representative transcripts is required for steps 2 and 3.
The second step produces candidate transcript-containing regions from within the eleven newly sequenced genome assemblies. The amino acid sequences encoded in the representative template transcripts are masked for low complexity regions using SEG (Wootton 1994, default options). These masked peptide sequences are then aligned against the unmasked genomes using Exonerate in heuristic mode with sensitive search options.Exonerate options for genome scan:
model=p2g forcegtag=TRUE bestn=200 maxintron=50000 proteinwordthreshold=3 proteinhspdropoff=5 proteinwordlen=5 score cutoff=50
These settings produce a multitude of matches between template amino acid sequences and genomic regions. The level of sensitivity for identifying matches was found to be comparable to that of TBLASN. Adjacent matches within 50kB associated with the same template are merged and their scores added. Matches of aggregate scores of less than 80 are discarded as representing likely spurious alignments. The result of this step is a list of pairs of transcripts and matching genomic regions.
The third step produces transcript predictions by aligning the template amino acid sequence to its paired genomic region using the Genewise mode of Exonerate. When the genomic region to be searched is large, this is computationally expensive. At the same time, paralogous template sequences in the input set and in the target genome will give rise to a multitude of redundant predictions. Consequently, the pipeline tries to minimize the numbers of alignments that are performed by first processing pairs between orthologous pairs. Only then it will process the remaining paralogous pairs to properly account for lineage specific duplications. Finally, variant transcripts as opposed to representative transcripts are aligned to genomic segments, where its representative transcript matched previously.
Pairs of matching template transcripts and genomic regions are predicted in three passes. Each pass selects a subset of pairs of transcripts and genomic regions to predict with exonerate (see next section).
The first pass processes likely ortholog matches by discarding all low scoring matches. For a given template these are defined as having an alignment coverage, score or percent identity to the query that is less than half of the best coverage, score or percent identity obtained for any predictions from this particular template sequence.
The second pass excludes all pairs of of templates and genomic regions that overlap with the genomic locations of transcripts predicted in the first pass. The purpose of this step is to predict paralogous sequences.
The third pass predicts transcripts which are variants to those predicted in passes one and two. For each prediction in pass one and two, variant as opposed to representative template transcripts (see step 1) are aligned to the same genomic region as the representative template transcript. This pass does not make use of priority lists.
Prediction of transcripts from a list of pairs of aligned templates with genomic regions
The resulting collection of pairs between templates and various genomic regions is transformed into a collection of priority lists. The priority lists are used to solve the problem of paralogous template queries that will align to the same region containing a homologous gene in the target genome. The priority list ensures, that the likely ortholog sequence is selected for transcript prediction before attempting prediction with paralogous sequences.
Priority lists are built from matches encompassed within the same genomic region. All templates in each priority list are first grouped into clusters on the basis of sequence identity (at least 60%) and alignment coverage (at least 50%). Each cluster thus groups homologous template transcripts that align to a particular genomic region. Each cluster is sorted by decreasing alignment score between the template and genomic prediction. The templates are thus arranged in a priority list, such that more similar templates are used before less similar ones to prediction transcript. As soon as a satisfactory prediction is returned, less similar template sequences are skipped. At most, five templates are in each priority list.
Templates are aligned in order of their priority. A prediction is defined as being satisfactory, if its exon structure corresponds to the exon structure of the template. To this end, exon boundaries of the prediction are mapped onto the template sequence. The exon structure of multi-exon genes is considered to be conserved if at most one exon is missing and/or at most one exon boundary has shifted. Single exon predictions are c onsidered to be conserved if the template is a single exon protein and the prediction aligns to at least 80% of this template.
In order to reduce computation time, the size of the genomic region presented to Exonerate is determined first by another search with Exonerate in heuristic mode, but using more sensitive settings than used in the initial genomic scan (options: proteinwordthreshold=5, proteinhspdropoff=5, proteinwordlen=3). The region is extended with 50kb at either end, unless the template.s ends are aligned, in which case only 0.3kb are added. Transcripts are predicted from template peptide sequences in the final alignment using Exonerate's genewise mode
--exhaustive subopt=FALSE forcegtag=TRUE subopt=FALSE
Even though Exonerate permits the detection of suboptimal matches, this capability was switched off as it led to excessive running time. Instead, if a match is found, adjacent regions of 100kB on either side are checked for local duplications. The local search for paralogs is repeated, until no more matches are found.
The pipeline attempts to predict transcripts in a genomic region until it identifies a satisfactory prediction (as defined using a set of heuristics, see below). In multi-gene families several equally suitable templates result in identical or highly similar predictions. The resultant set of predictions thus needed to be filtered to reduce redundancy. Similarly, inappropriate templates might have been used for prediction. Such spurious predictions have to be removed since they are most often invalidated by a better matching, orthologous prediction. To this end, an index is assigned to each prediction that is intended to reflect whether it is orthologous, for all exons, to the template.
This quality index aggregates the following attributes: alignment coverage of the template, presence of frameshifts and/or stop-codons, and conservation of exon structure when compared to its template. Indices are ranked to present an approximate indicator of the prediction quality. Briefly, full length predictions are preferred over incomplete predictions (namely, less than 80% alignment coverage of the template); predictions without in-frame stop-codons and frameshifts are preferred over those containing frame disruptions; and finally, predictions with matching exon boundaries between template and prediction are preferred over those with shifted exon boundaries or those with inserted or deleted introns.
Redundant and spurious predictions are removed by applying a set of heuristic rules. Briefly, predictions are sorted by quality index, length and alignment score. Once a prediction is accepted, other overlapping predictions satisfying one or more of the following rules are removed: (1) the predicted peptide sequence is identical to the template sequence or a part thereof; (2) the prediction is of a lower quality index and the sequences are essentially identical (better than 98% sequence identity, fewer than 20 gap positions); (3) the prediction is a fragment and its sequence is more than 80% identical to the accepted prediction; and (4) the prediction has a lower quality index and spans the better ranked prediction on either strand. These rules permit alternative transcripts per locus but disallow truncated transcripts. An additional filter removes two other types of predictions, those sharing exons between adjacent duplications, and those predictions without conserved gene structure that span other predictions with conserved gene structure, but do not contain overlapping exons.
These rules remove dubious predictions that are invalidated by a better prediction. Nevertheless, those predictions of low confidence but showing no clear favorite are all retained, which reflects the difficulty of predicting that particular transcript. A particular problem arises from genes that are duplicated locally. These can result in spurious transcripts containing exons from adjacent genes. In most cases, exon boundaries between duplicates are conserved and thus all transcripts will possess the same quality index. We found no heuristic to reliably winnow out these spurious transcripts while keeping true alternative transcripts. Instead, we defer identification of these transcripts until after the orthology assignment. Transcript predictions are combined into genes if they share at least one identical exon. Partial overlap is allowed for single exon genes and terminal exons, although at least 80% nucleotide overlap is required. The quality index of a gene is assigned as the best quality index of its transcripts.
|CG||conserved gene||N||N||Y||Y||>0||?||likely ortholog assignment|
|SG||single exon gene||N||N||?||?||0||0||single exon prediction for single exon query|
|PG||partially conserved gene||N||N||N||Y||>0||?||paralog or alternative transcript?|
|RG||retrotransposed gene||N||N||?||?||0||>1||retrotransposed gene (identical copy, all introns lost)|
|'UG||unconserved gene||N||N||?||?||?||?||other full length predictions|
|CP||duplicated pseudogene||N||Y||Y||?||>0||?||pseudogene with identical gene structure as query|
|SP||single exon pseudogene||N||Y||?||?||0||0||single exon pseudogene, derived from a single exon query|
|PP||partially conserved pseudo||N||Y||N||Y||>0||?||partially conserved pseudogene|
|RP||processed pseudogene||N||Y||?||?||0||>1||retrotransposed pseudogene (identical copy, all introns lost)|
|UP||unconserved pseudogene||N||Y||?||?||?||?||other full length pseudogenes|
|SF||single exon fragment||Y||N||?||?||0||0||fragment of single exon from single exon gene|
|CF||conserved fragment||Y||N||Y||?||?||?||conserved fragment|
|PF||partially conserved fragment||Y||N||N||Y||?||?||partially conserved fragment|
|UF||unconserved fragment||Y||N||?||?||?||?||other fragments|
|BF||pseudogenic fragment||Y||Y||?||?||?||?||pseudogenic fragment|
|Frag||is a fragment (less than 80% coverage of query).|
|Pseudo||is a pseudogene (frameshifts/in frame stop codons).|
|Cons||has conserved exon structure.|
|PCons||has partially conserved exon structure.|
|NInt||number of introns in prediction.|
|QNInt||number of introns in query.|
For each pair of genomes, amino acid sequences of full length predictions (>80% coverage of template) are collated. Each amino acid sequence is aligned against every other sequence using BLASTP (Altschul et al. 1997). Alignments with an E-value of more than 10-5 or covering less than 75% of the smaller sequence are removed. The remaining alignments are weighted according to a normalized bit score
Briefly, the PhyOP pipeline then constructs an initial transcript phylogeny based on the UPGMA method and then optimizes branches in the tree using KITSCH (30 iterations, down weighting large distances by a power of 3) modified from the PHYLIP package (Felsenstein 1989). Orthologs are assigned to members of the smallest subtree containing leaves of both species. Nodes closer to the root than a node joining branches of orthologs are considered to be orphans. Depending on the tree topology, several predictions in one species (in-paralogs) can be assigned to the same ortholog in another species. Distances based on amino acid substitutions can be quickly derived from BLASTP alignments, but in some circumstances they can be an unreliable guide for orthology assignment (Goodstadt and Ponting 2006). We thus verified pairwise orthology relationships using the numbers of synonymous substitutions per synonymous site dS calculated with CodeML from the PAML package (Yang 1997) based on the pairwise BLASTP alignments calculated previously. Only orthologous and in-paralogous pairs were considered, thereby reducing computational time. The resultant set of alignments was submitted to another iteration of PhyOP orthology assignment.
-bestThis option computes phylogenetic trees based on four distances and combines them according to bootstrap support. TreeBeST demands a species phylogeny as input. We provide standard phylogenies from the literature.
* Sequence type is codons seqtype = 1 * Codon: one omega per branch - AA: poisson model = 0 * default initial or fixed kappa value. kappa = 2 * detailed output verbose = 1 * Codon frequencies given by F3X4 CodonFreq = 2 * no clock clock = 0 * default initial of fixed omega value omega = 0.4 * user tree runmode = 0 * inifinity: fixed alpha alpha = 0 * Genetic code: universal icode = 0 * no site-specific variation of omega NSsites = 0 * no correlation between sites rho = 0 * alpha is fixed fix_alpha = 1 * kappa is to be estimated fix_kappa = 0 * default value for small difference Small_Diff = .5e-6 * omega is to be estimated fix_omega = 0 * rho is fixed fix_rho = 1 * how much rubbish on screen noisy = 9