Warning: pg_exec() [function.pg-exec]: No PostgreSQL link opened yet in /a/fr-06/lab/proto/www-prod/global_functions/inc_global_constants.php on line 34

Warning: pg_exec() [function.pg-exec]: No PostgreSQL link opened yet in /a/fr-06/lab/proto/www-prod/global_functions/inc_global_constants.php on line 59
ProtoNet : Methods

  ProtoNet Tools

   News
   Introduction
   Methods
   Site Map
   Guided Tour

  Related Links
  ProtoNet Team
  Feedback
   Cite us

history

 

Methods

The data presented in this site is based on extensive pre-computation. The foundation for the classification of proteins is a calculation of 'all-against-all' similarity for the set of proteins. The similarity measure is based on the Expect value ("E-score") of the BLAST® algorithm. This similary measure induces a weighted directed graph which has the protein sequences as its vertices.

Clustering Algorithm

The clustering algorithm is implemented in two phases. In the first phase, edges representing low similarity are removed from the graph. The second phase is iterative in nature, and involves merging clusters based on their mutual distances. At first, each protein constitutes a cluster of its own (a "leaf"), and in each clustering step two clusters which have the highest similarity (smallest distance) are merged into one "parent" cluster. The way in which similarity of clusters is defined is the differentiating factor for the clustering algorithms.

A new advanced clustering algorithm was implemented from 2008 for ProtoNet. For more details see:

Yaniv Loewenstein, Elon Portugaly, Menachem Fromer, and Michal Linial. Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space. Bioinformatics, 24(13):i41-49, 2008.

The clustering source code is available here:
MC-UPGMA source code

Multiple Clusterings

ProtoNet supports three different clustering methods, each with its own way of defining similarity between clusters:
Harmonic classification uses the harmonic average of protein distances.
Geometric classification uses the geometric average of protein distances.
Arithmetic classification uses the arithmetic average of protein distances.

The similarity between clusters is used for the crucial step: the repeated merging of clusters.

Hanging of TrEMBL proteins (applied in versions prior to SWP 41.21 and Tr24.8)

The clustering algorithm described above was only run on the Swiss-Prot proteins of UniProt. The TrEMBL proteins, a computer-annotated supplement of Swiss-Prot that contains all the translations of EMBL nucleotide sequence entries not yet integrated in Swiss-Prot, were not used to build the "skeleton" ProtoNet tree because they have not yet been manually curated. The TrEMBL proteins were integrated (starting from Swiss-Prot 41.21) into the "skeleton" ProtoNet tree, built only from Swiss-Prot, in the following way:

Each such TrEMBL protein was integrated into the ProtoNet tree independently and without consideration of previously added TrEMBL proteins, so that the order of their addition would not affect the process. For each TrEMBL protein, the most "similar" Swiss-Prot protein is found (using the similarity measure based on the BLAST algorithm, as described above). The TrEMBL protein is initally attached ("hung") to the leaf of this Swiss-Prot protein in the ProtoNet tree. Then, the TrEMBL protein iteratively ascends in the path from this leaf to the root, as long as its addition to the parent cluster (of its cluster in the previous iteration) improves the merging score of the parent cluster. For example, if Arithmetic classification is being used, then this condition requires that the arithmetic average of protein E-score distances in the cluster will decrease by the addition of this TrEMBL protein; i.e. the proteins in the clusters will be, on average, more similar with this protein present than without it. The protein is then "hung" in the ProtoNet forest at the location of its final iteration.

In versions ProtoNet 5.1 and ProtoNet 6.0, pre-calculated all vs all BLAST results of UniRef50 are computed. Thus, the hanging of TrEMBL protein protocol is not applicable.

Forest Condensation

The clustering process outlined above results in a group of «trees» of clusters (a "forest"). Following this process, the trees in the forest are "trimmed" by automatic parametric removal of clusters, so as to reduce the complexity of the trees.
For example, the "simplified mode" of this Web site uses a ProtoLevel-based criterion: each cluster’s lifetime is defined as the difference in its parent’s and its own ProtoLevel. Only clusters with a large enough lifetime remain after the condensation.

On the other hand, in advanced mode, the user may choose from various condensation parameters and their respective thresholds, yielding different forests of «trees».


Announcements

 

Related publications

  • N. Kaplan, O. Sasson, U. Inbar, M. Friedlich, M. Fromer, H. Fleischer, E. Portugaly, N. Linial, M. Linial
    ProtoNet 4.0: A hierarchical classification of one million protein sequences.
    Nucleic Acids Research, 2005, Vol. 33, Database issue
    PDF file [222KB]

  • N. Kaplan, M. Friedlich, M. Fromer, M. Linial
    A functional hierarchical organization of the protein sequence space.
    BMC Bioinformatics 2004, 5:196
    PDF file [281KB], Article URL

  • O. Sasson, H. Fleischer, E. Portugaly, Y. Bilu, N. Linial and M. Linial.
    ProtoNet: Navigating the Hierarchical Clustering of the Protein Space.
    PostScript file [800KB]
  • Sasson O, Linial N, Linial M.
    The metric space of proteins-comparative study of clustering algorithms.
    Bioinformatics. 2002 Jul; 18 Suppl 1:S14-21.
    Accepted to ISMB '02. [submitted]
    PostScript file [207KB], PDF file [396KB]