Research projects

The focus areas of my group are Computational Molecular Biology and Machine Learning. We work on large-scale analysis of protein sequences and structures, exploring high-order organization within the protein space. Other research interests include mathematical and statistical models of protein families, algorithms for protein sequence and structure comparison, and structural genomics. Main research projects (partial list):

Biozon: A new generation of biological databases.
The function of genes depends on their extended biological context - their relations to other genes, the set of interactions they form, the pathways they participate in, their subcellular location, and so on. In this view, there is a growing need to corroborate and integrate data from different resources and aspects of biological systems in order to analyze new genes effectively. To address this need, the BIOZON project aims to construct a new unified biological resource and a comprehensive protein and DNA characterization, classification and management system that analyzes biological entities from genes to protein families, biochemical pathways and organisms. BIOZON is based on an extensive database schema that integrates information at the macro-molecular level as well as at the cellular level, from a variety of resources.

This resource already stores extensive information about more than 40 million protein and DNA sequences (integrating sequence, structure, protein-protein interactions, pathways and expression data), totaling about 60 million documents from several different databases as well as from in-house computations, and 2.5 billion relations between documents (including explicit relations between objects, and derived or computed relations based on sequence similarities, profile-profile similarities, structural similarities and more). A preliminary beta version of this knowledge resource is accessible at biozon.org. Other existing data types will be integrated gradually (by hosting other databases). Since new technologies keep generating new data types, the database was designed to be as general as possible, to allow easy integration of future databases. The ultimate goal of the Biozon project is to make all its data readily available to the whole scientific community and, gradually, to provide the means for others to integrate their own data.

One of the unique aspects of BIOZON is that it allows complex queries that span multiple data types (e.g. a protein sequence, a structure, and a pathway). For example, using the web interface, one can easily form a query that returns all proteins that participate in known pathways and have a solved 3D structure. Alternatively, one can ask for all structures of proteins that are involved in known interactions, and so on. Furthermore, the indexing of similarity data allows us to explore new methods of querying data and to implement fuzzy searches that consider similarity relations. The complex and tightly connected infrastructure is also used to propagate information between related biological entities, to rank queries, and to extend functional predictions.
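
To make the idea of a query spanning several data types concrete, here is a minimal sketch in Python. It is not the BIOZON schema or its query interface: the DocumentGraph class, its method names, and the document identifiers (P1, S1, W1, ...) are all hypothetical, chosen only to illustrate how typed documents connected by relations can answer the two example queries above.

```python
from collections import defaultdict


class DocumentGraph:
    """Toy in-memory graph: nodes are typed documents, edges are relations.

    Hypothetical illustration only; not the actual BIOZON data model.
    """

    def __init__(self):
        self.doc_type = {}                 # document id -> type name
        self.neighbors = defaultdict(set)  # document id -> related document ids

    def add_document(self, doc_id, doc_type):
        self.doc_type[doc_id] = doc_type

    def add_relation(self, a, b):
        # Relations are treated as undirected in this sketch.
        self.neighbors[a].add(b)
        self.neighbors[b].add(a)

    def neighbors_of_type(self, doc_id, doc_type):
        """Documents of the given type directly related to doc_id."""
        return {n for n in self.neighbors[doc_id] if self.doc_type[n] == doc_type}

    def query(self, doc_type, linked_to):
        """Documents of doc_type related to at least one document of every type in linked_to."""
        return sorted(
            d for d, t in self.doc_type.items()
            if t == doc_type
            and all(self.neighbors_of_type(d, other) for other in linked_to)
        )


# Made-up example data.
graph = DocumentGraph()
for doc_id, doc_type in [
    ("P1", "protein"), ("P2", "protein"), ("P3", "protein"),
    ("S1", "structure"), ("S2", "structure"),
    ("W1", "pathway"), ("I1", "interaction"),
]:
    graph.add_document(doc_id, doc_type)

for a, b in [("P1", "S1"), ("P1", "W1"), ("P2", "W1"), ("P3", "S2"), ("P3", "I1")]:
    graph.add_relation(a, b)

# Query 1: proteins that participate in a pathway and have a solved structure.
proteins_with_pathway_and_structure = graph.query("protein", ["pathway", "structure"])

# Query 2: structures of proteins that are involved in an interaction.
interacting_proteins = graph.query("protein", ["interaction"])
structures_of_interacting = sorted(
    s for p in interacting_proteins for s in graph.neighbors_of_type(p, "structure")
)

print(proteins_with_pathway_and_structure)
print(structures_of_interacting)
```

The second query is built by composing a type-constrained selection with one extra hop along the relations, which is the kind of traversal a tightly connected document graph makes straightforward.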

ProtoMap: automatic classification of proteins
ProtoMap is a global organization of all protein sequences in the SWISS-PROT and TrEMBL databases, based on a graph representation of the sequence space. Visit the ProtoMap interactive website at one of its mirrors: ProtoMap (Cornell), ProtoMap (Stanford), or ProtoMap (Israel).
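
As a rough illustration of what a graph representation of the sequence space can look like, the sketch below treats sequences as nodes and pairwise similarity scores as weighted edges, and groups sequences into connected components at a chosen score threshold. This is a deliberately simplified stand-in and not the ProtoMap classification procedure, which is not described here; the function name, the scores, and the thresholds are all assumptions made for the example.

```python
def cluster_by_similarity(sequence_ids, scored_pairs, threshold):
    """Group sequences into connected components of the similarity graph.

    scored_pairs is an iterable of (id_a, id_b, score) tuples; an edge is kept
    only if score >= threshold. Hypothetical helper, not ProtoMap's algorithm.
    """
    parent = {s: s for s in sequence_ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for s in sequence_ids:
        clusters.setdefault(find(s), []).append(s)
    return sorted(sorted(members) for members in clusters.values())


# Made-up sequences and similarity scores (higher means more similar).
ids = ["A", "B", "C", "D"]
pairs = [("A", "B", 0.9), ("B", "C", 0.4), ("C", "D", 0.8)]

strict_clusters = cluster_by_similarity(ids, pairs, threshold=0.5)
loose_clusters = cluster_by_similarity(ids, pairs, threshold=0.3)

print(strict_clusters)
print(loose_clusters)
```

Lowering the threshold merges clusters, so scanning over a range of thresholds gives a coarse-to-fine view of how groups of related sequences join together.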

BioSpace: A Unified Sequence-Structure Classification of Proteins.
BioSpace is a multi-level analysis of proteins that combines sequence and structure information, yielding a map of the protein space that is consistent in both sequence and structure, together with 3D models for over 160,000 protein sequences. Partial results of our analysis, as well as the 3D models, are accessible at the preliminary release of the BioSpace website.

More coming soon.