Assess a similarity measure of expression data

Introduction

It is commonly accepted that genes with similar expression profiles are functionally related. However, there are many ways to measure the similarity of expression profiles, and it is not clear a priori which one is most effective.

This server tests different similarity measures between expression profiles. It evaluates how effectively each measure detects functionally related genes and how well it correlates with experimentally verified functional relationships extracted from pathway data, protein-protein interaction data, sequence data and promoter data. Our method is described in detail in the following paper

Data sets

In our study we focus on the three data sets listed below. However, the methodology and the tools are applicable to other data sets, and we intend to link other data sets at a later date (contact us if you are interested in testing your measures on other expression data sets).

  • Time-series 1998 (Spellman et al. Mol Biol Cell 9:3273-3297, 1998). The time-series data set is available to download from the Yeast Cell Cycle Analysis Project webpage at Stanford. To make sure that you are using the same data set we used, we strongly recommend that you download our local copy. This copy contains only the time-series data, i.e. the data in columns 1,2,3,4,5,6,25,50,68 (labeled cln3-1, cln3-2, clb, clb2-2, clb2-1, alpha, cdc15, cdc28, and elu) are omitted. Blank entries (missing values) were assigned -666 as a placeholder (you can use your own preferred method to handle missing data).

  • Rosetta-2000 (Hughes et al. Cell 102:109-126, 2000). Our local copy

  • Stress time-series 2004 (Shapira et al. Mol Biol Cell 15:5659-5669, 2004). Our local copy

In addition, to correlate this data set with the other data sets (sequence, protein-protein interactions, pathways and promoters) we reordered the rows in this file. Each line corresponds to one of the 6298 genes in the yeast genome (see mapping from gene names to gene numbers). So that we can process your file, it is important that you report your results using the gene numbers 1-6298. Other useful files: numbers to names, numbers to genbank (GI)

Note that some genes have no expression profiles (marked with a vector of -666). Altogether there are 5902 genes with expression profiles (get the list of these 5902 genes).
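
The -666 placeholder is easy to mistake for a real measurement, so it helps to mask it before computing any similarity. The following is a minimal sketch, assuming the expression data have already been read into a numeric matrix with one row per gene in gene-number order; the function names are hypothetical and the toy matrix is illustrative only.

```python
import numpy as np

MISSING = -666.0  # placeholder used for missing values in the local copies


def mask_missing(matrix, missing=MISSING):
    """Return a float copy in which placeholder entries are replaced by NaN."""
    data = np.asarray(matrix, dtype=float).copy()
    data[data == missing] = np.nan
    return data


def genes_with_profiles(data):
    """Return the 1-based gene numbers of rows with at least one measured value."""
    has_value = ~np.all(np.isnan(data), axis=1)
    return np.flatnonzero(has_value) + 1


# Toy matrix: 3 genes x 4 conditions; gene 2 has no expression profile.
toy = [[0.1, -666, 0.3, 0.4],
       [-666, -666, -666, -666],
       [1.0, 0.5, -0.2, 0.0]]
data = mask_missing(toy)
print(genes_with_profiles(data))  # [1 3]
```

Replacing the placeholder with NaN is only one option; imputing values or dropping sparse conditions are equally valid choices.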

How to test your measure

To test a new similarity measure, download an expression data set and compare each pair of expression profiles (you can ignore genes with missing expression profiles). Generate a list of pairs, in this format, and upload it to the server. The list does not have to be sorted, although sorting it will expedite processing. Note that only the top 20,000 similarities are considered, so you can upload a file containing only the first 20,000 pairs.
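
The procedure above can be sketched as follows. This is an illustration under stated assumptions, not the server's reference implementation: Pearson correlation stands in for "your measure", missing values are assumed to be encoded as NaN, and the tab-separated `gene_i gene_j score` layout written by `write_pairs` is a hypothetical stand-in for the required upload format, which should be checked before uploading. All function names are hypothetical.

```python
import heapq
from itertools import combinations

import numpy as np


def pearson_similarity(x, y, min_overlap=3):
    """Pearson correlation over conditions measured in both profiles.

    Returns None when too few conditions overlap or a profile is constant.
    """
    ok = ~np.isnan(x) & ~np.isnan(y)
    if ok.sum() < min_overlap:
        return None
    xs, ys = x[ok], y[ok]
    if xs.std() == 0 or ys.std() == 0:
        return None
    return float(np.corrcoef(xs, ys)[0, 1])


def scored_pairs(data, similarity):
    """Yield (score, gene_i, gene_j) with 1-based gene numbers, skipping empty rows."""
    rows = [i for i in range(data.shape[0]) if not np.all(np.isnan(data[i]))]
    for i, j in combinations(rows, 2):
        score = similarity(data[i], data[j])
        if score is not None:
            yield score, i + 1, j + 1


def top_pairs(data, similarity=pearson_similarity, top_n=20000):
    """Return the top_n most similar pairs as (gene_i, gene_j, score), best first."""
    best = heapq.nlargest(top_n, scored_pairs(data, similarity))
    return [(i, j, score) for score, i, j in best]


def write_pairs(pairs, path):
    """Write one pair per line; the column layout here is an assumption."""
    with open(path, "w") as handle:
        for i, j, score in pairs:
            handle.write(f"{i}\t{j}\t{score:.6f}\n")


# Toy data: genes 1 and 2 are perfectly correlated, gene 4 has no profile.
toy = np.array([[1.0, 2.0, 3.0, 4.0],
                [2.0, 4.0, 6.0, 8.0],
                [4.0, 3.0, 2.0, 1.0],
                [np.nan, np.nan, np.nan, np.nan]])
for gene_i, gene_j, score in top_pairs(toy, top_n=1):
    print(gene_i, gene_j, round(score, 3))
```

With 5902 profiled genes there are roughly 17 million pairs, so the bounded heap keeps only the best 20,000 in memory; a pure-Python loop over all pairs is slow, and a vectorized computation may be preferable for real data.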

Results

Our scripts will compare your method to other existing methods and will generate an ROC curve, as in this figure.
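
To give a sense of how such a curve can be built from a ranked list, here is a minimal sketch. It assumes a reference set of gene pairs known to be functionally related (for example, from one of the validation data sets) and counts true and false positives while walking down the ranking. The function names and toy data are hypothetical, and the server's own scripts may compute the curve differently.

```python
def roc_points(ranked_pairs, related):
    """Return cumulative (false_positives, true_positives) after each ranked pair."""
    points = [(0, 0)]
    tp = fp = 0
    for i, j in ranked_pairs:
        if frozenset((i, j)) in related:
            tp += 1
        else:
            fp += 1
        points.append((fp, tp))
    return points


def roc_area(points):
    """Area under the step curve, normalized by the positives and negatives in the list."""
    total_fp, total_tp = points[-1]
    if total_fp == 0 or total_tp == 0:
        return None
    area = 0
    for (fp0, _), (fp1, tp1) in zip(points, points[1:]):
        area += (fp1 - fp0) * tp1  # add a column each time a false positive occurs
    return area / (total_fp * total_tp)


# Toy ranking (most similar first) and a toy reference set of related pairs.
ranked = [(1, 2), (3, 4), (1, 3), (2, 5)]
related = {frozenset((1, 2)), frozenset((1, 3))}
points = roc_points(ranked, related)
print(points)            # [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2)]
print(roc_area(points))  # 0.75
```

Note that this area is normalized over the pairs in the ranked list only, not over all possible gene pairs, so values are comparable only between lists of the same length.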