Computational methods for discovering sequence elements that are enriched in a target set
compared to a background set are fundamental in molecular biology research. One
example is the discovery of transcription factor binding motifs that are
inferred from ChIP-chip (Chromatin Immuno-Precipitation on a microarray)
measurements.
Several major challenges in sequence motif discovery still require
consideration: (i) the need for a principled approach to partitioning the data
into target and background sets; (ii) the lack of rigorous models and of an
exact p-value for measuring motif enrichment; (iii) the need for an appropriate
framework for accounting for motif multiplicity; and (iv) the tendency of many
existing methods to report seemingly significant motifs even when applied
to randomly generated data. In this study, we present a statistical framework
for discovering enriched sequence elements in ranked lists that resolves the
above four issues. Based on this framework we developed a software application,
termed DRIM (Discovery of Rank Imbalanced Motifs), which identifies sequence
motifs in lists of ranked DNA sequences. We applied DRIM to ChIP-chip and CpG
methylation data and obtained the following results: (i) Identification of 50 novel
putative transcription factor (TF) binding sites in yeast ChIP-chip data. The
biological function of some of them was investigated further and used
to gain new insights into transcription regulation networks in yeast. For
example, our discoveries enable the elucidation of the network of the TF ARO80. Another
finding concerns a systematic enhancement of TF binding to sequences containing CA
repeats, which suggests that these repetitive elements play a mechanistic role in TF
binding. (ii) Discovery of novel motifs in human cancer CpG methylation data.
Remarkably, most of these motifs are similar to DNA sequence elements bound by
the Polycomb complex that promotes histone methylation. Our findings thus
support a model in which histone methylation and CpG methylation are
mechanistically linked. Overall, we demonstrate that our statistical framework,
as embodied in the DRIM software tool, is highly effective for identifying
regulatory sequence elements in a variety of applications ranging from
expression and ChIP-chip to CpG methylation data. DRIM is publicly available at:
http://bioinfo.cs.technion.ac.il/drim.
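
To make challenge (i) concrete, the following minimal sketch shows one way to score motif enrichment in a ranked list without fixing the target/background partition in advance: scan every cutoff and keep the smallest hypergeometric tail probability. This is an illustration under stated assumptions, not necessarily the procedure implemented in DRIM. The function names `hypergeom_tail` and `min_hypergeom_score` and the binary `occurrence` vector (1 if the motif occurs in the sequence at that rank, 0 otherwise) are hypothetical, and motif multiplicity is ignored.

```python
from math import comb


def hypergeom_tail(N, B, n, b):
    """P(X >= b) for X ~ Hypergeometric(N items, B marked, n drawn)."""
    upper = min(n, B)
    hits = sum(comb(B, k) * comb(N - B, n - k) for k in range(b, upper + 1))
    return hits / comb(N, n)


def min_hypergeom_score(occurrence):
    """Scan all cutoffs of a ranked 0/1 list; return (best score, best cutoff).

    occurrence[i] is 1 if the motif occurs in the sequence ranked i, else 0.
    The returned score is the minimum tail probability over cutoffs. It is
    NOT itself a p-value, because the cutoff was chosen to minimize it.
    """
    N = len(occurrence)
    B = sum(occurrence)
    best_score, best_cutoff = 1.0, 0
    b = 0  # motif occurrences among the top n sequences
    for n in range(1, N):
        b += occurrence[n - 1]
        score = hypergeom_tail(N, B, n, b)
        if score < best_score:
            best_score, best_cutoff = score, n
    return best_score, best_cutoff


if __name__ == "__main__":
    ranked = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # toy data, motif at the top
    print(min_hypergeom_score(ranked))
```

Because the cutoff is selected to minimize the score, the minimum must be corrected before it is read as significance. Multiplying it by the number of cutoffs scanned gives a conservative upper bound on the p-value; an exact p-value, as called for in challenge (ii), requires a dedicated calculation that this sketch does not attempt.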