Problem: How to find meaningful patterns in noisy microarray data and formulate hypotheses that explain the observations?

Solution: Search for statistically significant changes across appropriately defined groups of genes.

**More sensitive and robust approach than looking at independently defined “differentially expressed” genes**

[Figure: a group of entities G1, G2, G3 (genes, metabolites, *etc.*) with their experimental measurements v1, v2, v3 (expression, metabolomics, *etc.*)]

{v1, v2, v3}: the collection of measured values for all entities in the group

[Figure: two plots of frequency vs. abs(log ratio), each showing the distribution on the array (sampling distribution); the second plot marks the bias]

Example 1: the collection of observations is random; the group is insignificant

Example 2: the observations are (overall) biased; the group is significant

Compare the collection of the observed values V={v1, …, vn} to the set of all values S measured in the experiment.

*The p-value, i.e. the probability of observing a collection at least as biased as V if V were a random sample from S, quantifies the significance of the group of genes in the experiment.*

[Figure: observed values v1 … vn, two panels]

Distributions are generally non-Gaussian

How to quantify the bias and calculate the *p*-value? For a general sampling distribution (*e.g.* the clearly non-Gaussian distribution of absolute log-ratios), *non-parametric* tests should be used.

Non-parametric representation (rank tests)

[Figure: all observations, ordered (*e.g.* by absolute log-ratio), with the observations within a group (v1 … vn) marked; two panels]

Group is insignificant

Group is significant (the ranks of measurements from the group are overall high in the ordered list)

Non-parametric tests:

- Kolmogorov-Smirnov (modified): GSEA
- Mann-Whitney: Differentially Expressed Groups/Pathways (PE)

These tests are capable of detecting subtle but consistent changes exhibited within a group (*e.g.* a GO group) or a pathway.
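As a rough illustration of the rank-based idea, the sketch below computes a one-sided Mann-Whitney *p*-value for a group against the remaining measurements. It is a minimal sketch and not the implementation used by any of the tools named here: the function names and the toy numbers are made up for illustration, the *p*-value uses the normal approximation without a tie correction, and comparing the group against the measurements outside the group is an assumption.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_greater(group, background):
    """One-sided p-value that `group` values tend to rank above `background`.

    Normal approximation, no tie correction (an assumption of this sketch).
    """
    n1, n2 = len(group), len(background)
    ranks = average_ranks(list(group) + list(background))
    rank_sum = sum(ranks[:n1])                 # rank sum of the group
    u = rank_sum - n1 * (n1 + 1) / 2.0         # Mann-Whitney U statistic
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail probability

# Toy absolute log-ratios (made-up numbers, for illustration only).
background = [0.05 * i for i in range(1, 41)]
biased_group = [1.81, 1.86, 1.91, 1.96, 2.05]   # ranks are overall high
mixed_group = [0.22, 0.63, 1.02, 1.43, 1.84]    # ranks are spread out

p_biased = mann_whitney_greater(biased_group, background)
p_mixed = mann_whitney_greater(mixed_group, background)
print(f"biased group p = {p_biased:.4g}, mixed group p = {p_mixed:.4g}")
```

In this toy input the group whose values sit at the top of the ordered list gets a small *p*-value, while the group spread evenly through the list does not.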

- Groups/pathways in FDEG/FDEN tests are externally defined (*e.g.* Gene Ontology, curated or user-defined pathways)
- Integration of the additional information provided by a biomolecular network allows defining “generic” groups:
  - “Group” = all targets of a transcriptional regulator
  - “Group” = all small molecules/metabolites affected by an enzyme or signaling molecule
  - Etc.
- Targets of a particular regulator *R* comprise a significant group = indication that *R* itself is “significant” (activated/inhibited, explains the observed pattern)
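A minimal sketch of how such “generic” groups could be derived from a network, assuming the network is available as a list of (regulator, target) pairs. The function name, the edge list, and the entity names are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

def groups_from_network(edges, measured):
    """Map each regulator to the set of its measured downstream targets."""
    groups = defaultdict(set)
    for regulator, target in edges:
        if target in measured:          # keep only entities with a measurement
            groups[regulator].add(target)
    return dict(groups)

# Hypothetical directed relations: (regulator, target).
edges = [
    ("R1", "g1"), ("R1", "g2"),   # R1: a transcriptional regulator
    ("R2", "g3"), ("R2", "g4"),   # g4 is not measured in this toy experiment
    ("E1", "m1"),                 # E1: an enzyme affecting metabolite m1
]
measured = {"g1", "g2", "g3", "m1"}

groups = groups_from_network(edges, measured)
print(groups)
```

Each resulting group can then be tested for significance in the same way as an externally defined group.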

[Figure labels: significant, insignificant]

- **Single regulator**: changes in the measured level of a target suggest involvement of the upstream regulator (A)
- **Multiple regulators**: not much information with respect to which particular regulator is involved (B)

Target exhibiting large change

- *Null* hypothesis:
  - *Connections* are made randomly
  - *Degree distribution* is preserved.

To get the distribution expected by chance:

- Break all the links in the whole network
- Reconnect randomly and get the statistics
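The brute-force procedure can be sketched as follows, assuming the network is a list of (regulator, target) pairs and that a regulator is scored by the mean measured value of its targets. Both the scoring choice and all names and numbers are assumptions made for illustration. Rewiring is done by shuffling the target ends of the edges, which keeps every regulator's out-degree and every target's in-degree but may produce duplicate edges.

```python
import random
from collections import Counter

def rewire(edges, rng):
    """Break all links and reconnect randomly by shuffling target endpoints.

    Out-degrees of regulators and in-degrees of targets are preserved;
    duplicate regulator-target pairs may appear (a simplification).
    """
    regulators = [r for r, _ in edges]
    targets = [t for _, t in edges]
    rng.shuffle(targets)
    return list(zip(regulators, targets))

def regulator_score(edges, values, regulator):
    """Mean measured value over the targets linked to `regulator`."""
    linked = [values[t] for r, t in edges if r == regulator]
    return sum(linked) / len(linked)

def empirical_p_value(edges, values, regulator, n_iter=2000, seed=0):
    """Fraction of rewired networks scoring at least as high as observed."""
    rng = random.Random(seed)
    observed = regulator_score(edges, values, regulator)
    hits = sum(
        regulator_score(rewire(edges, rng), values, regulator) >= observed
        for _ in range(n_iter)
    )
    return (hits + 1) / (n_iter + 1)   # add-one correction avoids p = 0

# Toy network and absolute log-ratios (made-up, for illustration only).
edges = [("R1", "g1"), ("R1", "g2"), ("R2", "g3"), ("R2", "g4"),
         ("R3", "g3"), ("R3", "g5"), ("R3", "g6")]
values = {"g1": 2.1, "g2": 1.8, "g3": 0.2, "g4": 0.3, "g5": 0.1, "g6": 0.4}

p_r1 = empirical_p_value(edges, values, "R1")
print(f"empirical p-value for R1: {p_r1:.3f}")
```

For realistic networks this resampling is expensive, because the whole network is rewired for every iteration.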

*Corollary:* during random network rewiring, each regulator “sees” all the edges going into a particular target *T*. The probability of randomly reconnecting to *T* is proportional to *indegree(T)*.

*Sampling approximation (instead of brute-force resampling):* Increased probability to reconnect to a “promiscuous” target *T* = increased probability to observe the measurement associated with *T* downstream.

Proposition (used in NEA): instead of all measurements on the array, use an effective sampling distribution, where the measured value for each target *T* is replicated *indegree(T)* times.
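A minimal sketch of how such an effective sampling distribution could be assembled, assuming the network is a list of (regulator, target) pairs and the measurements are a mapping from target to value. The function name and the toy data are hypothetical. Read literally, the proposition means a target with no incoming edges is replicated zero times and so does not enter the effective sample.

```python
from collections import Counter

def effective_sample(edges, values):
    """Replicate each target's measured value indegree(T) times."""
    indegree = Counter(t for _, t in edges if t in values)
    sample = []
    for target, k in indegree.items():
        sample.extend([values[target]] * k)
    return sample

# Toy network and measurements (made-up, for illustration only).
edges = [("R1", "g1"), ("R1", "g2"), ("R2", "g3"), ("R2", "g4"),
         ("R3", "g3"), ("R3", "g5"), ("R3", "g6")]
values = {"g1": 2.1, "g2": 1.8, "g3": 0.2, "g4": 0.3, "g5": 0.1, "g6": 0.4}

effective = effective_sample(edges, values)
print(sorted(effective))
```

Here g3 has two incoming edges, so its value appears twice. A group of targets can then be compared against this effective sample with a rank test instead of rewiring the network repeatedly.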

© 2011 Ariadne. All Rights Reserved.

**Sub-Network Enrichment Analysis: as good as your knowledge network database**

Lower p-value (more significant)

Higher p-value (less significant)

- **SNEA builds networks from all genes/proteins measured in the experiment using all relations in the database.**
- **SNEA can include indirect regulation, i.e. expression regulatory cascades consisting of 2-3 steps.**
- **Significant network centers may be found that are not measured in the primary dataset.**
- **No prior curation of gene sets is required.**
- **Can work with partial information about TF targets. Does not require knowledge of all targets of a TF.**
- **P-value is sensitive to the size of the chip.**

**Molecular networks in microarray analysis. Sivachenko A, Yuryev A, Daraselia N, Mazo I. J Bioinform Comput Biol. 2007.**


- Option “Expression targets” measures the activity of an expression regulator, calculated from the differential expression of its targets
- Option “Binding partners” finds differentially expressed protein complexes
- Option “miRNA targets” measures the activity of a microRNA, calculated from the differential expression of its targets
- Option “Disease ExpressionChange targets” diagnoses disease or toxicity