Enrichment Analysis

Problem: How can we find meaningful patterns in noisy microarray data and formulate hypotheses that explain the observations?

Solution: Search for statistically significant changes across appropriately defined groups of genes.

This approach is more sensitive and robust than looking at independently defined “differentially expressed” genes.

Figure: a group of entities G1, G2, G3 (genes, metabolites, etc.) with experimental measurements v1, v2, v3 (expression, metabolomics, etc.). {v1, v2, v3} is the collection of measured values for all entities in the group.

Quantifying “significance” of the group

Figure (two panels; axes: frequency vs. abs(log ratio); curve: distribution on the array, i.e. the sampling distribution).

Example 1: the collection of observations is random; the group is insignificant.

Example 2: the observations are (overall) biased; the group is significant.

Compare the collection of the observed values V={v1, …, vn} to the set of all values S measured in the experiment.

The p-value, i.e. the probability of observing values at least as extreme as V if V were a random sample from S, quantifies the significance of the group of genes in the experiment.
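The comparison of V against S can be illustrated with a simple resampling estimate. This is only a sketch under stated assumptions: the function name `resampling_p_value`, the choice of the mean as the test statistic, and the number of draws are all hypothetical and are not taken from the slides.

```python
import random
from statistics import mean

def resampling_p_value(group_values, all_values, n_draws=10000, seed=0):
    """Estimate how often a random sample from S looks at least as extreme as V.

    Assumption: "extreme" means a mean absolute log-ratio at least as large
    as the mean of the observed group.
    """
    rng = random.Random(seed)
    n = len(group_values)
    observed = mean(group_values)
    hits = 0
    for _ in range(n_draws):
        sample = rng.sample(all_values, n)  # draw n values from S without replacement
        if mean(sample) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_draws + 1)

# Toy data: S holds 100 absolute log-ratios; both groups are made up.
S = [i / 100 for i in range(100)]
biased_group = [0.95, 0.96, 0.97, 0.98, 0.99]
unbiased_group = [0.1, 0.3, 0.5, 0.7, 0.9]
p_biased = resampling_p_value(biased_group, S)
p_unbiased = resampling_p_value(unbiased_group, S)
```

A group drawn from the top of the distribution gets a small estimated p-value, while a group spread evenly across S does not.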


Distributions are generally non-Gaussian

Statistical significance tests

How do we quantify the bias and calculate the p-value? For a general sampling distribution (e.g. the clearly non-Gaussian distribution of absolute log-ratios), non-parametric tests should be used.

Non-parametric representation (rank tests)

Figure (two panels): all observations, ordered (e.g. by absolute log-ratio), with the observations v1 … vn within a group marked. In one panel the group is insignificant; in the other the group is significant (the ranks of measurements from the group are overall high in the ordered list).

Non-parametric tests:
Kolmogorov-Smirnov (modified) → GSEA
Mann-Whitney → Differentially Expressed Groups/Pathways (PE)

These tests are capable of detecting subtle but consistent changes exhibited within a group (e.g. a GO group) or a pathway.
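As an illustration of the rank-based idea, here is a minimal sketch of a one-sided Mann-Whitney test that compares the values in a group with all other values on the array. It uses the normal approximation without a tie correction, which is an assumption made for brevity; the function names are hypothetical, and in practice a library routine such as `scipy.stats.mannwhitneyu` would normally be preferred.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_greater(group, rest):
    """One-sided test: are ranks in `group` higher than in `rest`?

    Assumption: normal approximation, no tie correction.
    Returns the U statistic of the group and the upper-tail p-value.
    """
    n1, n2 = len(group), len(rest)
    ranks = average_ranks(list(group) + list(rest))
    rank_sum = sum(ranks[:n1])               # ranks belonging to the group
    u = rank_sum - n1 * (n1 + 1) / 2         # pairs where group value > rest value
    mu = n1 * n2 / 2                         # expected U under the null
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = 0.5 * math.erfc(z / math.sqrt(2))    # P(Z >= z) for a standard normal
    return u, p

# Toy data: absolute log-ratios for the rest of the array and two made-up groups.
rest = [i / 10 for i in range(50)]
u_high, p_high = mann_whitney_greater([9.0, 9.5, 10.0, 8.5, 9.2], rest)
u_mixed, p_mixed = mann_whitney_greater([1.05, 2.05, 3.05], rest)
```

The first group sits entirely above the rest of the array and gets a small p-value; the second is scattered through the ordered list and does not.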

Network Enrichment Analysis

• Groups/pathways in FDEG/FDEN tests are externally defined (e.g. Gene Ontology, curated or user-defined pathways).
• Integrating the additional information provided by a biomolecular network allows “generic” groups to be defined:
– “Group” = all targets of a transcriptional regulator
– “Group” = all small molecules/metabolites affected by an enzyme or signaling molecule
– etc.
• If the targets of a particular regulator R form a significant group, this indicates that R itself is “significant” (activated or inhibited, and so explains the observed pattern).
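The network-derived groups described above can be sketched as follows. This is a hedged illustration only: the edge-list representation, the function names, and the identifiers `R1`, `T1`, etc. are hypothetical and not taken from the slides.

```python
from collections import defaultdict

def groups_from_network(edges):
    """Map each regulator to the set of its direct targets.

    `edges` is a list of (regulator, target) pairs.
    """
    groups = defaultdict(set)
    for regulator, target in edges:
        groups[regulator].add(target)
    return dict(groups)

def group_values(group, measurements):
    """Collect measured values for the group members present in the experiment."""
    return [measurements[g] for g in sorted(group) if g in measurements]

# Hypothetical network and measurements (absolute log-ratios).
edges = [("R1", "T1"), ("R1", "T2"), ("R2", "T2")]
measurements = {"T1": 1.5, "T2": 0.2}
groups = groups_from_network(edges)
values_r1 = group_values(groups["R1"], measurements)
```

Each resulting collection of values can then be passed to a group significance test, with the result attributed to the regulator that defines the group.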

Figure: two examples, one labeled significant and one labeled insignificant.

Effect of Network Topology

Single regulator: a change in the measured level of a target suggests involvement of the upstream regulator (A).
Multiple regulators: a change in the target carries little information about which particular regulator is involved (B).
Figure (panels A and B). Legend: target exhibiting large change.

NEA incorporates connectivity corrections

Null hypothesis:
• Connections are made randomly.
• The degree distribution is preserved.

To get the distribution expected by chance:
• Break all the links in the whole network.
• Reconnect randomly and compute the statistics.
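The brute-force procedure above can be sketched as a shuffle of edge endpoints. This is one simple way to rewire while preserving degrees, not necessarily the procedure used in NEA; the function name is hypothetical, and the sketch assumes that duplicate edges created by the shuffle are acceptable.

```python
import random

def rewire_preserving_degrees(edges, seed=None):
    """Break all links and reconnect them at random.

    Each regulator keeps its number of outgoing edges and each target keeps
    its number of incoming edges, because only the pairing is shuffled.
    Assumption: repeated (regulator, target) pairs are allowed.
    """
    rng = random.Random(seed)
    regulators = [r for r, _ in edges]
    targets = [t for _, t in edges]
    rng.shuffle(targets)  # every incoming edge slot of a target is reassigned
    return list(zip(regulators, targets))

# Hypothetical network: target "A" has three incoming edges.
edges = [("R1", "A"), ("R1", "B"), ("R2", "A"),
         ("R3", "C"), ("R2", "C"), ("R3", "A")]
rewired = rewire_preserving_degrees(edges, seed=1)
```

Repeating the rewiring many times and recomputing the group statistic for each regulator gives the distribution expected by chance. Because a target with many incoming edges occupies many slots in the shuffled list, a regulator is more likely to be reconnected to it.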

Corollary: during random network rewiring, each regulator “sees” all the edges incoming to a particular target T. The probability of randomly reconnecting to T is therefore proportional to indegree(T).

Sampling approximation (instead of brute-force resampling): an increased probability of reconnecting to a “promiscuous” target T means an increased probability of observing the measurement associated with T downstream.

Proposition (used in NEA): instead of all measurements on the array, use an effective sampling distribution in which the measured value for each target T is replicated indegree(T) times.
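A minimal sketch of building such an effective sampling distribution is shown below. The function name and the data layout (an edge list plus a dictionary of measurements) are assumptions for illustration; entities with no incoming edges are replicated zero times and so drop out of the distribution.

```python
from collections import Counter

def effective_sampling_distribution(edges, measurements):
    """Replicate each measured value indegree(T) times.

    `edges` is a list of (regulator, target) pairs and `measurements`
    maps an entity to its measured value (e.g. absolute log-ratio).
    """
    indegree = Counter(target for _, target in edges)
    effective = []
    for target, value in measurements.items():
        # Counter returns 0 for entities that are never a target.
        effective.extend([value] * indegree[target])
    return effective

# Hypothetical example: "A" has two regulators, "B" one, "C" none.
edges = [("R1", "A"), ("R2", "A"), ("R1", "B")]
measurements = {"A": 2.0, "B": 0.5, "C": 1.0}
effective = effective_sampling_distribution(edges, measurements)
```

A group's values are then compared against `effective` rather than against the raw list of all measurements, so that promiscuous targets are weighted by how often a random regulator would reach them.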


© 2011 Ariadne. All Rights Reserved.

Sub-Network Enrichment Analysis: as good as your knowledge network database

Figure legend: lower p-value (more significant) to higher p-value (less significant).

SNEA builds networks from all genes/proteins measured in the experiment using all relations in the database.
SNEA can include indirect regulation, i.e. expression regulatory cascades consisting of 2-3 steps.
Significant network centers may be found that are not measured in the primary dataset.
No prior curation of gene sets is required.
SNEA can work with partial information about TF targets; it does not require knowledge of all targets of a TF.
The p-value is sensitive to the size of the chip.

Molecular networks in microarray analysis. Sivachenko A, Yuryev A, Daraselia N, Mazo I. J Bioinform Comput Biol. 2007.


Biological interpretation of SNEA options

• Option “Expression targets” measures the activity of an expression regulator, calculated from the differential expression of its targets.
• Option “Binding partners” finds differentially expressed protein complexes.
• Option “miRNA targets” measures the activity of a microRNA, calculated from the differential expression of its targets.
• Option “Disease ExpressionChange targets” diagnoses disease or toxicity.