Gene expression analysis algorithms

General overview

Algorithms described in this section perform the analysis of gene expression data in the context of biomolecular network. Taking the information about differential expression between two experimental conditions as the input, these algorithms find the biological networks and pathways with proteins exhibiting the most significant changes in expression level.
Four expression analysis algorithms are implemented in Pathway Studio Enterprise. 

  • Gene set enrichment analysis (GSEA) evaluates the significance of expression changes among existing networks or groups (i.e. networks or groups already present in the database). The algorithm helps find the existing pathway or group most likely affected in the experiment.
  • The sub-network enrichment analysis (SNEA) algorithm builds small networks, each consisting of a single “regulator” gene and its targets, and then evaluates the significance of the target expression levels in every network built.  SNEA finds the individual “regulators” that most likely affect differentially expressed targets, and thus provides the most plausible explanation for the observed expression changes.

None of the algorithms requires a user-specified p-value or fold-change cutoff to define differentially expressed genes. Instead, the algorithms use the complete distribution of measured expression log-ratios, as described below.  Because no measurement is discarded by an arbitrary threshold, the tests are more sensitive and more robust than cutoff-based approaches.
The algorithms can be applied only to expression experiments that have a column with log-ratios for differential expression. To use the algorithms, press the ‘Analyze experiment’ button in the Expression experiment view, or in the Explorer view displaying expression experiments (one experiment must be selected).  One of the GSEA, FDEG, or SNEA/BDEN algorithms must be selected in the ‘Analysis type’ combo box, and the name of the log-ratio column in the experimental data should be set in the ‘Sample column name’ combo box.  The number and quality of the networks returned are controlled by the ‘p-value Range’ and ‘Max networks’ parameters, which restrict the result to networks within the specified p-value range or to the specified number of top-scoring networks, respectively. If the additional parameter ‘Genes of interest’ is specified, the output is further filtered to display only the networks containing the selected genes of interest.

Gene Set Enrichment analysis using pathway collections or ontologies in the database.

There are no algorithm-specific additional parameters for this algorithm.
The algorithm scans all networks (pathways) in the database and, for every network, uses the expression values available from the experiment being analyzed (missing values are allowed).  The network must contain at least one protein with a measured expression value.  The null hypothesis assumes that there is no association between the log-ratio of a gene and the fact that the gene belongs to the network under consideration.  In other words, the collection of log-ratio values observed for the genes in the network is assumed to be a random sample from the distribution of all log-ratios measured in the experiment.  A non-parametric one-sided Mann-Whitney U-test is used to calculate the probability that the sample is randomly drawn from the log-ratio distribution.  This probability is the p-value assigned to the network by the GSEA algorithm.  Small p-values indicate that the genes present in a network have significantly elevated absolute log-ratios compared with the distribution of all log-ratios measured in the experiment (see Figure 1).
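The test described above can be illustrated with a minimal Python sketch. It is not the Pathway Studio implementation: the function names, the use of a normal approximation with tie correction (rather than an exact test), and the choice to compare the network genes against the measured genes outside the network are all assumptions made for illustration.

```python
import math

def midranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks

def mann_whitney_greater(sample, reference):
    """One-sided p-value that `sample` tends to exceed `reference`.

    Assumption: normal approximation with tie correction and no
    continuity correction.
    """
    n1, n2 = len(sample), len(reference)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    combined = list(sample) + list(reference)
    ranks = midranks(combined)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0   # U statistic of `sample`
    n = n1 + n2
    tie_counts = {}
    for value in combined:
        tie_counts[value] = tie_counts.get(value, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_counts.values())
    variance = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return 1.0                              # all values identical
    z = (u - n1 * n2 / 2.0) / math.sqrt(variance)
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of the normal

def gsea_p_value(network_genes, log_ratios):
    """p-value for one network; genes whose log-ratio is None are skipped."""
    members = set(network_genes)
    measured = {g: abs(v) for g, v in log_ratios.items() if v is not None}
    inside = [v for g, v in measured.items() if g in members]
    outside = [v for g, v in measured.items() if g not in members]
    if not inside or not outside:
        return None                             # nothing to compare
    return mann_whitney_greater(inside, outside)

# Hypothetical experiment: gene J has no measured value.
example_log_ratios = {"A": 3.0, "B": -2.5, "C": 2.8, "D": 0.1, "E": -0.2,
                      "F": 0.05, "G": 0.3, "H": -0.15, "I": 0.25, "J": None}
p_affected = gsea_p_value({"A", "B", "C", "J"}, example_log_ratios)
p_quiet = gsea_p_value({"D", "E", "F"}, example_log_ratios)
```

In this toy data set the network containing A, B and C receives a small p-value because its absolute log-ratios rank above all others, while the network D, E, F does not.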

The output page of the GSEA algorithm shows the networks sorted by p-value in ascending order, along with the total number of genes in each network and the number of genes with measured expression values.

Sub-network enrichment analysis algorithm.

To start the SNEA algorithm, an additional parameter must be specified: the relation type(s) (a set of checkboxes becomes visible when Algorithm type is set to SNEA). The allowed relation types are PromoterBinding, Regulation, DirectRegulation, Expression, ProtModification, MolSynthesis, and MolTransport. All relations of the specified type(s) found in the database will be used by the algorithm.  We recommend starting the analysis with the smallest and most reliable network, that of PromoterBinding relations.  If the results appear unsatisfactory, add Expression relations and then Regulation relations.  The other relation types are irrelevant for the analysis of expression data but can be used for the analysis of proteomics data such as phosphoprofiles (add ProtModification), protein level or enzymatic activity measurements (add DirectRegulation, MolTransport), and metabolic profiles (add MolSynthesis and MolTransport).
The SNEA algorithm scans all genes in the entire network database defined by the specified relation type(s).  For each “regulator” gene (i.e. a gene that has at least one outgoing relation of the specified type), all its “target” genes are selected, that is, all the genes that have relations of the specified types from the “regulator”.  Regulatory self-loops are discarded, so a regulator gene does not appear among its own targets.  Next, the absolute values of the expression log-ratios for all targets are extracted.  Missing values are allowed, and the expression data for the regulator itself is not used.  The null hypothesis is the same as in the GSEA algorithm: the sample, which now consists of the log-ratios of the targets, is drawn randomly from the distribution.  In other words, the null hypothesis states that the collection of targets for a given regulator does not exhibit any unusual behavior compared with the entire expression dataset.  The Mann-Whitney test is used to evaluate the significance (p-value) of the deviation of the sample from the distribution (Figure 1).  Thus, in the SNEA algorithm, smaller p-values indicate more significant deviations of the target log-ratio distribution from the distribution of the entire dataset.
The output page of the SNEA algorithm shows the generated networks, each consisting of a regulator and all its targets.  The networks are named after the regulator gene.  Networks are sorted by p-value in ascending order, and each network is shown with the total number of targets and the number of targets with a log-ratio available in the experimental data.
Depending on the relation type(s), the result can have different biological interpretations. For instance, if the analysis is restricted to PromoterBinding relations, then the top-scoring networks reveal transcription factors (“regulators”) with significant changes in the downstream target expression levels, thus providing the most plausible upstream regulator candidates that were induced in the experiment.  If ProtModification is specified for the analysis of phosphoprofiles, the results are interpreted as the kinases responsible for phosphorylating the protein targets.  If MolSynthesis is chosen to explain metabolic profiles, the results are interpreted as the enzymes producing the most affected metabolites.  If MolTransport is chosen to explain metabolic profiles, the results are interpreted as the transporters exporting or importing the most affected metabolites.
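The construction of regulator-centered sub-networks described in this section can be sketched as follows. This is an illustrative sketch only: the representation of relations as (regulator, target, relation type) tuples, the function names, and the example gene names are assumptions, not the Pathway Studio data model.

```python
from collections import defaultdict

def build_subnetworks(relations, allowed_types):
    """Map each regulator to the set of its targets.

    `relations` is an iterable of (regulator, target, relation_type)
    tuples (a hypothetical representation). Relations of other types
    and regulatory self-loops are discarded.
    """
    targets = defaultdict(set)
    for regulator, target, relation_type in relations:
        if relation_type not in allowed_types:
            continue
        if regulator == target:
            continue  # a regulator never appears among its own targets
        targets[regulator].add(target)
    return dict(targets)

def target_abs_log_ratios(subnetworks, log_ratios):
    """Collect |log-ratio| of the measured targets of every regulator.

    The regulator's own expression value is deliberately not used.
    """
    summary = {}
    for regulator, targets in subnetworks.items():
        values = [abs(log_ratios[t]) for t in sorted(targets)
                  if log_ratios.get(t) is not None]
        summary[regulator] = {
            "n_targets": len(targets),    # all targets in the database
            "n_measured": len(values),    # targets with a log-ratio
            "values": values,             # sample passed to the test
        }
    return summary

# Hypothetical relations and measurements.
example_relations = [
    ("TF1", "G1", "PromoterBinding"),
    ("TF1", "G2", "PromoterBinding"),
    ("TF1", "TF1", "PromoterBinding"),   # self-loop, discarded
    ("TF1", "G3", "Expression"),         # type not selected below
    ("TF2", "G2", "PromoterBinding"),
    ("K1", "G1", "ProtModification"),
]
example_log_ratios = {"G1": -2.0, "G2": None, "TF1": 5.0}

subnets = build_subnetworks(example_relations, {"PromoterBinding"})
summary = target_abs_log_ratios(subnets, example_log_ratios)
```

The two counts kept per regulator correspond to the two numbers the output page reports for each network; the `values` list is the sample that would be passed to the significance test.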

Sub-network enrichment analysis sampling distribution

The FSN algorithm uses a special sampling distribution.  In existing pathways, genes are grouped together based on external information and, generally speaking, regardless of whether they are related to each other.  In contrast, the regulator-target relations used by the FSN algorithm are provided by the network itself.  This affects the expected distribution of target expression log-ratios under the null hypothesis.  A randomization is required that breaks all the relations (edges in the graphical network representation) and then randomly reconnects the dangling edge halves (Figure 2).  This randomization approach is called stump reconnection.  In other words, the regulator “forgets” all n real targets and randomly picks n new targets from the entire set of all targets.  Importantly, during stump reconnection the regulator “sees” not the targets themselves but the broken relation ends it reconnects to.  In the FSN algorithm this randomization process is approximated by a sampling distribution to speed up the computation.  Because the chance of reconnecting to a target t is proportional to the total number m(t) of relations in which t is the target, the sampling distribution is built from the expression log-ratios of all genes, with the value for each gene g replicated m(g) times.  The advantages of this approach are demonstrated in the figure below.  Consider a target that is downstream of multiple regulators in the network.  Even if this target exhibits a very high or very low log-ratio, it adds little confidence to each individual “regulator-targets” network.  Inflating the sampling distribution with the replicas helps suppress the bias due to such promiscuously regulated genes.
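The replication rule can be made concrete with a short Python sketch. The representation of relations as (regulator, target) pairs, the use of absolute log-ratios, and the function name are assumptions made for illustration; the sketch only shows how a gene that is the target of m(g) relations contributes m(g) copies of its value.

```python
from collections import Counter

def sampling_distribution(relations, log_ratios):
    """Build the weighted sampling distribution of |log-ratio| values.

    `relations` is an iterable of (regulator, target) pairs (hypothetical
    representation). A gene g that is the target of m(g) relations
    contributes m(g) copies of its value; genes that are never targets,
    or that have no measured value, contribute nothing.
    """
    m = Counter(target for regulator, target in relations
                if regulator != target)          # self-loops ignored
    distribution = []
    for gene, count in m.items():
        value = log_ratios.get(gene)
        if value is None:
            continue                             # missing measurement
        distribution.extend([abs(value)] * count)
    return sorted(distribution)

# Hypothetical network: gene A is regulated by three regulators.
example_relations = [("R1", "A"), ("R2", "A"), ("R3", "A"),
                     ("R1", "B"), ("R2", "C")]
example_log_ratios = {"A": 4.0, "B": -1.0, "C": None, "D": 0.5}

distribution = sampling_distribution(example_relations, example_log_ratios)
```

Here the promiscuously regulated gene A appears three times in the reference distribution, so its large value is less surprising when it shows up among the targets of any single regulator.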

Find Similar Pathways/Groups. 

Evaluation of statistical significance of the overlap between pathways/groups.

Please note that the Find relevant groups/networks algorithms are also available in the Pathway Studio Desktop and Workgroup editions under the name “Find groups/Find pathways”.
For the purposes of the statistical significance evaluation, the only principal distinction between the concepts of “pathway” and “group” is that “pathway” (and thus the intersection of pathways, too) consists of distinct objects, while “group” refers to objects possessing the same “property”, or “label”. Still, any pathway/group is a set of objects, or sample, S, of size N=||S||, drawn from the pool of all available objects, P.

1. Pathways.

Before starting the calculations, one should note that the pool, P, and thus the sample, S, can consist of objects of more than one distinct kind (e.g. proteins, small molecules, etc.), in which case the significance evaluation can be performed separately for each object type. For this reason we start by considering the situation in which objects of only one type are available (e.g. proteins). Note also that treating two objects as belonging to “one type” is purely conceptual and depends on the particular question at hand. Formally, given the particular question, a pool of all available/relevant objects must be determined, from which any given object can be drawn with the same probability (assuming blind random drawing with no bias due to additional external information). It is natural, for example, to assume that under blind random drawing any protein can be drawn from the set of all proteins, and/or any small molecule can be drawn from the set of all small molecules, but it is misleading to consider drawing “a small molecule or a protein” from the pool of all proteins and molecules.

Let us now consider two pathways, S1 and S2, which consist of N1=||S1|| and N2=||S2|| objects (of the same kind), respectively, and have R objects in common. We are interested in evaluating the statistical significance of the intersection, R, between the two pathways. The null hypothesis is that there is no statistically significant association between pathways 1 and 2, so that the intersection, if any, is the result of purely random coincidence. Thus, we should consider two random independent samples of sizes N1 and N2 drawn from the pool of size Ntotal (see Fig. 1). The probability of observing R or more common objects in two such random samples is exactly the p-value we seek; the smaller the p-value, the lower the probability that the given pathway overlap could occur by random chance.

The first drawing selects N1 arbitrary objects, so the probability of selecting just any N1 objects is 1; this is the condition we start our calculations with. Given this initial condition, we can count the number of realizations that result in the situation depicted in Fig. 1. Namely, we have to draw N2-R objects from the Ntotal-N1 remaining objects that do not belong to set 1, and then R objects from the N1 objects that belong to set 1. The total number of distinct realizations of the former drawing is

\[ \binom{N_{\mathrm{total}}-N_1}{N_2-R}, \]

and that of the latter drawing is

\[ \binom{N_1}{R}. \]

The total number of distinct samples that result exactly in the distribution of object counts among the two sets shown in Figure 1 is given by the product of the two expressions above. Since any given sample may be realized with equal probability, the total probability of observing the counts we are interested in is equal to the number of realizations resulting in these counts divided by the total number of distinct samples of size N2 that can be drawn from the pool of Ntotal objects:

\[ p(R|N_1,N_2) = \frac{\binom{N_1}{R}\binom{N_{\mathrm{total}}-N_1}{N_2-R}}{\binom{N_{\mathrm{total}}}{N_2}}, \tag{1} \]

where p(R|N1,N2) is the conditional probability of observing an overlap of size R between the two random samples, provided the sizes of the samples are N1 and N2. Clearly, the resulting expression is symmetric with respect to N1 and N2. Note that the expression above is exact. As a matter of fact, upon properly defining the contingency table, one can see that the expression above represents the Fisher exact test. If the sample sizes are small compared to the total number of available objects and the intersection size is small compared to the sample sizes (N1, N2 ≪ Ntotal and R ≪ N1, N2), then the binomial (Bernoulli) model holds approximately. Indeed, introducing the probability of drawing any object belonging to set 1 from the pool, p=N1/Ntotal, the expression above can be approximated as

\[ p(R|N_1,N_2) \approx \binom{N_2}{R}\, p^{R} (1-p)^{N_2-R}. \]

The exact expression, however, is quite feasible for numerical evaluation, and it is advisable to use it at all times. Regardless of the particular formula used to evaluate p(R|N1,N2), the (one-sided) p-value quantifying the probability of observing an overlap of size R or larger merely by random chance is

\[ p\text{-value} = \sum_{k=R}^{\min(N_1,N_2)} p(k|N_1,N_2). \tag{2} \]

Finally, if objects of more than one type exist in the pathways, individual p-values for objects of each kind can be evaluated, and the overall p-value is the product of these individual object type-specific values. Note that individual p-values may provide valuable insight into the pathway function: for instance, if the protein overlap is significant while the small molecule overlap is not, this may indicate similar protein machinery processing different chemicals at different times in the cell cycle, or in response to different stimuli, i.e. potential multi-functionality of the pathway core.
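The exact overlap probability, its binomial approximation, and the one-sided p-value derived in this section can be evaluated numerically with a few lines of Python. This is a minimal sketch using only the standard library; the function names are illustrative and not part of Pathway Studio.

```python
from math import comb

def overlap_probability(r, n1, n2, n_total):
    """Exact probability that two random samples of sizes n1 and n2,
    drawn from a pool of n_total objects, share exactly r objects."""
    if r < 0 or r > min(n1, n2) or n2 - r > n_total - n1:
        return 0.0  # impossible overlap size
    return comb(n1, r) * comb(n_total - n1, n2 - r) / comb(n_total, n2)

def binomial_approximation(r, n1, n2, n_total):
    """Binomial approximation, valid when n1, n2 << n_total and
    r << n1, n2."""
    p = n1 / n_total
    return comb(n2, r) * p ** r * (1 - p) ** (n2 - r)

def overlap_p_value(r, n1, n2, n_total):
    """One-sided p-value: probability of an overlap of size r or larger."""
    return sum(overlap_probability(k, n1, n2, n_total)
               for k in range(r, min(n1, n2) + 1))

# Small hypothetical example: pool of 10 objects, pathways of 4 and 3
# objects, 2 objects in common.
p_exact = overlap_probability(2, 4, 3, 10)
p_tail = overlap_p_value(2, 4, 3, 10)
```

Because Python integers have arbitrary precision, the binomial coefficients are computed exactly before the final division, which is one reason the exact expression is practical to evaluate even for large pools.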

2. Groups.

Now let us turn to the evaluation of statistical significance of “groups”, or in other words, of counts of particular labels assigned to objects in a sample S. This problem can be actually mapped directly to the pathway overlap problem that was solved in the previous section. Indeed, consider a pool, P, of Ntotal objects, and let us assume that a subset S1 of P possesses a particular property or is assigned a particular label. More formally, we can introduce a categorical or indicator variable that takes exactly two values, with S1 being the domain, on which and only on which one of the values is taken. Within this framework, the “label” assigned to all elements of the set S1 is “pathway 1”.
The sample S2 is drawn from P, and under the null hypothesis there is no association between the fact that an object belongs to S2 and the assignment of a particular label (i.e. belonging to the set S1) to this object. To evaluate the p-value under this null hypothesis, we should calculate the probability that a random sample S2 drawn from P has an overlap of size R with the predefined set S1. This situation is again illustrated by Figure 1, with the only exception that the set S1 is “predefined” and fixed. This, however, does not make any formal difference, since in the previous section we assigned a probability of 1 to the set S1 itself (i.e. pathway 1 was selected and “fixed”, and it was the probability for the second pathway S2 to overlap with that fixed set that was calculated). It should be clear now that expression (1) from the previous section provides an exact answer to the problem of calculating the significance level of any particular label representation in a given set S2.
The parameters of expression (1) should be interpreted as follows: Ntotal is the total number of objects (e.g. proteins); N1 is the total number of objects with a particular property/label A (e.g. a particular GO annotation); N2 is the number of objects in the set under study, S2 (e.g. a pathway, a set of differentially expressed genes, etc.); R is the number of objects with property/label A in the set S2.
Finally, once the partial probabilities p(R|N1,N2) are calculated, the p-value is given by the same expression (2) as before.
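The mapping of a label-enrichment question onto the four parameters can be illustrated with Python sets. This is a hedged sketch: the function name, the dictionary keys, and the example identifiers are hypothetical, and objects outside the pool are simply ignored, which is an assumption of this example.

```python
from math import comb

def label_enrichment(pool, labeled, sample):
    """Derive Ntotal, N1, N2 and R from three collections of object ids
    and return them with the one-sided p-value for the label count."""
    pool = set(pool)
    labeled = set(labeled) & pool    # S1: objects carrying label A
    sample = set(sample) & pool      # S2: the set under study
    n_total, n1, n2 = len(pool), len(labeled), len(sample)
    r = len(labeled & sample)        # labeled objects found in S2
    # Sum the exact overlap probabilities for overlaps of size r or more.
    favorable = sum(comb(n1, k) * comb(n_total - n1, n2 - k)
                    for k in range(r, min(n1, n2) + 1))
    return {"Ntotal": n_total, "N1": n1, "N2": n2, "R": r,
            "p_value": favorable / comb(n_total, n2)}

# Hypothetical pool of ten proteins, four of which carry the label.
example_pool = {"p%d" % i for i in range(1, 11)}
example_labeled = {"p1", "p2", "p3", "p4"}
example_sample = {"p1", "p2", "p9"}

result = label_enrichment(example_pool, example_labeled, example_sample)
```

In this example two of the three sampled proteins carry the label, and the resulting p-value of 1/3 shows that such a count is not unusual for a pool this small.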

Calculation of Ntotal for the Fisher exact test in Pathway Studio
“Find similar pathways/groups” calculates the total pool size as the total number of all entities of all types that belong to each of the ontologies selected for analysis. For example, if both “Ariadne pathways” and “Gene Ontology/Biological processes” are specified for analysis, Ntotal is calculated separately for each of them as the total number of proteins, small molecules, functional classes, complexes and all other entities contained in all pathways of the Ariadne pathway collection or in all groups that belong to the Biological processes branch of Gene Ontology.
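As an illustration of this pool-size rule, the sketch below counts the entities of one collection. It assumes that “total number” means the number of distinct entities across all pathways or groups of the collection, and it represents a collection as a plain dictionary; both are assumptions of this example rather than documented behavior.

```python
def pool_size(collection):
    """Count the distinct entities contained in any pathway or group of
    one collection (assumption: entities shared by several pathways are
    counted once)."""
    entities = set()
    for members in collection.values():
        entities.update(members)
    return len(entities)

# Hypothetical collection with two pathways sharing one protein.
example_collection = {
    "Pathway A": {"Protein1", "Protein2", "SmallMolecule1"},
    "Pathway B": {"Protein2", "Complex1"},
}
n_total = pool_size(example_collection)
```

Computing the pool separately for each selected collection mirrors the statement that Ntotal is calculated separately for each ontology.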
“Find sub-network enriched with selected entities” calculates the total pool size as the total number of entities in the preset specified by the filter settings in the dialog.  The following table describes the function of the preset computations in Pathway Studio.