Gene expression analysis algorithms
General overview
The algorithms described in this section analyze gene expression data in the context of biomolecular networks. Taking the differential expression between two experimental conditions as input, these algorithms find the biological networks and pathways whose proteins exhibit the most significant changes in expression level.
None of the algorithms requires a user-specified p-value or fold-change cutoff to define differentially expressed genes. Instead, the algorithms use the complete distribution of measured expression log-ratios, as described below. This makes the tests more sensitive and more robust than cutoff-based selection.
Gene Set Enrichment analysis using pathway collections or ontologies in the database.
There are no additional algorithm-specific parameters for this algorithm.
The output page of the GSEA algorithm lists the networks sorted by p-value in ascending order, along with the total number of genes in each network and the number of those genes with measured expression values.
Sub-network enrichment analysis algorithm.
To start the SNEA algorithm, one additional parameter must be specified: the relation type(s) (a set of checkboxes becomes visible when Algorithm type is set to SNEA). The allowed relation types are PromoterBinding, Regulation, DirectRegulation, Expression, ProtModification, MolSynthesis, and MolTransport. All relations of the specified type(s) found in the database are used by the algorithm. We recommend starting the analysis with the smallest and most reliable network, that of PromoterBinding relations. If the results appear unsatisfactory, add Expression relations and then Regulation relations. The other relation types are irrelevant for the analysis of expression data but can be used for the analysis of proteomics data such as phosphoprofiles (add ProtModification), protein level or enzymatic activity measurements (add DirectRegulation, MolTransport), and metabolic profiles (add MolSynthesis and MolTransport).
Sub-network enrichment analysis sampling distribution
The FSN algorithm uses a special sampling distribution. In existing pathways, genes are grouped together based on external information and, generally speaking, regardless of whether they are related to each other. In contrast, the regulator-target relations used by the FSN algorithm are provided by the network itself. This affects the expected distribution of target expression log-ratios under the null hypothesis. The required randomization breaks all the relations (edges in the graphical network representation) and then randomly reconnects the dangling edge halves (Figure 2). This randomization approach is called stump reconnection. In other words, the regulator “forgets” all n of its real targets and randomly picks n new targets from the entire set of all targets. Importantly, during stump reconnection the regulator actually “sees” not the targets themselves but the broken relation ends it reconnects to.
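The stump reconnection step described above can be illustrated with a minimal Python sketch. It is not the product's implementation: the toy network, the function names, and the simplification of rewiring a single regulator (rather than all regulators at once) are assumptions made for illustration only.

```python
import random

def broken_target_ends(edges):
    """Return one dangling target end per relation.

    A target with m incoming relations therefore appears m times,
    which is what the regulator "sees" when it reconnects.
    """
    return [target for _regulator, target in edges]

def reconnect_regulator(edges, regulator, rng):
    """Give `regulator` as many random new targets as it had real ones.

    Simplification (assumption): only one regulator is rewired here;
    a full stump reconnection would rewire every regulator together.
    """
    n = sum(1 for reg, _target in edges if reg == regulator)
    return rng.sample(broken_target_ends(edges), n)

# Hypothetical toy network: gene "x" is regulated by A, B and C, gene "y" only by A.
edges = [("A", "x"), ("A", "y"), ("B", "x"), ("C", "x")]
rng = random.Random(0)
print(reconnect_regulator(edges, "A", rng))
```

Because "x" contributes three broken ends and "y" only one, a single reconnection lands on "x" about three times as often as on "y".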
In the FSN algorithm, this randomization process is approximated through a sampling distribution to speed up the algorithm. Because the chance of reconnecting to a target t is proportional to the total number m(t) of relations in which t is the target, the sampling distribution is built from the expression log-ratios of all genes, with the value for each gene g replicated m(g) times. The advantages of this approach are demonstrated in the figure below. Consider a target that is downstream of multiple regulators in the network. Even if this target exhibits a very high or very low log-ratio, it still adds little confidence to each individual “regulator-targets” network. Inflating the sampling distribution with the replicas helps to suppress the bias due to such promiscuously regulated genes.
Find Similar Pathways/Groups.
Evaluation of statistical significance of the overlap between pathways/groups.
Please note that the Find relevant groups/networks algorithms are also available in Pathway Studio Desktop and Workgroup editions under the name “Find groups/Find pathways”.
1. Pathways.
Before starting the calculations, note that the pool, P, and thus the sample, S, can consist of objects of more than one distinct kind (e.g. proteins, small molecules, etc.), in which case significance can be evaluated separately for each object type. For this reason, we start by considering the situation in which objects of only one type are available (e.g. proteins). Note also that treating two objects as belonging to “one type” is purely conceptual and depends on the particular question at hand. Formally, given the particular question, one must determine a pool of all available/relevant objects from which any given object can be drawn with the same probability (assuming blind random drawing with no bias due to additional external information). It is natural, for example, to assume that under blind random drawing any protein can be drawn from the set of all proteins, and/or any small molecule can be drawn from the set of all small molecules, but it is misleading to consider drawing “a small molecule or a protein” from the pool of all proteins and molecules. Let us now consider two pathways, S1 and S2, which consist of N1=||S1|| and N2=||S2|| objects (of the same kind), respectively, and have objects in common. We are interested in evaluating the statistical significance of the intersection, of size R, between the two pathways. The null hypothesis is that there is no statistically significant association between pathways 1 and 2, so that the intersection, if any, is the result of purely random coincidence. Thus, we should consider two random independent samples of sizes N1 and N2 drawn from the pool of size Ntotal (see Fig. 1). The probability of observing R or more common objects in two such random samples is exactly the p-value we seek; the smaller the p-value, the lower the probability that the given pathway overlap could occur by random chance.
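The null hypothesis above can be checked by brute force. The Python sketch below draws two independent random samples from a pool many times and counts how often their overlap reaches R. It is meant only as an illustration of the definition: the function name, the default number of trials, and the seed are assumptions, not part of Pathway Studio.

```python
import random

def simulate_overlap_p_value(r_observed, n1, n2, n_total, trials=10000, seed=0):
    """Monte Carlo estimate of P(overlap >= r_observed) under the null hypothesis."""
    rng = random.Random(seed)
    pool = range(n_total)  # objects of a single type, all equally likely to be drawn
    hits = 0
    for _ in range(trials):
        sample_1 = set(rng.sample(pool, n1))  # first random "pathway"
        sample_2 = rng.sample(pool, n2)       # second, independent random "pathway"
        overlap = sum(1 for obj in sample_2 if obj in sample_1)
        if overlap >= r_observed:
            hits += 1
    return hits / trials

# Hypothetical example: two pathways of 2 objects each, pool of 4, overlap of 2.
print(simulate_overlap_p_value(2, 2, 2, 4))
```

Such a simulation becomes impractical for very small p-values, since the number of trials must be much larger than 1/p.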
The first drawing selects N1 arbitrary objects, so the probability of selecting just any N1 objects is 1; this is simply the condition we start our calculations with. Given this initial condition, we can count the number of realizations that result in the situation depicted in Fig. 1. Namely, we have to draw N2-R objects from the Ntotal-N1 remaining objects that do not belong to set 1, and then R objects from the N1 objects that belong to set 1. The total number of distinct realizations of the former drawing is
$$\binom{N_{\mathrm{total}}-N_1}{N_2-R},$$
and that of the latter drawing is
$$\binom{N_1}{R}.$$
The total number of distinct samples that result exactly in the distribution of object counts among the two sets shown in Figure 1 is given by the product of the two expressions above. Since any given sample may be realized with equal probability, the probability of observing the counts we are interested in equals the number of realizations resulting in these counts divided by the total number of distinct samples of size N2 that can be drawn from the pool of Ntotal objects:
$$p(R|N_1,N_2) = \frac{\binom{N_1}{R}\binom{N_{\mathrm{total}}-N_1}{N_2-R}}{\binom{N_{\mathrm{total}}}{N_2}},$$
where p(R|N1,N2) is the conditional probability of observing an overlap of size R between the two random samples, given that the sizes of the samples are N1 and N2. The resulting expression is symmetric with respect to N1 and N2. Note that the expression above is exact. In fact, once the contingency table is properly defined, one can see that the expression above represents the Fisher exact test. If the sample sizes are small compared to the total number of available objects and the intersection size is small compared to the sample sizes, $N_1, N_2 \ll N_{\mathrm{total}}$ and $R \ll N_1, N_2$, then the binomial (Bernoulli) model holds approximately. Indeed, introducing the probability of drawing any object belonging to set 1 from the pool, p=N1/Ntotal, the expression above can be approximated as
$$p(R|N_1,N_2) \approx \binom{N_2}{R}\, p^{R} (1-p)^{N_2-R}.$$
The exact expression, however, is quite feasible for numerical evaluation, and it is advisable to use it at all times.
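The counting argument above translates directly into code. The Python sketch below evaluates the exact probability as the number of favourable realizations divided by the number of all samples of size N2, alongside a binomial approximation with p=N1/Ntotal and N2 trials, and sums the exact probabilities to obtain the one-sided p-value for an overlap of R or more. The function names are illustrative assumptions; the formulas follow the standard hypergeometric (Fisher exact) form implied by the text.

```python
from math import comb

def overlap_probability(r, n1, n2, n_total):
    """Exact probability of an overlap of exactly r objects.

    C(n1, r) * C(n_total - n1, n2 - r) / C(n_total, n2)
    """
    if r < 0 or r > min(n1, n2):
        return 0.0
    return comb(n1, r) * comb(n_total - n1, n2 - r) / comb(n_total, n2)

def binomial_approximation(r, n1, n2, n_total):
    """Approximation for small samples: n2 independent draws, each hitting set 1 with p = n1/n_total."""
    p = n1 / n_total
    return comb(n2, r) * p**r * (1 - p) ** (n2 - r)

def overlap_p_value(r_observed, n1, n2, n_total):
    """One-sided p-value: probability of an overlap of r_observed or more."""
    return sum(
        overlap_probability(r, n1, n2, n_total)
        for r in range(r_observed, min(n1, n2) + 1)
    )

# Hypothetical example: pathways of 20 and 30 proteins, 5 shared, pool of 10000.
print(overlap_probability(5, 20, 30, 10000))
print(binomial_approximation(5, 20, 30, 10000))
print(overlap_p_value(5, 20, 30, 10000))
```

Python integers have arbitrary precision, so the binomial coefficients are computed exactly before the final division, which is one reason the exact expression is practical to evaluate.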
Regardless of the particular formula used to evaluate p(R|N1,N2), the (one-sided) p-value quantifying the probability of observing an overlap of size R or larger merely by random chance is
$$p\text{-value} = \sum_{r=R}^{\min(N_1,N_2)} p(r|N_1,N_2).$$
Finally, if objects of more than one type exist in the pathways, individual p-values for objects of each kind can be evaluated, and the overall p-value is the product of these individual object type-specific values. Note that the individual p-values may provide valuable insight into the pathway function: for instance, if the protein overlap is significant while the small molecule overlap is not, this may indicate similar protein machinery processing different chemicals at different times in the cell cycle, or in response to different stimuli, i.e. potential multi-functionality of the pathway core.
2. Groups.
Now let us turn to the evaluation of the statistical significance of “groups”, or in other words, of counts of particular labels assigned to objects in a sample S. This problem can be mapped directly to the pathway overlap problem solved in the previous section. Indeed, consider a pool, P, of Ntotal objects, and let us assume that a subset S1 of P possesses a particular property or is assigned a particular label. More formally, we can introduce a categorical or indicator variable that takes exactly two values, with S1 being the domain on which, and only on which, one of the values is taken. Within this framework, the “label” assigned to all elements of the set S1 is “pathway 1”.
Calculation of Ntotal for Fisher Exact test in Pathway Studio