Consider a differential expression (DE) experiment in which two different datasets are used to identify DE genes. The DE genes from each dataset can be viewed as subsets (A and B) of all possible genes (G), as depicted in Figure 1. We denote the number of genes in the sets A, B, and G as \(n_1\), \(n_2\), and \(N\), respectively.

G represents the space of all possible genes, which can mean different things depending on the context. For example, not every gene will be detected in an RNA-Seq experiment, so you could argue that \(N\) should be the number of genes detected rather than the number of genes in the whole genome. Without a priori knowledge like this, \(N\) should be set to the number of genes in the genome; for genome-wide experiments this is usually a defensible choice.

Figure 1

If you threw a dart at Figure 1 (and were guaranteed to hit it), the probability of hitting the area of A is \(p_A = n_1/N\), and likewise the probability of hitting B is \(p_B = n_2/N\). If A and B are independent, the probability of hitting the intersection is \(p_{A\cap B} = p_Ap_B = (n_1/N)(n_2/N)\). The expected number of genes in the intersection is then \(n_{A\cap B} = Np_{A\cap B} = n_1n_2/N\) (the authors write \(n_0\) for the size of the intersection in the text).
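The dart argument can be turned into a one-line calculation. The following is a minimal sketch, assuming A and B are independent; the function name `expected_chance_overlap` and the example sizes are illustrative choices, not taken from the paper.

```python
def expected_chance_overlap(n1: int, n2: int, N: int) -> float:
    """Expected intersection size of two independent gene sets.

    Equivalent to N * p_A * p_B with p_A = n1 / N and p_B = n2 / N.
    The function name is hypothetical, chosen for this illustration.
    """
    if N <= 0:
        raise ValueError("N must be positive")
    if not (0 <= n1 <= N and 0 <= n2 <= N):
        raise ValueError("n1 and n2 must lie between 0 and N")
    return n1 * n2 / N


# Illustrative sizes: 3,000 and 5,000 DE genes out of 20,000 genes.
print(expected_chance_overlap(3000, 5000, 20000))  # 750.0
```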

If A and B are independent, there is still some probability they will overlap by chance. The expected size of this chance overlap grows with the size of the sets, since larger sets comprise a larger proportion of the overall set G. Consider Figure 2, where the sets, and therefore the intersection, are enlarged compared with Figure 1. Some number \(n_r\) of the intersecting genes occur by chance. The total number of observed intersecting genes is therefore composed of a “background” part \(n_r\) and a “true” part \(n_x\) (the authors write \(x\) for \(n_x\) in the text): \(n_{A\cap B} = n_r + n_x\).
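A small simulation can make the chance overlap concrete. The sketch below assumes both sets are drawn uniformly at random, and independently, from \(N\) genes; the function name, trial count, and seed are arbitrary choices made for illustration.

```python
import random


def simulate_chance_overlap(n1, n2, N, trials=200, seed=0):
    """Average overlap of two independently drawn random gene sets.

    Each trial draws n1 and n2 distinct gene indices from range(N)
    and counts how many indices appear in both draws.
    """
    rng = random.Random(seed)
    genes = range(N)
    total = 0
    for _ in range(trials):
        a = set(rng.sample(genes, n1))
        b = set(rng.sample(genes, n2))
        total += len(a & b)
    return total / trials


N = 20000
for n1, n2 in [(300, 500), (3000, 5000)]:
    simulated = simulate_chance_overlap(n1, n2, N)
    expected = n1 * n2 / N
    print(f"n1={n1}, n2={n2}: simulated={simulated:.1f}, n1*n2/N={expected:.1f}")
```

The simulated averages should land close to \(n_1n_2/N\), and the larger pair of sets should overlap far more than the smaller pair even though neither pair shares any "true" signal.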

Figure 2

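Since the expected chance overlap depends on the set sizes \(n_1\) and \(n_2\), the authors reason that they can compute \(n_x\) by considering the chance intersection among the genes that remain after the \(n_x\) “true” overlapping genes are removed from A, B, and G. The remaining sets have sizes \(n_1 - n_x\) and \(n_2 - n_x\) within \(N - n_x\) genes, so the observed overlap \(n_0\) is the “true” part plus the chance overlap of the remainder: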

\[ n_0 = n_x + \frac{(n_1 - n_x)(n_2 - n_x)}{N-n_x} \]

Since \(n_0\), \(n_1\), \(n_2\), and \(N\) are known, we can solve this equation for \(n_x\) to find the background-corrected number of “true” overlapping genes.

\[ n_x = \frac{Nn_0-n_1n_2}{n_0+N-n_1-n_2} \]
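For readers who want the intermediate algebra, the step from the implicit equation to the closed form can be sketched as follows. Subtract \(n_x\) from both sides of the implicit equation, multiply through by \(N - n_x\), expand, and note that the \(n_x^2\) terms cancel:

\[ \begin{aligned} (n_0 - n_x)(N - n_x) &= (n_1 - n_x)(n_2 - n_x) \\ Nn_0 - n_0n_x - Nn_x + n_x^2 &= n_1n_2 - n_1n_x - n_2n_x + n_x^2 \\ Nn_0 - n_1n_2 &= n_x(n_0 + N - n_1 - n_2) \end{aligned} \]

Dividing both sides by \(n_0 + N - n_1 - n_2\) gives the expression for \(n_x\).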

In other words, the number of genes we expect to overlap by chance depends upon the number of total genes, the size of the observed overlap, and the size of each set.
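The closed form is easy to check numerically. The following is a minimal sketch with hypothetical function names; it computes \(n_x\) and then substitutes it back into the implicit equation to confirm that the observed overlap \(n_0\) is recovered. The example sizes are illustrative only.

```python
def true_overlap(n0, n1, n2, N):
    """Background-corrected overlap n_x from the closed-form expression."""
    denom = n0 + N - n1 - n2
    if denom == 0:
        raise ZeroDivisionError("n0 + N - n1 - n2 is zero; n_x is undefined")
    return (N * n0 - n1 * n2) / denom


def observed_from_true(nx, n1, n2, N):
    """Observed overlap n_0 implied by a given n_x (the implicit equation)."""
    return nx + (n1 - nx) * (n2 - nx) / (N - nx)


n0, n1, n2, N = 1000, 3000, 5000, 20000
nx = true_overlap(n0, n1, n2, N)
print(f"n_x = {nx:.1f}")
print(f"round trip n_0 = {observed_from_true(nx, n1, n2, N):.1f}")

# When the observed overlap equals the chance expectation n1 * n2 / N,
# the corrected overlap is zero.
print(true_overlap(750, n1, n2, N))  # 0.0
```

Note that an observed overlap smaller than \(n_1n_2/N\) yields a negative \(n_x\) under this formula, so callers may want to decide how to treat that case.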

Note that the effect of background correction is subtle for small values of \(n_1\) and \(n_2\). Below are plots of observed overlap (\(n_{A\cap B}\)) versus “true” overlap (\(n_x\)) using the formula presented in the paper for different values of \(n_1\) and \(n_2\), given a constant \(N\) of 20,000 genes.

[Plots: observed overlap (x-axis) versus true intersection (y-axis) for \(n_1\) = 100, 1,000, and 10,000, each with curves for \(n_2\) = 100, 1,000, 5,000, and 10,000 and a dashed identity line.]
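The size of the correction can also be tabulated rather than plotted. The sketch below is a text-only illustration, not the notebook's plotting code: it assumes \(N\) = 20,000, evaluates the closed-form \(n_x\) at a few observed overlaps up to \(\min(n_1, n_2)\), and reports the difference \(n_0 - n_x\). The helper name `correction_table` and the choice of four evenly spaced overlaps are hypothetical.

```python
def true_overlap(n0, n1, n2, N=20000):
    """Background-corrected overlap n_x from the closed-form expression."""
    return (N * n0 - n1 * n2) / (n0 + N - n1 - n2)


def correction_table(n1, n2_values=(100, 1000, 5000, 10000), N=20000, steps=4):
    """Rows of (n1, n2, n0, n_x) at evenly spaced observed overlaps.

    The observed overlap n0 cannot exceed the smaller set, so it is
    sampled at steps points up to min(n1, n2).
    """
    rows = []
    for n2 in n2_values:
        max_overlap = min(n1, n2)
        for k in range(1, steps + 1):
            n0 = max_overlap * k // steps
            rows.append((n1, n2, n0, true_overlap(n0, n1, n2, N)))
    return rows


for n1 in (100, 1000, 10000):
    for _, n2, n0, nx in correction_table(n1):
        print(f"n1={n1:>5} n2={n2:>5} n0={n0:>5} "
              f"n_x={nx:9.1f} correction={n0 - nx:8.1f}")
```

With two sets of 100 genes the correction stays below one gene, whereas with two sets of 10,000 genes an observed overlap of 5,000 is entirely attributed to chance.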