How do I check whether two text datasets are from different distributions?

The question “are text A and text B coming from the same distribution?” is somewhat poorly defined. For example, these two questions (1, 2) can be viewed as generated from the same distribution (the distribution of all questions on StackExchange) or from different distributions (the distributions of two different StackExchange subdomains). So it’s not clear which property you want to test.

Anyway, you can pick any test statistic of your choice, approximate its distribution in the “single source” case by simulation, and calculate the p-value of your test from that simulated distribution.
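The recipe above can be sketched as a small reusable function. This is only an illustration, not the code used in the toy example: the names `permutation_p_value` and `max_freq_gap`, the single-character toy "tokens", and the add-one correction in the returned p-value are all assumptions of this sketch.

```python
import random
from collections import Counter

def permutation_p_value(tokens_a, tokens_b, statistic, n_rounds=1000, seed=0):
    """Estimate how often a random re-split gives a statistic >= the observed one."""
    rng = random.Random(seed)
    observed = statistic(tokens_a, tokens_b)
    pooled = list(tokens_a) + list(tokens_b)  # "single source": merge both samples
    split = len(tokens_a)                     # keep the original sample sizes
    hits = 0
    for _ in range(n_rounds):
        rng.shuffle(pooled)
        if statistic(pooled[:split], pooled[split:]) >= observed:
            hits += 1
    # Add-one correction (an assumption of this sketch): the estimate is never
    # exactly zero, which a finite number of simulations cannot justify.
    return (hits + 1) / (n_rounds + 1)

def max_freq_gap(a, b):
    """Toy statistic: largest absolute gap in relative frequency over all tokens."""
    ca, cb = Counter(a), Counter(b)
    return max(abs(ca[w] / len(a) - cb[w] / len(b)) for w in set(ca) | set(cb))

# Two samples with identical token frequencies, and two with nothing in common
same = permutation_p_value(list("abababab"), list("babababa"), max_freq_gap)
diff = permutation_p_value(list("aaaaaaaa"), list("bbbbbbbb"), max_freq_gap)
print(same, diff)
```

The first pair gives a p-value of 1.0 (every random split is at least as extreme as an observed gap of zero), while the second gives a very small one. Any function of two token lists can be passed as `statistic`, as long as the same function is used for the observed value and for the simulated splits.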

As a toy example, let’s take two small corpora: two random articles from English Wikipedia. I’ll do it in Python:

import requests
from bs4 import BeautifulSoup
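urls = [
    # put the URLs of the two Wikipedia articles here
]
# Name the parser explicitly; BeautifulSoup warns when it has to guess one
texts = [BeautifulSoup(requests.get(u).text, 'html.parser').find('div', {'class': 'mw-parser-output'}).text for u in urls]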

Now I use a primitive tokenizer to count individual words in the texts, and use the root mean squared difference in relative word frequencies as my test statistic. You can use any other statistic, as long as you calculate it consistently.

import re
from collections import Counter
# A token is a run of word characters other than digits, a run of digits,
# or a single punctuation character
TOKEN = re.compile(r'([^\W\d]+|\d+|[^\w\s])')
counters = [Counter(re.findall(TOKEN, t)) for t in texts]
print([sum(c.values()) for c in counters])
# [5068, 4053]: texts are of approximately the same size

def word_freq_rmse(c1, c2):
    result = 0
    vocab = set(c1.keys()).union(set(c2.keys()))
    n1, n2 = sum(c1.values()), sum(c2.values())
    n = len(vocab)
    for word in vocab:
        result += (c1[word]/n1 - c2[word]/n2)**2 / n
    return result**0.5

print(word_freq_rmse(*counters))
# 0.001178, but is this a small or large difference?
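To make the statistic easier to check by hand, here is a sketch on two made-up three-word sentences. The sentences are invented for illustration, and `word_freq_rmse` is rewritten compactly so that the block runs on its own; it is meant to compute the same quantity as the function above.

```python
import re
from collections import Counter

TOKEN = re.compile(r'([^\W\d]+|\d+|[^\w\s])')  # same pattern as above

def word_freq_rmse(c1, c2):
    # Same computation as above: RMS difference of relative word frequencies
    vocab = set(c1) | set(c2)
    n1, n2 = sum(c1.values()), sum(c2.values())
    return (sum((c1[w] / n1 - c2[w] / n2) ** 2 for w in vocab) / len(vocab)) ** 0.5

# The tokenizer splits letters, digits and punctuation into separate tokens
print(TOKEN.findall("It's 2 cats!"))  # ['It', "'", 's', '2', 'cats', '!']

a = Counter(TOKEN.findall("the cat sat"))
b = Counter(TOKEN.findall("the dog sat"))
# Vocabulary: the, cat, sat, dog. Only "cat" (+1/3) and "dog" (-1/3) differ,
# so the statistic is sqrt((1/9 + 1/9) / 4) = sqrt(1/18)
print(round(word_freq_rmse(a, b), 4))  # 0.2357
print(word_freq_rmse(a, a))            # 0.0 for identical counts
```

Note how much larger the value is on such tiny texts than on the articles: the scale of the statistic depends on text length and vocabulary size, which is why the raw number alone says little.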

For the two articles I get a value of 0.001178, but I don’t know whether that is a large difference. So I need to simulate the distribution of this test statistic under the null hypothesis, that is, when both texts are from the same distribution. To simulate it, I merge the two texts into one, shuffle the tokens, split them into two parts of the original sizes, and calculate my statistic for these two random parts. I repeat this 1000 times.

import random
tokens = [tok for t in texts for tok in re.findall(TOKEN, t)]
split = sum(counters[0].values())
distribution = []
for _ in range(1000):
    random.shuffle(tokens)  # reshuffle so that every iteration gets a new random split
    c1 = Counter(tokens[:split])
    c2 = Counter(tokens[split:])
    distribution.append(word_freq_rmse(c1, c2))

Now I can see how unusual the observed value of my test statistic is under the null hypothesis:

observed = word_freq_rmse(*counters)
p_value = sum(x >= observed for x in distribution) / len(distribution)
print(p_value)  # it is 0.0
print(observed, max(distribution), sum(distribution) / len(distribution)) # 0.0011  0.0006 0.0004

We see that when the texts are from the same distribution, my test statistic is on average 0.0004 and almost never exceeds 0.0006, so the value of 0.0011 is very unusual, and the null hypothesis that my two texts originate from the same distribution should be rejected. With 1000 simulated splits, a p-value of 0.0 should be read as “below about 0.001”, not as exactly zero.
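Since any consistently calculated statistic can be used, here is a sketch of one possible alternative: the Jensen-Shannon divergence between the two word-frequency distributions. The function name `js_divergence` is hypothetical, and this is not the statistic used for the numbers above.

```python
import math
from collections import Counter

def js_divergence(c1, c2):
    """Jensen-Shannon divergence (in bits) between two word-count Counters."""
    n1, n2 = sum(c1.values()), sum(c2.values())
    total = 0.0
    for word in set(c1) | set(c2):
        p, q = c1[word] / n1, c2[word] / n2
        m = (p + q) / 2  # frequency of the word in the 50/50 mixture
        # Terms with zero probability contribute nothing to the divergence
        if p > 0:
            total += 0.5 * p * math.log2(p / m)
        if q > 0:
            total += 0.5 * q * math.log2(q / m)
    return total

print(js_divergence(Counter("aab"), Counter("aab")))  # identical counts
print(js_divergence(Counter("aa"), Counter("bb")))    # no shared tokens
```

With base-2 logarithms the value lies between 0 (identical frequencies) and 1 (no shared words). To use it, replace `word_freq_rmse` with `js_divergence` both in the simulation loop and when computing the observed value, so that the two are compared on the same scale.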
