# Compute Jaccard similarity on a DataFrame

The easier, unweighted version of the problem can be tackled with the following steps:

1. create a pivot table with your current dataframe

```python
p = df.pivot_table(
    index='bag_number',
    columns='item',
    values='quantity',
).fillna(0)  # convert NaN (item absent from bag) to 0
```
2. follow the example in your linked question to compute the Jaccard distance with `scipy`

```python
from scipy.spatial.distance import jaccard, pdist, squareform

# astype(bool) treats any nonzero quantity as presence;
# pdist returns distances, so subtract from 1 to get similarities
m = 1 - squareform(pdist(p.astype(bool), jaccard))
sim = pd.DataFrame(m, index=p.index, columns=p.index)
```

Result:

```
bag_number         1         2         3         4         5
bag_number
1           1.000000  0.000000  0.333333  0.000000  0.500000
2           0.000000  1.000000  0.333333  0.000000  0.000000
3           0.333333  0.333333  1.000000  0.333333  0.666667
4           0.000000  0.000000  0.333333  1.000000  0.500000
5           0.500000  0.000000  0.666667  0.500000  1.000000
```
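As a sanity check, each entry of the matrix is just the plain Jaccard similarity of two boolean rows. A minimal sketch with made-up presence vectors (not the question’s data):

```python
import numpy as np
from scipy.spatial.distance import jaccard

# Hypothetical presence vectors for two bags over four items
a = np.array([True, False, True, False])
b = np.array([True, True, True, True])

# scipy's jaccard returns the *distance*: 1 - |intersection| / |union|
similarity = 1 - jaccard(a, b)  # intersection {0, 2}, union all four -> 0.5
```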

The weighted version is only slightly more complicated. `pdist` only supports a single weight vector (its `w` argument) that is applied uniformly to every comparison, so you’ll need a custom similarity (or distance) function. Per Wikipedia, the weighted Jaccard similarity of two non-negative vectors is `sum(min(x_i, y_i)) / sum(max(x_i, y_i))`, and the distance is one minus that ratio:

```python
import numpy as np

def weighted_jaccard_distance(x, y):
    # 1 - (sum of elementwise minima) / (sum of elementwise maxima)
    arr = np.array([x, y])
    return 1 - arr.min(axis=0).sum() / arr.max(axis=0).sum()
```
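A quick hand check of the function, with made-up quantity vectors (the definition is restated so the snippet runs on its own):

```python
import numpy as np

def weighted_jaccard_distance(x, y):
    arr = np.array([x, y])
    return 1 - arr.min(axis=0).sum() / arr.max(axis=0).sum()

# Elementwise minima sum to 1 + 1 + 0 = 2, maxima to 1 + 2 + 1 = 4,
# so the distance is 1 - 2/4 = 0.5 (similarity 0.5)
d = weighted_jaccard_distance([1, 2, 0], [1, 1, 1])
```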

Now you can compute the weighted similarity:

```python
sim_weighted = pd.DataFrame(
    data=1 - squareform(pdist(p, weighted_jaccard_distance)),
    index=p.index,
    columns=p.index,
)
```

Result:

```
bag_number     1         2         3         4         5
bag_number
1           1.00  0.000000  0.250000  0.000000  0.500000
2           0.00  1.000000  0.142857  0.000000  0.000000
3           0.25  0.142857  1.000000  0.111111  0.300000
4           0.00  0.000000  0.111111  1.000000  0.285714
5           0.50  0.000000  0.300000  0.285714  1.000000
```
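One useful property to note: when every quantity is 0 or 1, the elementwise minimum is the intersection and the maximum is the union, so the weighted distance collapses to the ordinary Jaccard distance. A self-contained check with made-up vectors:

```python
import numpy as np
from scipy.spatial.distance import jaccard

def weighted_jaccard_distance(x, y):
    arr = np.array([x, y])
    return 1 - arr.min(axis=0).sum() / arr.max(axis=0).sum()

x = [1, 0, 1, 1]
y = [1, 1, 0, 1]
# Both give 1 - 2/4 = 0.5 on this 0/1 data
assert abs(weighted_jaccard_distance(x, y)
           - jaccard(np.array(x, bool), np.array(y, bool))) < 1e-12
```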