I believe you need to split the values by the regex \s*-\s* (here \s* means zero or more whitespace characters) and then flatten all combinations in a list comprehension:
from itertools import combinations
L = ['-'.join(y) for x in df['DNA'].str.split(r'\s*-\s*') for y in combinations(x, 2)]
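To see what the pattern does on its own, here is a minimal sketch that uses the standard-library re module instead of pandas. The sample string is made up, with deliberately irregular spacing around the hyphens:

```python
import re
from itertools import combinations

# Made-up sample string; the spacing around the hyphens is irregular on purpose.
raw = 'xx345 - a22-c14 -d13'

# Same pattern as above: a hyphen with optional whitespace on either side.
parts = re.split(r'\s*-\s*', raw)
print(parts)

# All unordered pairs of the split values, joined back with a hyphen.
pairs = ['-'.join(p) for p in combinations(parts, 2)]
print(pairs)
```

Four values give six pairs, since combinations(x, 2) yields every pair once in the order the values appear.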
If the values within each pair need to be sorted:
L = ['-'.join(sorted(y)) for x in df['DNA'].str.split(r'\s*-\s*')
     for y in combinations(x, 2)]
Finally, pass the list to the Series constructor and call Series.value_counts:
s = pd.Series(L)
print (s)
0 xx345-b324
1 xx345-c82
2 xx345-d13
3 xx345-c14
4 b324-c82
5 b324-d13
6 b324-c14
7 c82-d13
8 c82-c14
9 d13-c14
10 xx345-a22
11 xx345-c14
12 xx345-d13
13 a22-c14
14 a22-d13
15 c14-d13
16 a34-f12
17 a34-r27
18 a34-fg98
19 a34-tr12
20 f12-r27
21 f12-fg98
22 f12-tr12
23 r27-fg98
24 r27-tr12
25 fg98-tr12
dtype: object
s1 = s.value_counts()
print (s1)
xx345-c14 2
xx345-d13 2
c14-d13 1
f12-tr12 1
xx345-a22 1
a34-fg98 1
f12-r27 1
a34-r27 1
c82-c14 1
f12-fg98 1
a22-c14 1
a34-tr12 1
a34-f12 1
b324-d13 1
r27-tr12 1
xx345-c82 1
d13-c14 1
b324-c14 1
xx345-b324 1
r27-fg98 1
fg98-tr12 1
b324-c82 1
c82-d13 1
a22-d13 1
dtype: int64
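In the counts above, c14-d13 and d13-c14 appear as separate entries because the pairs were not sorted. Here is a minimal sketch of how sorting merges them. The two-row frame and the helper name pair_counts are made up for illustration; only the column name DNA comes from the code above:

```python
from itertools import combinations

import pandas as pd

# Hypothetical two-row input; both rows contain c14 and d13, in opposite order.
df = pd.DataFrame({'DNA': ['d13-c14-b324', 'xx345 - c14 - d13']})


def pair_counts(frame, sort_pairs):
    """Count value pairs per row; optionally sort each pair first."""
    pairs = []
    for values in frame['DNA'].str.split(r'\s*-\s*'):
        for pair in combinations(values, 2):
            pairs.append('-'.join(sorted(pair) if sort_pairs else pair))
    return pd.Series(pairs).value_counts()


unsorted_counts = pair_counts(df, sort_pairs=False)
sorted_counts = pair_counts(df, sort_pairs=True)

print(unsorted_counts)
print(sorted_counts)
```

Without sorting, every pair in this sample is counted once. With sorting, c14-d13 is counted twice.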
EDIT: If some rows contain only one value, use a loop and append that value unchanged:
from itertools import combinations
L = []
for x in df['DNA'].str.split(r'\s*-\s*'):
    if len(x) > 1:
        for y in combinations(x, 2):
            L.append('-'.join(sorted(y)))
    else:
        L.append(x[0])
s = pd.Series(L)
print(s)
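As a quick check of how the loop treats a row without any separator, here is a small sketch. The sample rows and the helper name pairs_or_single are made up for illustration:

```python
from itertools import combinations

import pandas as pd


def pairs_or_single(series):
    """Hypothetical helper: sorted pairs per row, or the lone value if a row has only one."""
    out = []
    for values in series.str.split(r'\s*-\s*'):
        if len(values) > 1:
            out.extend('-'.join(sorted(pair)) for pair in combinations(values, 2))
        else:
            out.append(values[0])
    return pd.Series(out)


# Made-up input: one row with three values, one row with a single value.
df = pd.DataFrame({'DNA': ['a34 - f12 - r27', 'c82']})
s = pairs_or_single(df['DNA'])
print(s.tolist())
```

Rows with missing values (NaN) are not handled by this sketch and would need to be dropped or filled first.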