split dataframe into chunks of unique values

This solution picks out exactly one element from each subgroup for batch 1-48. Batch 49-50 are picked randomly. The numbers from As and Bs are not considered.

Logic

  1. Shuffle each subgroup by performing random permutation on the index of each subgroup.
  2. The first elements from each subgroup form the first batch, the second elements form the second batch, etc.

Code

# data
df = pd.DataFrame(
    {"group": ["A"]*150 + ["B"]*150,
     "subgroup": ["a", "b", "c", "d", "e", "f"] * 48 + ["a", "b"]*3 + ["c", "d"]*2 + ["e", "f"],
     "value": range(0, 3000, 10)}
)
df.index.name = "index"

# length parameters
subgroups = df["subgroup"].unique()  # np.array(["a", "b", "c", "d", "e", "f"])
l = len(subgroups)  # 6
m = 48  # number of regular batches
n = int(len(df) / l)  # 300/6=50

# storage
arr = np.zeros((m, l))  # for batch 1-48
ls_rest = []  # for batch 49-50

# shuffle
for i, subgroup in enumerate(subgroups):
    perm = np.random.permutation(df.index[df["subgroup"] == subgroup])
    arr[:, i] = perm[:m]  # for regular batches
    ls_rest += list(perm[m:])  # for the rest

# assign batch 1-48
df["batch"] = 0
for i in range(m):
    df.loc[arr[i,:], "batch"] = i+1

# assign batch 49-50
for i in range(n-m):
    df.loc[ls_rest[i*l:(i+1)*l], "batch"] = m+1+i

# sorting is omitted

Output

regular batches

One can see that there is indeed one element per subgroup within each batch.

print(df.sort_values(["batch", "subgroup"]).head(13))

      group subgroup  value  batch
index                             
48        A        a    480      1
13        A        b    130      1
134       A        c   1340      1
171       B        d   1710      1
262       B        e   2620      1
5         A        f     50      1
240       B        a   2400      2
291       B        b   2910      2
152       B        c   1520      2
93        A        d    930      2
136       A        e   1360      2
59        A        f    590      2
24        A        a    240      3

the rest

print(df.sort_values(["batch", "subgroup"]).tail(13))

      group subgroup  value  batch
index                             
29        A        f    290     48
120       A        a   1200     49
222       B        a   2220     49
276       B        a   2760     49
61        A        b    610     49
133       A        b   1330     49
289       B        b   2890     49
98        A        c    980     50
206       B        c   2060     50
45        A        d    450     50
295       B        d   2950     50
166       B        e   1660     50
233       B        f   2330     50

CLICK HERE to find out more related problems solutions.

Leave a Comment

Your email address will not be published.

Scroll to Top