The shape changes when preprocessing with the ColumnTransformer and predicting on the test data

I tried to create a minimal reproducible example of your problem, and I do not run into any errors myself. Can you run it on your side and check whether there are any important differences between the dataframe created here and yours?

Note that:

  • When transforming your test data, only call transform on the already fitted ColumnTransformer; do not fit it again
  • The OneHotEncoder is initialized with handle_unknown='ignore', so groups that appear only in the test data do not raise an error

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Parameters to tweak
n_categories = 10 # Number of categorical columns
groups_by_cat = [3, 10] # Number of groups each category will have, chosen randomly
                        # between these two numbers (upper bound excluded)
n_rows = 20
n_binary_cols = 10

# code
list_alpha = list('abcdefghijklmnopqrstuvwxyz')
np.random.seed(42)
groups = []

# names of the columns of the dataframe
col_names = ['X'+str(i) for i in range(n_categories + n_binary_cols)]

# first we generate randomly a set of groups that each category can have
for i in range(n_categories):
    temp_groups = []
    temp_n_groups = np.random.randint(*groups_by_cat)
    for k in range(temp_n_groups):
        group = "".join(np.random.choice(list_alpha,2, replace = True))
        temp_groups.append(group)
    groups.append(temp_groups)

# then we generate n_rows taking samples from the groups generated previously
array_categories = np.random.choice(groups[0],(n_rows,1), replace = True)
for i in range(1,n_categories):
    temp_column = np.random.choice(groups[i],(n_rows,1), replace = True)
    array_categories = np.hstack((array_categories, temp_column))
    

# we generate an array containing the binary columns
array_binaries = np.random.randint(0, 2, (n_rows, n_binary_cols))


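# we create the dataframe by concatenating the two blocks column-wise; building
# them separately keeps the binary columns numeric (np.hstack on the raw arrays
# would cast the 0/1 values to strings and make every column look categorical)
df = pd.concat([pd.DataFrame(array_categories, dtype=object),
                pd.DataFrame(array_binaries)], axis=1)
df.columns = col_names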

y = np.random.random_sample((n_rows,1))

# split
X_train, X_test, y_train, y_test = train_test_split(df, y)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

# create column transformer
cat_cols = df.select_dtypes(include="object").columns
ct = make_column_transformer((OneHotEncoder(handle_unknown='ignore'),cat_cols),
                             remainder='passthrough')

# fit transform the ColumnTransformer
X_train_transformed = ct.fit_transform(X_train)

# fit linearRegression and predict
linereg = LinearRegression()
linereg.fit(X_train_transformed,y_train)
X_test_transformed = ct.transform(X_test)

print("\nSizes of transformed arrays")
print(X_train_transformed.shape)
print(X_test_transformed.shape)

linereg.predict(X_test_transformed)
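
The effect of handle_unknown='ignore' mentioned in the notes above can be seen in isolation with a much smaller sketch. The toy arrays below are made up for illustration and are not taken from your data; the group 'c' appears only in the test array.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data (made up for illustration): 'c' appears only in the test set
train = np.array([['a'], ['b'], ['a']])
test = np.array([['b'], ['c']])

# With handle_unknown='ignore', unseen groups are encoded as all zeros
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)
encoded_test = enc.transform(test).toarray()
print(enc.categories_)  # groups learned from the training data only
print(encoded_test)     # the row for the unseen 'c' contains only zeros

# With the default handle_unknown='error', the same transform raises
strict = OneHotEncoder()
strict.fit(train)
raised = False
try:
    strict.transform(test)
except ValueError as err:
    raised = True
    print("default encoder:", err)
```

In both cases the number of output columns is fixed by the training data; the option only decides what happens to rows containing a group the encoder has never seen.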

Note that the test data is only transformed with the ColumnTransformer, never fitted:

X_test_transformed = ct.transform(X_test)

Otherwise the OneHotEncoder() will work out the necessary columns again from your test data, and these might not be exactly the same columns as for your training data (for example, if the test data does not contain some of the groups that were found in your training data). The model was fitted on the training columns, so a different number of columns at prediction time produces the shape error. It is worth reading up on the differences between fit, fit_transform and transform.
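
A minimal sketch of that mismatch, assuming two small made-up dataframes (the column names 'color' and 'flag' are hypothetical) where the test set lacks one of the training groups:

```python
import pandas as pd
from sklearn.base import clone
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Toy frames (made up for illustration): the test set has no 'green' rows
X_train = pd.DataFrame({'color': ['red', 'green', 'blue', 'red'],
                        'flag': [0, 1, 1, 0]})
X_test = pd.DataFrame({'color': ['red', 'blue'], 'flag': [1, 0]})

ct = make_column_transformer((OneHotEncoder(handle_unknown='ignore'), ['color']),
                             remainder='passthrough')

# Fit on the training data, then only transform the test data
train_shape = ct.fit_transform(X_train).shape
correct_shape = ct.transform(X_test).shape

# Fitting a fresh copy on the test data re-learns the groups from the test set
wrong_shape = clone(ct).fit_transform(X_test).shape

print("train:                 ", train_shape)
print("test, transform only:  ", correct_shape)
print("test, fitted again:    ", wrong_shape)
```

The transform-only result has the same number of columns as the training array, while the refitted copy ends up one column short because it never saw 'green'.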
