Applying Scalers using DataFrameMapper()

When building a model, I often scale numeric variables and one-hot encode categorical ones while preprocessing the data. It is a part of the process I dislike, because I have never found a consistent, elegant way to do it and to replicate it.

Recently, I ran into the class DataFrameMapper() from the sklearn_pandas package, which improves this preprocessing for me. The approach is still error-prone, as the mapping must be written manually. However, I find it simple and consistent when dealing with a few dozen independent variables.

To begin, assume we have a pandas data frame, df, of the form

print(df.head())

if_sale  clicks platform   costs  views
      0       8        a  125.48   1051
      0      19        c  126.03    951
      0       0        a  130.07   1061
      0       6        d  117.06   1009
      0      15        d  105.79    937

where if_sale is the dependent variable and the rest are independent variables. As usual, perform the train/test split:

# separate dependent and independent variables

X=df[df.columns[1:].to_list()]  # Independent
y=df[[df.columns[0]]]  # dependent

# Split dataset into training set and test set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
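Since replicating the preprocessing is the goal, it can help to make the split itself repeatable. The sketch below is only an illustration: it rebuilds the five rows printed by df.head() above and assumes a fixed random_state (the value 0 is arbitrary, not something the original script uses).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# The five rows shown by df.head() above, rebuilt for illustration only
df = pd.DataFrame({
    "if_sale": [0, 0, 0, 0, 0],
    "clicks": [8, 19, 0, 6, 15],
    "platform": ["a", "c", "a", "d", "d"],
    "costs": [125.48, 126.03, 130.07, 117.06, 105.79],
    "views": [1051, 951, 1061, 1009, 937],
})

X = df.drop(columns="if_sale")  # independent variables
y = df["if_sale"]               # dependent variable

# Assumption: random_state=0 is an arbitrary seed chosen for this sketch
split_a = train_test_split(X, y, test_size=0.3, random_state=0)
split_b = train_test_split(X, y, test_size=0.3, random_state=0)

# With the same seed, both calls select the same training rows
same_rows = split_a[0].index.equals(split_b[0].index)
print(same_rows)
```

Without random_state, each run shuffles the rows differently, so the fitted scalers and any downstream metrics change from run to run.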

Next, define which independent variables will be scaled and which will be one-hot encoded. In a list of tuples, pair each variable with the preprocessing class to apply: StandardScaler() for scaling and LabelBinarizer() for one-hot encoding (or any other appropriate transformer of your choice).

# From the independent variables, define what is going to be scaled and one-hot encoded

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelBinarizer

column_tuples = [
    (['clicks'], StandardScaler()),
    ('platform', LabelBinarizer()),  # a string selector passes a 1-D column, as LabelBinarizer expects
    (['costs'], StandardScaler()),
    (['views'], StandardScaler())
]
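To see what each transformer does to a single column before the mapper combines them, here is a minimal sketch using only scikit-learn and NumPy. The values are the clicks and platform entries from the df.head() output above; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, StandardScaler

# clicks from df.head(); StandardScaler expects a 2-D array (rows x columns)
clicks = np.array([[8.0], [19.0], [0.0], [6.0], [15.0]])
scaled = StandardScaler().fit_transform(clicks)

# platform from df.head(); LabelBinarizer expects a 1-D array of labels
platform = np.array(["a", "c", "a", "d", "d"])
binarizer = LabelBinarizer()
encoded = binarizer.fit_transform(platform)

print(scaled.ravel())      # centred on 0 with unit variance
print(binarizer.classes_)  # one output column per distinct label
print(encoded)
```

StandardScaler subtracts the column mean and divides by the (population) standard deviation, while LabelBinarizer emits one indicator column per distinct label, ordered as in classes_.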

Finally, instantiate the mapper class, fit and transform the data frames.

# Create a data frame mapper

from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper(column_tuples, df_out=True)

# Fit the mapper on X_train and transform X_train

X_train_processed = mapper.fit_transform(X_train)

# Transform X_test with the mapper fitted on X_train (no refitting)

X_test_processed = mapper.transform(X_test)

As always, take care to fit and transform X_train on itself, and only to transform (never fit) X_test using the mapper fitted on X_train. Treat test data as new data. Always.
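The rule can be checked on a single column. The sketch below uses a plain StandardScaler rather than the mapper, with the costs values from df.head() split by hand into illustrative "train" and "test" parts (that split is an assumption made for this example only).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hand-made split of the costs column from df.head(), for illustration only
costs_train = np.array([[125.48], [126.03], [130.07]])
costs_test = np.array([[117.06], [105.79]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(costs_train)  # learns mean and std from train
test_scaled = scaler.transform(costs_test)        # reuses them, learns nothing new

# The test values are scaled with the training statistics, not their own
manual = (costs_test - costs_train.mean(axis=0)) / costs_train.std(axis=0)
print(np.allclose(test_scaled, manual))
```

Calling fit_transform on the test data instead would recompute the mean and standard deviation from the test rows, leaking information about them into the preprocessing.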

For more on DataFrameMapper(), you can refer to the documentation here. The script can be found on GitHub.