Categories
categorical-data machine-learning pandas python scikit-learn

Feature preprocessing of both continuous and categorical variables (of integer type) with scikit-learn

The main goals are as follows:

  1. Apply StandardScaler to continuous variables

  2. Apply LabelEncoder and OneHotEncoder to categorical variables

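The continuous variables need to be scaled, but a couple of the categorical variables are also of integer type, so the two kinds of column cannot be told apart by dtype alone. Applying OneHotEncoder to the whole DataFrame would have undesired effects, since every distinct continuous value would be treated as its own category.

On the flip side, applying StandardScaler to the whole DataFrame would scale the integer-coded categorical variables as if they were numeric quantities, which is also not what we want.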
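
A small sketch of the scaling problem, using made-up values and the hypothetical column names `temp` and `season` (not taken from any real dataset). It only illustrates that StandardScaler has no way of knowing that an integer column holds category codes:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy values for illustration only (assumption, not real data).
df = pd.DataFrame({
    "temp": [9.8, 14.2, 22.1, 30.3],  # continuous measurement
    "season": [1, 2, 3, 4],           # integer-coded category
})

# StandardScaler centers and rescales every column it is given.
scaled = StandardScaler().fit_transform(df)

# The season codes are no longer the integers 1..4; they have been
# centered and rescaled like any other numeric column.
print(scaled[:, 1])
```

After this, the category codes are spread along a numeric axis, which implies an ordering and a distance between seasons that may not be meaningful.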

Since continuous and categorical variables are mixed in a single pandas DataFrame, what’s the recommended workflow for this kind of problem?

The best example to illustrate my point is the Kaggle Bike Sharing Demand dataset, where season and weather are integer-coded categorical variables.
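
For concreteness, here is a minimal sketch of one possible setup, assuming a scikit-learn version that provides `sklearn.compose.ColumnTransformer` (0.20 or later). The column names follow the Bike Sharing Demand data, but the rows are invented for illustration, and the split into `categorical_cols` and `continuous_cols` is my own assumption about which columns belong where:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented rows for illustration; only the column names mirror the dataset.
df = pd.DataFrame({
    "season": [1, 2, 3, 4, 1, 2],
    "weather": [1, 1, 2, 3, 2, 1],
    "temp": [9.8, 14.2, 22.1, 30.3, 11.5, 16.4],
    "humidity": [81, 75, 60, 44, 80, 70],
})

# Columns are assigned to each group by name, not by dtype, because the
# categorical columns are stored as integers too.
categorical_cols = ["season", "weather"]
continuous_cols = ["temp", "humidity"]

preprocessor = ColumnTransformer(
    transformers=[
        # Scale only the continuous columns.
        ("num", StandardScaler(), continuous_cols),
        # One-hot encode only the integer-coded categorical columns.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ]
)

X = preprocessor.fit_transform(df)
print(X.shape)
```

In this sketch OneHotEncoder is applied to the integer columns directly, so a separate LabelEncoder step does not appear. Whether that is appropriate for the real data is part of what I am unsure about.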