Feature Engineering

There are few types of columns/Variables which are simply useless for the machine learning algorithm because they do not hold any patterns with respect to the target variable

For example, an ID column. It is useless because every row has a unique id and hence it does not has any patterns in it. It is wrong to say if the ID is 1002 then the sales will be 20,000 units. However, if you pass this variable as a predictor to the machine learning algorithm, it will try to use ID values in order to predict sales and come up with some garbage logic!
The rule of Garbage in–> Garbage out applies!

Few more types of columns are also useless like row ID, phone numbers, complete address, comments, description, dates, etc. All these are not useful because each row contains a unique value and hence there will not be any patterns.

Some of these can be indirectly used by deriving certain parts from it. For example, dates cannot be used directly, but the derived month, quarter or week can be used because these may hold some patterns with respect to the target variable. This process of creating new columns from existing columns is known as Feature Engineering.

Remove those columns which do not exhibit any patterns with respect to the target variable

The thumb rule is: Remove those columns which do not exhibit any patterns with respect to the target variable. Do this when you are absolutely sure, otherwise, give the column a chance.

Leave a Reply Cancel reply