Feature engineering is an integral step in the data science process. It can be
a deciding factor for the model's performance and a measure of the analyst's
understanding of the problem.
It sits at the core of the modeling process, and the performance of any
solution depends heavily on it. The analyst's skill is most visible here, since
well-built features are what improve the overall solution.
There are many feature-engineering techniques, ranging from automated packages
to step-by-step manual approaches, and the choice depends on the nature of the
problem and its output. For regression techniques, for example, it is often
necessary to create derived and standardized/normalized features before fitting
the algorithm. For machine learning techniques such as decision trees and
random forests this is generally not needed, because these methods split on
thresholds and are not sensitive to the scale of the inputs.
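As a rough illustration of what standardizing or normalizing a feature means
before it goes into a regression, here is a minimal sketch using only the
Python standard library. The function names and the monthly_logins values are
hypothetical and are not taken from any dataset discussed in this note.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]

def min_max(values):
    """Rescale values linearly so the smallest is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]

# Hypothetical feature: logins per month for five accounts.
monthly_logins = [120, 80, 100, 60, 140]
z = standardize(monthly_logins)
scaled = min_max(monthly_logins)
```

After standardization the feature has mean 0 and standard deviation 1, so
regression coefficients become comparable across features. A tree-based model
would pick the same split points either way.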
One important technique in the feature engineering process is time-lining:
aligning the features to the period leading up to an event that has already
occurred, for a supervised learning problem.
Let me take an example to demonstrate the whole time-lining process, along with
the selection of feature summaries.
Suppose a dataset has about 1000 accounts, of which 70% are active and the
remaining 30% are inactive. The problem statement is: using the behaviour of
the inactive accounts, is it possible to predict which active accounts are
likely to become inactive, a month, two months, or so before it happens?
So, the first step is to work with the dates for each account, or set of
accounts, in both active and inactive mode:
Agreement Signed Date – Date when the customer was acquired.
Training End Date – Date when the product implementation was completed.
Usage Start Date – Date when the customer started using the product/service.
Renewal Date – Date when the customer has to renew the account for future
service/product.
Metrics can be of the following types:
1. Days between the Date of Acquisition & Last Used
2. Days between the Date of Acquisition & Training End Date
3. Days between the Date of Acquisition & Implementation
4. Days for Renewal
5. First Used
6. Last Used
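The date metrics listed above could be computed along these lines. This is only
a sketch: the account record, its field names, and the as_of date are
hypothetical, "acquisition" is assumed to mean the Agreement Signed Date, and
"Days for Renewal" is assumed to mean the days remaining until the Renewal
Date.

```python
from datetime import date

def days_between(start, end):
    """Whole days from start to end; None if either date is missing."""
    if start is None or end is None:
        return None
    return (end - start).days

# Hypothetical account record; field names are illustrative only.
account = {
    "agreement_signed": date(2023, 1, 10),  # date of acquisition
    "training_end": date(2023, 2, 9),
    "usage_start": date(2023, 2, 15),       # first used
    "last_used": date(2023, 11, 30),
    "renewal": date(2024, 1, 10),
}
as_of = date(2023, 12, 1)  # assumed date on which the metrics are computed

signed = account["agreement_signed"]
metrics = {
    "days_acquisition_to_last_used": days_between(signed, account["last_used"]),
    "days_acquisition_to_training_end": days_between(signed, account["training_end"]),
    "days_acquisition_to_first_used": days_between(signed, account["usage_start"]),
    "days_to_renewal": days_between(as_of, account["renewal"]),
}
```

Returning None for a missing date keeps the gap visible, so the analyst can
decide later how to treat accounts that never completed training or never
started using the product.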
The second step is to visualize the data for each feature across all the
accounts and see whether any insight/behavior can be established prior to the
occurrence of the event.
A set of features for a group of accounts over a period of nine months can look
like the graph below.
The third important step is to line up the usage summaries based on the
nature/data type of the feature:
1. Sum/Average – Summary of the usage over the observation period
2. Max/Min – Maximum or minimum usage over the same observation period
3. Percent Change (Recent) – Percent change in usage between the most recent
observation period and the one immediately before it
4. Percent Change (Historical) – Percent change in usage relative to the
observation period that occurred last year, before the renewal date
5. Standardization – Usage summaries have to be standardized by the number of
users or licenses bought for each account
6. Scaling – Scaling becomes an important step when implementing statistical
techniques like regression, time series, etc. Scaling may involve transforming
the variables/features with log, square, cubic, or similar functions
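The summaries above can be sketched for a single account as follows. The
function names, the usage figures, and the license count are hypothetical.
The sketch assumes one usage value per month (oldest first), and that the
historical comparison is made against the same months of the previous year.

```python
import math
from statistics import mean

def percent_change(current, previous):
    """Percent change from previous to current; None when previous is 0."""
    if previous == 0:
        return None
    return (current - previous) * 100.0 / previous

def usage_summaries(monthly_usage, licenses, same_period_last_year=None):
    """Summarise monthly usage (oldest first, at least two months)."""
    total = sum(monthly_usage)
    summary = {
        "sum": total,
        "avg": mean(monthly_usage),
        "max": max(monthly_usage),
        "min": min(monthly_usage),
        # Recent change: latest month against the month just before it.
        "pct_change_recent": percent_change(monthly_usage[-1], monthly_usage[-2]),
        # Standardization: usage per license bought for the account.
        "avg_per_license": mean(u / licenses for u in monthly_usage),
        # Scaling: log transform; log1p is used so zero usage maps to 0.
        "log_sum": math.log1p(total),
    }
    if same_period_last_year is not None:
        summary["pct_change_historical"] = percent_change(
            total, sum(same_period_last_year)
        )
    return summary

# Hypothetical account: four months of usage and 10 licenses.
usage = [200, 180, 150, 90]
last_year = [210, 190, 200, 175]
s = usage_summaries(usage, licenses=10, same_period_last_year=last_year)
```

For this made-up account, usage in the latest month is 40% below the month
before and the whole period is 20% below last year, the kind of declining
pattern one would look for in accounts that later went inactive.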
For the inactive accounts, the last used date can serve as the churn date if
the churn date is not available. Going back from the last used date over a
period appropriate to the problem can help unearth insights/behavior for each
account or for a set of accounts. This becomes a guiding factor for computing a
suitable function that can be applied to the active accounts, to predict which
of them will become inactive before the event/milestone happens.
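One way to picture the time-lining itself is to pick an anchor date per account
and count usage in fixed periods going backwards from it. The sketch below
assumes 30-day periods, a list of (date, amount) usage events, and hypothetical
field names. For active accounts it assumes the anchor is the date on which the
prediction is made.

```python
from datetime import date

def anchor_date(account, as_of):
    """Inactive: churn date if known, else last used date. Active: as_of."""
    if account["active"]:
        return as_of
    return account.get("churn_date") or account["last_used"]

def timeline_usage(usage_events, anchor, n_periods=3, period_days=30):
    """Bucket (date, amount) events into periods counted back from the anchor.

    Index 0 is the period ending at the anchor, index 1 the one before it,
    and so on. Events after the anchor or older than the window are ignored.
    """
    totals = [0] * n_periods
    for day, amount in usage_events:
        age = (anchor - day).days
        if 0 <= age < n_periods * period_days:
            totals[age // period_days] += amount
    return totals

# Hypothetical inactive account with no recorded churn date.
inactive = {"active": False, "churn_date": None, "last_used": date(2023, 6, 30)}
events = [
    (date(2023, 1, 1), 100),
    (date(2023, 4, 15), 60),
    (date(2023, 5, 20), 40),
    (date(2023, 6, 10), 10),
    (date(2023, 6, 30), 5),
]
anchor = anchor_date(inactive, as_of=date(2023, 12, 1))
totals = timeline_usage(events, anchor)
```

Because every account is measured relative to its own anchor, the periods of
different accounts line up with each other even though their calendar dates
differ, which is what lets the behaviour of inactive accounts be compared with
the recent behaviour of active ones.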
My next note will cover fitting the model with these summaries and testing the
performance of the model.

