Data Science: August 2015

Feature engineering is an integral step in the data science process. It can be a deciding factor in the performance of the model, and it is also a measure of the intelligence the analyst has applied to the problem.

It is the core of the modeling process, and the performance of any solution depends heavily on this step. The analyst's skill shows most clearly here, because this is the step that lifts the overall solution from good to better.

There are many feature-engineering techniques, from automated packages to step-by-step manual approaches, and the choice depends on the nature of the problem and its expected output. For regression techniques, for example, it is usually necessary to create derived and standardized/normalized features to fit the algorithm. For machine learning techniques such as decision trees and random forests, this is often less necessary, because tree-based methods split on thresholds and are largely insensitive to the scale of the features.

One important technique in the feature engineering process is time-lining: aligning the features to an event that occurred previously, for a supervised learning problem.

Let me take an example and walk through the whole process of time-lining, along with the selection of feature summaries.

A dataset has about 1000 accounts, of which 70% are active and the remaining 30% are inactive. The problem statement is: using the behaviour of the inactive accounts, is it possible to predict which active accounts are likely to become inactive, a month, two months or so before it happens?

So, the first step is to work with the dates for each account, or set of accounts, in active and inactive mode:

Agreement Signed Date – Date when the customer was acquired

Training End Date – Date when the product implementation was completed

Usage Start Date – Date when the customer started using the product/service

Renewal Date – Date by which the customer has to renew the account for future service/product

Metrics derived from these dates can be of the following types –

1. Days between the Date of Acquisition & Last Used

2. Days between the Date of Acquisition & Training End Date

3. Days between the Date of Acquisition & Implementation

4. Days for Renewal

5. First Used

6. Last Used
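The date-difference metrics above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`agreement_signed`, `training_end`, `usage_start`, `last_used`, `renewal`), the `as_of` scoring date and the sample values are hypothetical, and the usage start date stands in for the first-used date.

```python
from datetime import date


def days_between(start, end):
    """Whole days from start to end, or None if either date is missing."""
    if start is None or end is None:
        return None
    return (end - start).days


def date_metrics(account, as_of):
    """Derive day-count features for one account.

    `account` is a dict of dates with hypothetical keys; `as_of` is the
    (assumed) date on which the account is being scored.
    """
    signed = account["agreement_signed"]
    return {
        "days_acq_to_last_used": days_between(signed, account["last_used"]),
        "days_acq_to_training_end": days_between(signed, account["training_end"]),
        "days_acq_to_usage_start": days_between(signed, account["usage_start"]),
        "days_to_renewal": days_between(as_of, account["renewal"]),
    }


# Made-up account, for illustration only.
example_account = {
    "agreement_signed": date(2014, 1, 15),
    "training_end": date(2014, 2, 14),
    "usage_start": date(2014, 3, 1),
    "last_used": date(2015, 6, 30),
    "renewal": date(2015, 9, 15),
}
metrics = date_metrics(example_account, as_of=date(2015, 8, 20))
```

Returning `None` for a missing date keeps the gap visible, so that it can be handled explicitly later rather than being silently treated as zero days.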

The second step is to visualize the data for each feature across all the accounts and see whether any insight or behaviour can be established prior to the occurrence of the event.

[Graph: a set of features for a group of accounts over a period of nine months]

The third important step is to line up the usage summaries according to the nature and data type of each feature:

1. Sum/Average – Total or average usage over the observation period

2. Max/Min – Maximum or minimum usage over the same observation period

3. Percent Change (Recent) – Percent change in usage between the most recent observation period and the one immediately before it

4. Percent Change (Historical) – Percent change in usage relative to the same observation period a year earlier, before the renewal date

5. Standardization – Usage summaries have to be standardized by the number of users or licenses bought for each account

6. Scaling – Scaling becomes an important step when implementing statistical techniques such as regression, time series, etc. Scaling may consist of transforming the variables/features with a log, square, cube, etc.
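As a rough sketch, the summaries listed above could be computed for one account as follows. The function and parameter names (`usage_summaries`, `monthly_usage`, `licenses`, `same_period_last_year`) and the numbers are assumptions made for illustration. The usage list is taken to be ordered from oldest to most recent period, and `log1p` is just one possible choice of scaling transform.

```python
import math
from statistics import mean


def pct_change(current, previous):
    """Percent change from previous to current; None if previous is zero."""
    if previous == 0:
        return None
    return (current - previous) * 100.0 / previous


def usage_summaries(monthly_usage, licenses, same_period_last_year=None):
    """Summarize one account's usage over an observation period.

    monthly_usage: usage per period, oldest first (at least two periods).
    licenses: users or licenses bought, used for standardization.
    same_period_last_year: optional usage for the same periods a year earlier.
    """
    total = sum(monthly_usage)
    summary = {
        "sum": total,
        "average": mean(monthly_usage),
        "max": max(monthly_usage),
        "min": min(monthly_usage),
        # Most recent period versus the one immediately before it.
        "pct_change_recent": pct_change(monthly_usage[-1], monthly_usage[-2]),
        # Standardization by the number of licenses bought.
        "usage_per_license": total / licenses,
        # Log scaling; log1p also copes with zero usage.
        "log_sum": math.log1p(total),
    }
    if same_period_last_year is not None:
        summary["pct_change_historical"] = pct_change(
            total, sum(same_period_last_year)
        )
    return summary


# Made-up usage for one account with 10 licenses.
summary = usage_summaries(
    [120, 100, 80], licenses=10, same_period_last_year=[150, 150, 100]
)
```

Which of these summaries is worth keeping depends on the feature: a count-like feature suits sums and maxima, while a rate-like feature is usually better described by its average and percent change.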

For the inactive accounts, the last used date can serve as the churn date if the churn date is not available. Going back from the last used date over a period suited to the problem can help unearth insights and behaviour for each account or set of accounts. This becomes a guiding factor in computing a suitable function that can be applied to the active accounts, to predict which of them will become inactive before the event/milestone happens.
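One way to picture time-lining is to re-index each account's usage by the number of months before a reference date, so that all accounts line up on the same relative timeline. The sketch below assumes monthly usage keyed by the first day of each month. The names `months_before` and `timeline_usage`, the three-month window and the data are hypothetical. For inactive accounts the reference is the last used date, as described above; using the scoring date as the reference for active accounts is an assumption of this example.

```python
from datetime import date


def months_before(reference, usage_month):
    """Calendar months by which usage_month precedes the reference date."""
    return (reference.year - usage_month.year) * 12 + (
        reference.month - usage_month.month
    )


def timeline_usage(usage_by_month, reference, window_months):
    """Align usage to a reference date.

    Returns a list in which index 0 is the reference month and index k is
    k months earlier. Months outside the window are dropped.
    """
    aligned = [0] * window_months
    for month, value in usage_by_month.items():
        offset = months_before(reference, month)
        if 0 <= offset < window_months:
            aligned[offset] += value
    return aligned


# Inactive account: go back three months from the last used date.
churned_usage = {
    date(2014, 12, 1): 90,
    date(2015, 1, 1): 60,
    date(2015, 2, 1): 30,
    date(2015, 3, 1): 5,
}
churned_timeline = timeline_usage(
    churned_usage, reference=date(2015, 3, 10), window_months=3
)

# Active account: same window, measured back from an assumed scoring date.
active_usage = {
    date(2015, 6, 1): 40,
    date(2015, 7, 1): 42,
    date(2015, 8, 1): 45,
}
active_timeline = timeline_usage(
    active_usage, reference=date(2015, 8, 20), window_months=3
)
```

Once both groups are on the same relative timeline, the summaries from the third step can be computed over identical windows and compared directly.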

My next note will cover fitting the model with these summaries and testing its performance.

Thursday, 20 August 2015

Feature Engineering for building a Data Science Model