Thursday, 20 August 2015

Feature Engineering for building a Data Science Model

Feature engineering is an integral step in the data science process. It can be a deciding factor in the model's performance and also reflects the intelligence the analyst brings to the problem.

It sits at the core of the modeling process, and the performance of any solution depends heavily on this step. It is also where the analyst's judgement shows most clearly, because good feature engineering is what lifts a solution from adequate to excellent.

There are many feature-engineering techniques, ranging from automated packages to step-by-step manual approaches, and the choice depends on the nature of the problem and its output. For regression techniques, it is usually necessary to create derived and standardized/normalized features before fitting the algorithm. For machine learning techniques like decision trees and random forests, this is often not a necessity, because those data mining techniques are largely insensitive to feature scale by their very nature. A small sketch of this difference follows.
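To illustrate the point, here is a minimal sketch assuming scikit-learn and hypothetical feature names (not the exact setup used in this post): a regression-style model is fitted on standardized features, while a tree-based model consumes the raw values.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Tiny illustrative dataset (hypothetical feature names and values).
X = pd.DataFrame({
    "days_since_acquisition": [30, 400, 90, 720, 150, 365],
    "monthly_usage":          [120, 5, 60, 0, 80, 10],
})
y = [0, 1, 0, 1, 0, 1]  # 1 = in-active (churned), 0 = active

# Regression-style techniques usually need standardized/derived features first.
linear_model = make_pipeline(StandardScaler(), LogisticRegression())
linear_model.fit(X, y)

# Tree-based techniques split on thresholds, so raw (unscaled) features are usually fine.
tree_model = RandomForestClassifier(n_estimators=100, random_state=0)
tree_model.fit(X, y)
```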

One important technique in the feature-engineering process, for a supervised learning problem, is time-lining the features to the event that occurred previously.

Let me take an example and demonstrate the whole process of time-lining along with feature summary selection.

A dataset has about 1,000 accounts, 70% of which are in active mode and the remaining 30% in in-active mode. The problem statement is: using the behaviour of the in-active accounts, is it possible to predict which active accounts are likely to become in-active a month, two months, or more in advance?

So the first step is to work with the dates for each account, or set of accounts, in active mode and in-active mode.




Agreement Signed Date – Date when the customer was acquired.

Training End Date – Date when the product implementation was completed.

Usage Start Date – Date when the customer started using the product/service.

Renewal Date – Date when the customer has to renew the account for the future service/product.

Metrics can be of the following types (a sketch of computing them with code follows the list) –

1.     Days between the Date of Acquisition & Last Used
2.     Days between the Date of Acquisition & Training End Date
3.     Days between the Date of Acquisition & Implementation
4.     Days for Renewal
5.     First Used
6.     Last Used
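A minimal sketch of deriving these metrics with pandas, assuming one row per account and hypothetical column names (the real schema will differ):

```python
import pandas as pd

accounts = pd.DataFrame({
    "account_id":            [101, 102],
    "agreement_signed_date": pd.to_datetime(["2014-01-15", "2014-03-01"]),
    "training_end_date":     pd.to_datetime(["2014-02-20", "2014-04-10"]),
    "usage_start_date":      pd.to_datetime(["2014-03-01", "2014-04-15"]),  # implementation done
    "last_used_date":        pd.to_datetime(["2015-01-10", "2014-09-30"]),
    "renewal_date":          pd.to_datetime(["2015-01-15", "2015-03-01"]),
})

as_of = pd.Timestamp("2015-08-20")  # reference date for "days for renewal"

accounts["days_acq_to_last_used"]      = (accounts["last_used_date"] - accounts["agreement_signed_date"]).dt.days
accounts["days_acq_to_training_end"]   = (accounts["training_end_date"] - accounts["agreement_signed_date"]).dt.days
accounts["days_acq_to_implementation"] = (accounts["usage_start_date"] - accounts["agreement_signed_date"]).dt.days
accounts["days_for_renewal"]           = (accounts["renewal_date"] - as_of).dt.days
```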

The second step is to visualize the data for each feature across all the accounts and see whether any insight/behaviour can be established prior to the occurrence of the event.

A set of features for a group of accounts over a nine-month period might look like the graph below.

[Figure: monthly usage features for a group of accounts over nine months]
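One possible way to produce such a graph, as a sketch assuming a long-format table of monthly usage with hypothetical column names (pandas and matplotlib used here):

```python
import pandas as pd
import matplotlib.pyplot as plt

usage = pd.DataFrame({
    "account_id": [101] * 9 + [102] * 9,
    "month":      [str(m) for m in pd.period_range("2014-01", periods=9, freq="M")] * 2,
    "logins":     [40, 38, 35, 30, 28, 25, 20, 15, 10,    # declining usage (at risk)
                   22, 25, 24, 26, 27, 25, 28, 30, 29],   # stable usage (healthy)
})

fig, ax = plt.subplots()
for account_id, grp in usage.groupby("account_id"):
    ax.plot(grp["month"], grp["logins"], marker="o", label=f"Account {account_id}")
ax.set_xlabel("Month")
ax.set_ylabel("Logins")
ax.set_title("Monthly usage over nine months")
ax.legend()
plt.show()
```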
The third important step is to line up the usage summaries based on the nature/data type of each feature; a sketch of these summaries in code follows the list.

1.     Sum/Average – Summing or averaging the usage over the observation period
2.     Max/Min – Computing the maximum or minimum usage over the same observation period
3.     Percent Change (Recent) – Computing the percent change of usage between the most recent observation period and the one immediately before it
4.     Percent Change (Historical) – Computing the percent change relative to the usage in the corresponding observation period a year before the renewal date
5.     Standardization – Usage summaries have to be standardized by the number of users or licenses bought for each account
6.     Scaling – Scaling becomes an important step when implementing statistical techniques like regression, time series, etc. Scaling may involve transforming the variables/features with a log, square, cubic, or similar transform
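A minimal sketch of computing a few of these summaries, assuming monthly usage per account with hypothetical column names (the historical percent change would need last year's data as well, so it is omitted here):

```python
import numpy as np
import pandas as pd

monthly = pd.DataFrame({
    "account_id": [101] * 4 + [102] * 4,
    "month":      [str(m) for m in pd.period_range("2015-01", periods=4, freq="M")] * 2,
    "logins":     [40, 35, 30, 20, 10, 12, 11, 13],
    "licenses":   [10, 10, 10, 10, 5, 5, 5, 5],
})

# Standardization: usage per user/license bought for the account.
monthly["logins_per_license"] = monthly["logins"] / monthly["licenses"]

# Scaling: e.g. a log transform before a regression / time-series model.
monthly["log_logins"] = np.log1p(monthly["logins"])

grp = monthly.groupby("account_id")["logins"]
summary = pd.DataFrame({
    "logins_sum":  grp.sum(),
    "logins_mean": grp.mean(),
    "logins_max":  grp.max(),
    "logins_min":  grp.min(),
    # Percent change (recent): last observation period vs. the one before it.
    "pct_change_recent": grp.apply(lambda s: (s.iloc[-1] - s.iloc[-2]) / s.iloc[-2] * 100),
})
```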

For the in-active accounts, the last-used date can serve as the churn date if an explicit churn date is not available. Going back from the last-used date over a period appropriate to the problem helps unearth insights/behaviour for each account or for a set of accounts. This becomes a guiding factor for computing a suitable function to apply to the active accounts, so that they can be flagged as likely to go in-active before the event/milestone actually happens.
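A hedged sketch of this time-lining idea, assuming a long-format monthly-usage table with hypothetical column names: each in-active account's months are re-indexed relative to its last-used month (used as the churn-date proxy), so behaviour can be compared at one month before churn, two months before churn, and so on.

```python
import pandas as pd

usage = pd.DataFrame({
    "account_id": [101] * 6 + [102] * 6,
    "month":      list(pd.period_range("2015-01", periods=6, freq="M")) * 2,
    "logins":     [30, 28, 20, 12, 5, 0,   25, 26, 24, 10, 3, 0],
})

# Last month with any usage, used as the churn-date proxy per account.
last_used = (usage[usage["logins"] > 0]
             .groupby("account_id")["month"].max()
             .rename("last_used_month")
             .reset_index())
usage = usage.merge(last_used, on="account_id")

# Months relative to the last-used month: 0 = churn month, -1 = one month before, etc.
usage["months_to_churn"] = usage.apply(
    lambda r: (r["month"].year - r["last_used_month"].year) * 12
              + (r["month"].month - r["last_used_month"].month),
    axis=1,
)

# Average behaviour in the run-up to churn across the in-active accounts.
pattern = usage.groupby("months_to_churn")["logins"].mean()
print(pattern)
```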


My next note will cover fitting the model with these summaries and testing the performance of the model.

Saturday, 20 December 2014

Defining and Testing a Hypothesis (5/13)

I had to take a small break from my series of posts due to critical work commitments. I had an opportunity to work on some special problems pertaining to un-supervised & semi-supervised learning, and I will try to create a separate post on the learnings from those problems.

Coming back to my series - Hypothesis Testing   

The next step after knowing the summaries is to create/assume a few hypotheses based on those outputs. Most of the assumptions would be about the distribution of the data, and in the majority of cases the data is not uniformly distributed at all. Hence it becomes necessary to make a few assumptions/guesses/hunches about the data based on the problem statement and then test each assumption using statistical tests, deciding whether it holds based on a statistically significant p-value.

This enables the analyst to establish a direction to follow and prevents blind search and indiscriminate gathering of data.

A few major parametric/standard tests – Z-test, t-test, F-test, Chi-square test, etc.

A few non-parametric or distribution-free tests –
One-sample tests – Kolmogorov-Smirnov one-sample test, Randomness test, One-sample sign test, etc.

Two-sample tests – Two-sample sign test, Fisher-Irwin test, the Median test, etc.

High Level Process Map/Steps

1.     State the null hypothesis (Ho) as well as the alternate hypothesis (Ha)
2.     Specify the level of significance (alpha)
3.     Decide the correct sampling distribution
4.     Draw a few random samples and compute the appropriate test statistic for the sample data
5.     Calculate the probability (p-value) of observing the sample data under the null hypothesis
6.     Check whether that probability is equal to or smaller than the chosen significance level (using the significance table)
7.     If the p-value is smaller than the significance level, reject the null hypothesis (Ho); otherwise, do not reject it

For example, as pointed out in my earlier post, the mean age of the cars is 55.95 (about 56) months. Now the assumption to test is whether that 56 months holds good for all the categories of cars, or whether the mean period differs across car categories.

The null hypothesis (Ho) is that the mean age is the same across all car categories.

The alternate hypothesis (Ha) is that the mean age differs across car categories.

Now we test the hypothesis by taking the category-specific data and computing the z test statistic for the sample mean, which takes the form

z = (x̄ - μ) / (σ / √n)

where x̄ is the sample mean for the category, μ is the hypothesized mean (56 months), σ is the standard deviation, and n is the sample size.
Then we compare it with the critical value based on the defined level of significance to accept or reject the null hypothesis.
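A hedged sketch of that computation with illustrative numbers (assuming a large enough sample for the z distribution; scipy is one possible tool):

```python
import numpy as np
from scipy import stats

# Hypothetical ages (in months) of cars in one category.
category_ages = np.array([48, 52, 60, 45, 50, 58, 44, 49, 53, 47])

mu0  = 56                        # hypothesized mean: overall mean car age (months)
xbar = category_ages.mean()      # sample mean for this category
s    = category_ages.std(ddof=1) # sample standard deviation
n    = len(category_ages)

z = (xbar - mu0) / (s / np.sqrt(n))
p_value = 2 * stats.norm.sf(abs(z))  # two-sided p-value

alpha = 0.05  # chosen level of significance
if p_value < alpha:
    print(f"z = {z:.2f}, p = {p_value:.4f} -> reject Ho: this category's mean differs from 56 months")
else:
    print(f"z = {z:.2f}, p = {p_value:.4f} -> fail to reject Ho")
```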

Tuesday, 23 September 2014

Data Pre-processing – 60-70% of the effort (4/13)


Data pre-processing is a mandatory and important task in the process of finding a solution to a business problem using data.

It is a known fact that data in the real world is dirty. The reasons for the data being dirty may include the following (a quick sketch of detecting these issues follows the list):

a)    Incomplete Data – Data may be incomplete because attribute values are missing, or because attributes of interest are missing altogether. It might also be a case of different assumptions being made between the time the data was collected and the time it was analyzed, or the result of human/hardware/software problems. For example, a missing value for an important variable/feature/attribute in the dataset – Occupation = ""
b)   Noisy Data – Data may be noisy because of faulty data-collection instruments, human or software or hardware errors, or errors introduced while transmitting the data from one place to another. These errors mostly show up as outliers in the dataset. For example, an age of 1000 for a person in the age attribute/feature.
c)    Inconsistent Data – Data may be inconsistent because of dependency violations or because it is accumulated from various sources, which creates the possibility of duplicates. It might also be the case that data was initially stored as integers like 1, 2, 3 and, after a policy change, got converted to A, B, C.
d)   Data Filtering – It is also necessary to filter out attributes/features that are irrelevant to the current business problem. A business data warehouse may hold a large number of attributes as part of the product, but often only a limited number of features are needed for the defined business problem.
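As mentioned above, here is a minimal sketch of spotting these issues in a raw table with pandas (hypothetical column names and values):

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "occupation":  ["Engineer", "", "", "Teacher", None],  # incomplete values
    "age":         [34, 41, 41, 1000, 28],                 # noisy: 1000 is an outlier
    "grade":       ["1", "2", "2", "B", "C"],              # inconsistent coding (1,2,3 vs A,B,C)
})

# Incomplete data: blank strings and NaNs per column.
missing_counts = raw.replace("", np.nan).isna().sum()

# Noisy data: a simple sanity-range check on age.
age_outliers = raw[(raw["age"] < 0) | (raw["age"] > 120)]

# Inconsistent/duplicate records accumulated from multiple sources.
duplicate_rows = raw[raw.duplicated()]

print(missing_counts, age_outliers, duplicate_rows, sep="\n\n")
```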

The quality of the data is integral to finding a quality solution to the business problem we are dealing with. There are various ways to convert dirty data into more meaningful, higher-quality data, and below are a few generalized measures to assess it.

1.     Is the data accurate or not
2.     Is the data complete or not – data not recorded, unavailable, etc.
3.     Is the data consistent or not – some data points modified but others not, dangling references, etc.
4.     How timely is the data – frequency of data updates, whether it is a full update or only the change, whether new variables are added and updated retrospectively, etc.
5.     Is the data believable – how trustworthy the data is, what the source of the data is, whether there is any scope for data breakages, etc.
6.     Is the data interpretable – what the format of the data is, how easily the data is understood, etc.

A few major generalized tasks for answering the above questions, which are more or less applicable to any dataset (a minimal sketch of some of these tasks follows the list):

1.     Top-level understanding
2.     Shortlisting the features/variables/attributes applicable to the current problem in hand
3.     Data Cleaning
a.     Identifying and ascertaining ways to fill in the missing values present in the features/variables
b.     Smoothing of noisy data
c.      Identifying outliers and removing them from the dataset if the variable in question is not important. If the variable is very important for the problem and has many outliers, then ascertaining a way to deal with them through statistical experiments
d.     Identifying the inconsistencies present in the dataset
4.     Data Reduction
a.     Reducing the dimensions/attributes/features of the overall dataset. This can yield the same results with an optimized/simplified set of features, depending upon the problem
b.     Sampling measures
5.     Data Transformation and Data Discretization
a.     Normalization of the data
b.     Concept hierarchy generation
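A minimal sketch of a few of these tasks (hypothetical data; pandas and scikit-learn used as one possible toolset):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "age":    [34, np.nan, 41, 1000, 28, 37],
    "income": [52000, 61000, np.nan, 58000, 45000, 49000],
    "spend":  [1200, 1500, 1100, 1300, 900, 1000],
})

# 3a. Fill missing values (here, with each column's median).
df = df.fillna(df.median(numeric_only=True))

# 3c. Remove an obvious outlier when the variable permits it.
df = df[df["age"].between(0, 120)]

# 4a. Data reduction: keep two principal components for illustration.
components = PCA(n_components=2).fit_transform(df)

# 5a. Normalization of the data to the [0, 1] range.
normalized = MinMaxScaler().fit_transform(df)
```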

Once the data is turned into a quality asset, it is up to the smartness of the analyst to unearth the diamond present in it. I will cover the insights part in my next note.

Wednesday, 23 July 2014

Get the Underlying Data (3/13)

Now that the problem definition/objective is clear, it is time to look for the underlying data that could support a solution to the problem statement.

Generally, the data science community deals with two types of data:

Structured Data

Structured data is stored in a fixed/pre-defined format, saved in a record, file, or pre-defined data store. Generally, it is a combination of rows and columns saved in a fixed format in a specific table. Most business problems can be solved using the available structured data. The best part of structured data is the ease of storing, querying, and analysing it. With technological innovation, it is becoming cheaper for companies to maintain large data stores within their organisations.


Un-Structured Data

Un-structured data is information that doesn't have a pre-defined format and is not organised in a pre-defined manner. It is typically heavy on text and may consist of numbers, facts, speech notes, or videos. The data science community is working on some cool problems pertaining to text using data mining, text mining, etc., like trying to understand the sentiment around a product in the market, or detecting spam among video uploads.

Essentially, it is the problem that drives the data requirement, together with what data is available. Some problem statements don't need un-structured data, while others need only that. The choice also depends on the availability of the data, even when the problem demands more.

Availability of the data is key to finding a solution for the defined problem statement, and it is really important to consolidate all the data from varied sources in order to analyse it.


The next logical step after defining the problem statement and getting access to the data is to define a process/methodology for processing the data and learning from it.