Saturday, 20 December 2014

Define Hypothesis and Testing of Hypothesis (5/13)

I had to take a small break from my series of posts due to critical work commitments. I had an opportunity to work on some special problems pertaining to un-supervised and semi-supervised learning, and I will try to create a separate post on the learnings from those problems.

Coming back to my series - Hypothesis Testing   

The next step after knowing the summaries is to create/assume a few hypotheses based upon the outputs. Most of the assumptions would be about the distribution of the data, and in the majority of cases the data is not at all uniformly distributed. Hence, it becomes necessary to make a few assumptions/guesses/hunches about the data based upon the problem statement and then test each assumption using various statistical tests to find out whether it is correct or not, based upon a statistically significant p-value.

This enables the analyst to establish a direction to follow, and prevents blind search and indiscriminate gathering of data.

A few major Parametric/Standard Tests – Z-Test, t-Test, F-Test, Chi-square Test, etc.

A few Non-Parametric or Distribution-Free Tests –
One Sample Tests – Kolmogorov–Smirnov One Sample Test, Randomness Test, One Sample Sign Test, etc.

Two Sample Tests – Two-Sample Sign Test, Fisher-Irwin Test, The Median Test, etc.

High Level Process Map/Steps

1.     State the Null Hypothesis (Ho) as well as the Alternate Hypothesis (Ha)
2.     Specify the level of significance (α)
3.     Decide the correct sampling distribution
4.     Draw a few random samples and compute the appropriate test statistic for the sample data
5.     Calculate the probability of observing the sample data under Ho (the p-value)
6.     Check whether this p-value is equal to or smaller than the chosen significance level (from the significance table)
7.     If the p-value is smaller, reject the Null Hypothesis (Ho); otherwise, accept (fail to reject) the Null Hypothesis (Ho)
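
As an illustration of these steps, below is a minimal sketch in Python, assuming a one-sample t-test on hypothetical data; the sample values, the hypothesized mean, and α = 0.05 are purely illustrative assumptions.

# A minimal sketch of the hypothesis-testing steps above, assuming a
# one-sample t-test on hypothetical data (values and alpha chosen
# purely for illustration).
from scipy import stats

mu_0 = 56      # Step 1: Ho: population mean = 56; Ha: mean != 56
alpha = 0.05   # Step 2: level of significance

# Steps 3-4: hypothetical random sample; with an unknown population
# standard deviation, the t-distribution is the sampling distribution
sample = [52, 58, 61, 49, 55, 60, 57, 53, 59, 62]

# Step 5: test statistic and probability (p-value) of the sample data
t_stat, p_value = stats.ttest_1samp(sample, mu_0)

# Steps 6-7: compare the p-value with alpha and decide
if p_value <= alpha:
    print(f"p = {p_value:.3f} <= {alpha}: reject Ho")
else:
    print(f"p = {p_value:.3f} > {alpha}: fail to reject Ho")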

For example, as pointed out in my earlier post, the mean age of the cars is 55.95 (~56) months. Now the question would be whether that 56 months holds good for all the categories of cars, or whether different car categories have different mean ages.

The Null Hypothesis would be that the mean is the same across all the car categories.

The Alternate Hypothesis would be that the mean is different for different car categories.

Now we test the hypothesis by taking the category-specific data and computing the test statistic z for the sample mean using

z = (x̄ − μ) / (σ / √n)

where x̄ is the category's sample mean, μ is the hypothesized population mean (56 months), σ is the population standard deviation, and n is the sample size.
Then we compare it with the critical value at the defined level of significance to accept or reject the null hypothesis.
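
Continuing the car example, here is a minimal sketch of the category-wise z-test, assuming hypothetical ages, a known population standard deviation, and α = 0.05; all values are invented for illustration.

# A minimal sketch of the per-category z-test, assuming hypothetical
# car ages (in months), a known population standard deviation, and a
# two-tailed test at alpha = 0.05.
import math
from scipy.stats import norm

mu_0 = 56     # hypothesized mean age (months) from the earlier post
sigma = 12    # assumed (known) population standard deviation
alpha = 0.05
z_critical = norm.ppf(1 - alpha / 2)   # two-tailed critical value (~1.96)

# Hypothetical per-category samples of car ages in months
categories = {
    "hatchback": [48, 52, 55, 60, 50, 58],
    "sedan":     [61, 66, 59, 70, 63, 65],
    "suv":       [54, 57, 55, 58, 52, 56],
}

for name, ages in categories.items():
    n = len(ages)
    x_bar = sum(ages) / n
    z = (x_bar - mu_0) / (sigma / math.sqrt(n))
    decision = "reject Ho" if abs(z) > z_critical else "fail to reject Ho"
    print(f"{name}: mean = {x_bar:.1f}, z = {z:.2f} -> {decision}")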

Tuesday, 23 September 2014

Data Pre-processing – 60-70% of the effort (4/13)


It is indeed a mandatory and important task in the process of finding a solution to the business problem using the data.

It is a known fact that data in the real world is dirty. The reasons for data being dirty may include the following (a small detection sketch in pandas follows the list):

a)    Incomplete Data – It may be incomplete because of lacking attribute values or lacking attributes of interest. It might also be a case of different considerations made between the time when the data was collected and when it was analyzed, or it might occur due to human/hardware/software problems. For example, a missing value for an important variable/feature/attribute in the dataset – Occupation = ””
b)   Noisy Data – It may be noisy because of faulty data collection instruments, human/software/hardware errors, or errors introduced while the data is transmitted from one place to another. Mostly, these errors result in outliers in the dataset. For example, an age of 1000 for a person in an age attribute/feature.
c)    Inconsistent Data – It may be inconsistent because of dependency violations or because data is accumulated from various sources. As a result, there is a possibility of duplicates getting generated. It might also be the case that data initially stored as integers like 1, 2, 3 gets converted to A, B, C after a policy change.
d)   Data Filtering – It is also required to filter out attributes/features that are irrelevant to the current business problem. A business data warehouse would have a large number of attributes as part of the product, but it may be the case that only a limited number of features are required to work on the defined business problem.
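
To make these categories concrete, here is a minimal sketch of detecting such problems with pandas, assuming a small invented dataset; the column names and values are purely illustrative.

# A minimal sketch of detecting dirty data with pandas, using an
# invented dataset that exhibits the problems described above.
import pandas as pd

df = pd.DataFrame({
    "occupation": ["engineer", "", "teacher", "engineer"],  # incomplete: empty value
    "age":        [34, 1000, 29, 34],                       # noisy: impossible outlier
    "grade":      ["1", "2", "A", "1"],                     # inconsistent: mixed 1,2,3 and A,B,C codes
})

# Incomplete data: empty strings treated as missing values
print((df["occupation"] == "").sum(), "missing occupation values")

# Noisy data: values outside a plausible range become outlier candidates
print(df[(df["age"] < 0) | (df["age"] > 120)])

# Inconsistent data: mixed coding schemes within one column
print(df["grade"].unique())

# Duplicates that arise when data is accumulated from various sources
print(df.duplicated().sum(), "duplicate rows")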

Quality of the data is an integral part of finding a quality solution to the business problem we are dealing with. There are various ways to convert dirty data into more meaningful, higher-quality data, and below are a few generalized measures for the same.

1.     Is the data accurate or not
2.     Is the data complete or not – Data not recorded, unavailable, etc.
3.     Is the data consistent or not – A few data points modified but others not, dangling references, etc.
4.     How timely is the data – Frequency of data updates, whether it is a complete data update or only the changes, any new additions of variables and their retrospective updates, etc.
5.     Is the data believable – How trustworthy is the data, what is the source of the data, is there any scope for data breakages, etc.
6.     Is the data interpretable – What is the format of the data, how easily is the data understood, etc.

A few major generalized tasks for answering the above questions, which apply more or less to any dataset (a rough sketch of tasks 3-5 follows the list):

1.     Top level understanding
2.     Shortlisting the features/variables/attributes applicable for the current problem in hand
3.     Data Cleaning
a.     Identifying and ascertaining the ways to fill out the missing values that are present in the features/variables
b.     Smoothing of noisy data
c.      Identifying the outliers and removing them from the dataset if the variable in question is not important. If the variable is very important for the problem and there are many outliers, then ascertaining a way to deal with them through statistical experiments
d.     Identifying the inconsistencies present in the dataset
4.     Data Reduction
a.     Reducing the dimensions/attributes/features of the overall dataset. This can help achieve the same results with an optimized/simplified set of features, depending upon the problem
b.     Sampling measures
5.     Data Transformation and Data Discretization
a.     Normalization of the data
b.     Concept hierarchy generation
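
As a rough illustration of tasks 3-5, here is a minimal sketch using pandas and scikit-learn on invented columns; the imputation choices, outlier thresholds, and binning scheme are all assumptions for illustration, not prescriptions.

# A minimal sketch of cleaning, reduction, and transformation steps,
# assuming invented columns; every threshold here is illustrative only.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "age":    [34, None, 29, 41, 1000, 36],
    "income": [52000, 61000, None, 58000, 60000, 57000],
})

# 3a. Fill missing values (median/mean imputation as one simple choice)
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].mean())

# 3b/3c. Smooth noise by dropping rows with implausible outlier values
df = df[df["age"].between(0, 120)]

# 5a. Normalization: rescale numeric features to the [0, 1] range
df[["age", "income"]] = MinMaxScaler().fit_transform(df[["age", "income"]])

# 5b. Discretization as a simple concept hierarchy: bin ages into bands
df["age_band"] = pd.cut(df["age"], bins=3, labels=["young", "mid", "senior"])

print(df)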

Once the data is turned into a quality asset, it is up to the smartness of the analyst to unearth the diamond that is present in it. I will cover the insights part in my next note.

Wednesday, 23 July 2014

Get the Underlying Data (3/13)

Now that the problem definition/objective is clear, it is time to look for the underlying data to find a possible solution to the problem statement.

Generally, there are two types of data in the data science community:

Structured Data

Structured data is stored in a fixed/pre-defined format, saved in a record, a file, or a pre-defined data store. Generally, it is a combination of rows and columns saved in a fixed format in a specific table. Most business problems can be solved using the available structured data. The best part of structured data is the ease of storing, querying, and analysing it. With the advancement of technological innovation, it is now becoming cheaper for companies to maintain large data stores in their organisations.


Un-Structured Data

Un-structured data is a form of information that doesn’t have a pre-defined format and is not organised in a pre-defined manner. It is typically text-heavy, consisting of numbers, facts, speech notes, or videos. The Data Science community is working on some cool problems pertaining to text using data mining, text mining, etc., like trying to understand the sentiment about a product in the market, detecting spam among video uploads, etc.

Essentially, it is the problem that drives the requirement for the data, along with its availability. A few problem statements don’t need un-structured data, while a few need only that. Even when the problem demands certain data, the solution still depends on whether that data is available.

Availability of the data is key to finding a solution for the defined problem statement, and it is really important to consolidate all the data from varied sources in order to analyse it.


The next logical step after defining the problem statement and getting access to the data is to define a process/methodology for learning from it.

Tuesday, 15 July 2014

Defining the Business Problem – A Directive Step, esp. in Data Analytics (2/13)

An important initial step in data analytics is understanding the problem at hand and smartly defining it so that a solution can be found using the data. I will try to explain the importance and the types of problem definitions in the data analytics space.

Generally, there could be 3 types of business problems:
1.     Deriving Insights from the underlying data
2.     Predicting the future based upon the factors in the data
3.     Optimising the outcome by learning over time

Deriving Insights from the underlying data

This type of problem is limited to understanding the data and identifying the patterns/insights that inherently exist in it. In the words of the book “Competing on Analytics” by Thomas H. Davenport, this type of problem is more like “What had really happened” in the past. It is usually termed Business Intelligence because it just involves identifying the inherent nature of the data. The focus of the analyst here is really to unearth the patterns or trends that define the data over time. The outcome of this type of problem is to tell the business what had happened and which trends exist in the data that have occurred very frequently in the past.

Predicting the future based upon the factors in the data

This type of problem leverages statistical and data mining concepts together with the data to define a function/logic for the potential future outcome. This is Data Science, and it tries to address the question “What will happen”. The skills required for solving this kind of problem are a combination of knowing the science and the domain, along with logical and business acumen.

This is the next logical step after ascertaining the trends/factors/insights from the data in the first type of problem.

For example, if the overall problem is to predict churn for a post-paid or a pre-paid telecom customer, then how can it be solved, and what type of assumptions should an analyst consider beforehand to start attacking the problem? This particular problem statement raises 2 important questions, which will have an impact on the approach/methodology.

Question 1 – Is the problem at hand to identify the customer who will not recharge on the recharge date in the case of pre-paid, or who will not pay the current bill in the case of post-paid?

Question 2 – Is the problem at hand to identify the customer at risk, irrespective of his/her recharge date or bill due-date?

We first need to understand the meaning of churn – obviously, if the customer stops using the service, that is the first check-point, and if he is not using it continuously, then it is definitely a churn case. It is this statement which raises the 2 questions outlined above.

Let me explain now in detail.

‘Question 1’ is basically the prediction of customers who will not recharge or pay the bill on the respective recharge date or bill due-date. Here we try to look at the customers who would not re-charge or pay the bill, based upon various factors like usage, demographics, VAS, customer profile, etc.

‘Question 2’ is actually the first thing to identify, even before Question 1. Here we are trying to identify customers who are going to end their usage or get out of the system in the next month/quarter, irrespective of their recharge date or bill due-date in the current month. The approach/methodology to be followed for this problem will change compared to the above problem, but the underlying data/factors remain the same.

Both of them are really trying to solve the problem of churn, but the way it is dealt with in Question 1 and in Question 2 will definitely differ. Question 2 is more like trying to predict the customers who will stop using their mobiles while still remaining customers of the service provider, whereas Question 1 is trying to predict who will not re-charge or pay the bill on the due-date.
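
To make the distinction between the two questions concrete, here is a minimal sketch, assuming an invented pre-paid customer table; the column names and the usage-drop rule for “at risk” are purely illustrative assumptions, not industry definitions.

# A minimal sketch of the two churn target definitions, assuming an
# invented pre-paid customer table; the columns and the usage-drop
# heuristic are illustrative assumptions only.
import pandas as pd

df = pd.DataFrame({
    "customer_id":      [1, 2, 3, 4],
    "recharged_on_due": [True, False, True, True],
    "usage_last_month": [320, 0, 450, 40],
    "usage_prev_month": [300, 210, 430, 400],
})

# Question 1 target: customer did not recharge on the recharge due-date
df["target_q1"] = ~df["recharged_on_due"]

# Question 2 target: customer at risk irrespective of the recharge date,
# flagged here as a sharp drop in usage (an invented heuristic)
df["target_q2"] = df["usage_last_month"] < 0.5 * df["usage_prev_month"]

# Customer 4 recharged on time (not a Q1 case) but his usage collapsed,
# so he surfaces only under the Question 2 definition
print(df[["customer_id", "target_q1", "target_q2"]])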

It is a known fact in the telecom industry that even if the customer recharges or pays the bills on the due-date without actually using the services, he is treated as not a “good” customer. Revenue for a telecom service provider is a direct function of usage, and the company is really profitable only if the customer recharges continuously and uses the service as well.

With stiff competition in the market and with portability in place, service providers are actually looking to solve the above 2 problems using historical data combined with data mining techniques.

To reiterate the point: it is a thorough understanding of the business problem that leads the analyst to the right approach/methodology for solving it.

Optimising the outcome by learning overtime

This type of problem tries to address “What best can happen” in a given scenario with the data. This is also a type of data science, applying an advanced level of concepts to optimize the prediction further. It can be an algorithm which learns continuously to optimize – maximize/minimize – the solution for the business problem using the data.

The focus of the analyst in finding a solution, and the approach/methodology chosen, really depend upon the type of the problem.

My next note – Get the underlying data