Data Science: July 2014

Wednesday, 23 July 2014

Get the Underlying Data (3/13)

Now that the problem definition/objective is clear, it is time to look for the underlying data from which a possible solution to the problem statement can be found.

Generally, there are 2 types of data in the data science community:

Structured Data

Structured data is stored in a fixed, pre-defined format, whether in a record, a file or a pre-defined data store. Generally, it is a combination of rows and columns saved as a specific table. Most business problems can be solved using the available structured data. The best part of structured data is the ease of storing, querying and analysing it. With advances in technology, it is becoming cheaper for companies to maintain large data stores within the organisation.
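As a minimal sketch of why structured data is easy to store and query, the snippet below uses Python's built-in sqlite3 module. The table name, column names and values are hypothetical and only serve as an illustration.

```python
import sqlite3

# Hypothetical table and values, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers "
    "(customer_id INTEGER, plan TEXT, monthly_usage_minutes REAL)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "prepaid", 120.0), (2, "postpaid", 340.0), (3, "prepaid", 60.0)],
)

# Because the format is fixed (rows and columns), a business question
# maps directly onto a query.
rows = conn.execute(
    "SELECT plan, AVG(monthly_usage_minutes) "
    "FROM customers GROUP BY plan ORDER BY plan"
).fetchall()
avg_usage_by_plan = dict(rows)
conn.close()

print(avg_usage_by_plan)
```

The same question asked of free text or video would first need a step that extracts rows and columns from the raw material.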

Un-Structured Data

Un-structured data is information that doesn’t have a pre-defined format and is not organised in a pre-defined manner. It is typically text-heavy, consisting of numbers, facts or speech notes, and it also includes media such as videos. The data science community is working on some cool problems in this area using data mining, text mining, etc., such as understanding the sentiment towards a product in the market or detecting spam among video uploads.
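To give a feel for what working with text looks like, here is a rough sketch of a word-count sentiment score. The word lists and the review texts are hypothetical, and counting words against a tiny lexicon is only a stand-in for real text mining, which would use a curated lexicon or a trained model.

```python
import re
from collections import Counter

# Hypothetical word lists, for illustration only.
POSITIVE = {"good", "great", "love", "fast"}
NEGATIVE = {"bad", "slow", "poor", "hate"}

def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def sentiment_score(text):
    """Return (number of positive words) - (number of negative words)."""
    counts = Counter(tokenize(text))
    positive = sum(counts[word] for word in POSITIVE)
    negative = sum(counts[word] for word in NEGATIVE)
    return positive - negative

reviews = [
    "Great network, I love the fast data speed",
    "Poor coverage and slow support, bad experience",
]
scores = [sentiment_score(review) for review in reviews]
print(scores)
```

The point is that free text has to be turned into numbers (here, word counts) before it can be analysed like structured data.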

Essentially, it is the problem that drives the data requirement, together with the availability of the data. Some problem statements don’t need un-structured data, while others need nothing else. The choice also depends on availability: the problem might demand a certain kind of data that simply cannot be obtained.

Availability of data is key to finding a solution for the defined problem statement, and it is really important to consolidate the data from all the varied sources before analysing it.
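Consolidation can be sketched in a few lines. The two sources below (billing and usage) and their field names are hypothetical; the idea is simply to end up with one record per customer and to see early which customers are missing from which source.

```python
# Hypothetical records from two separate sources, keyed by customer_id.
billing = [
    {"customer_id": 1, "last_bill_amount": 30.0},
    {"customer_id": 2, "last_bill_amount": 55.0},
]
usage = [
    {"customer_id": 1, "minutes_used": 120},
    {"customer_id": 3, "minutes_used": 45},
]

def consolidate(*sources):
    """Merge records from several sources into one record per customer_id."""
    merged = {}
    for source in sources:
        for record in source:
            merged.setdefault(record["customer_id"], {}).update(record)
    return merged

customers = consolidate(billing, usage)

# Customers absent from a source simply lack that source's fields,
# which makes gaps in availability visible before any analysis starts.
missing_usage = sorted(
    cid for cid, rec in customers.items() if "minutes_used" not in rec
)
print(missing_usage)
```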

The next logical step after defining the problem statement and getting access to the data is to define a process/methodology for processing the data and learning from it.

Tuesday, 15 July 2014

Defining Business Problem – A Directive Step, Especially in Data Analytics (2/13)

An important initial step in data analytics is understanding the problem at hand and defining it smartly, so that a solution can be found using the data. I will try to explain the importance of defining the problem and the types of problem found in the data analytics space.

Generally, there could be 3 types of business problems:

1. Deriving Insights from the underlying data

2. Predicting the future based upon the factors in the data

3. Optimising the outcome by learning over time

Deriving Insights from the underlying data

This type of problem is limited to understanding the data and identifying the patterns/insights that inherently exist in it. In the terms of the book “Competing on Analytics” by Thomas H. Davenport, this type of problem is more like “What had really happened” in the past. It is usually termed Business Intelligence because it only involves identifying the inherent nature of the data. The focus of the analyst is to unearth the patterns or trends that define the data over time. The outcome is to tell the business what happened and which trends have occurred frequently in the historical data.
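A small sketch of this kind of descriptive work is shown below. The monthly recharge amounts are made up for illustration; the code only summarises what has already happened and predicts nothing.

```python
from collections import defaultdict

# Hypothetical (month, recharge amount) records.
recharges = [
    ("2014-01", 10.0), ("2014-01", 20.0),
    ("2014-02", 15.0), ("2014-02", 25.0),
    ("2014-03", 30.0), ("2014-03", 30.0),
]

# Total recharge value per month.
totals = defaultdict(float)
for month, amount in recharges:
    totals[month] += amount

# Month-over-month change: a simple historical trend.
months = sorted(totals)
changes = {
    later: totals[later] - totals[earlier]
    for earlier, later in zip(months, months[1:])
}
print(dict(totals))
print(changes)
```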

Predicting the future based upon the factors in the data

This type of problem leverages statistical and data mining concepts together with the data to define a function/logic for a potential future outcome. This is Data Science, and it tries to address the question of “What will Happen”. The skills required for solving this kind of problem are a combination of scientific and domain knowledge along with logical and business acumen.

This is the next logical step after ascertaining the trends/factors/insights in the data under the first problem definition.

For example, if the overall problem is to predict Churn for a Post-paid or a Pre-paid telecom customer, how can it be solved, and what assumptions should an analyst settle beforehand to start attacking the problem? This particular problem statement raises 2 important questions, which will have an impact on the approach/methodology.

Question 1 – Is the problem at hand to identify the customer who will not recharge on the recharge date (pre-paid) or who will not pay the current bill (post-paid)?

Question 2 – Is the problem at hand to identify the customer at risk irrespective of his/her recharge date or bill due date?

We first need to understand the meaning of Churn. Obviously, a customer who stops using the service has reached the first check-point, and one who stays inactive continuously is definitely a Churn case. It is this definition that raises the 2 questions outlined above.

Let me now explain in detail.

‘Question 1’ is basically the prediction of the customers who will not recharge or pay the bill on the respective recharge date or bill due-date. Here we look for the customers who would not recharge or pay the bill, based upon various factors like usage, demographics, VAS (value-added services), customer profile, etc.

‘Question 2’ is actually the one to address before Question 1. Here we are trying to identify customers who are going to stop using the service or leave the system in the next month/next quarter, irrespective of their recharge date or bill due-date in the current month. The approach/methodology for this problem differs from the one for Question 1, but the underlying data/factors remain the same.

Both questions try to solve the problem of Churn, but the way it is dealt with differs between them. Question 2 is more like predicting the customers who will stop using their mobile while still remaining customers of the service provider, whereas Question 1 predicts who will not recharge or pay the bill on the due-date.
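The difference between the two questions shows up most clearly in how the target label would be built from the same data. The sketch below is only an illustration: the field names, the dates and the 30-day inactivity window are assumptions, not an industry definition of Churn.

```python
from datetime import date

# Hypothetical customer record for the current billing cycle.
customer = {
    "due_date": date(2014, 7, 10),
    "payment_dates": [date(2014, 7, 9)],   # recharge or bill payment
    "usage_dates": [date(2014, 5, 28)],    # days on which the service was used
}

def missed_due_date(cust):
    """Question 1 label: no recharge/payment on or before the due date."""
    return not any(d <= cust["due_date"] for d in cust["payment_dates"])

def at_risk(cust, as_of, inactivity_days=30):
    """Question 2 label: no usage in the last `inactivity_days` days,
    irrespective of the recharge date or bill due-date."""
    if not cust["usage_dates"]:
        return True
    return (as_of - max(cust["usage_dates"])).days > inactivity_days

q1_label = missed_due_date(customer)
q2_label = at_risk(customer, as_of=date(2014, 7, 23))
print(q1_label, q2_label)
```

This customer paid on time but has not used the service for weeks, so the two definitions disagree. The underlying data is the same; only the label, and hence the methodology, changes.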

It is well known in the telecom industry that a customer who recharges or pays the bill on the due-date without using the services is still not treated as a “good” customer. Revenue for a telecom service provider is a direct function of usage, and the company is really profitable only if the customer keeps recharging and also uses the service.

With stiff competition in the market and with Portability in place, service providers are looking to solve the above 2 problems using historical data combined with data mining techniques.

To reiterate, it is a thorough understanding of the business problem that leads the analyst to the right approach/methodology for solving it.

Optimising the outcome by learning over time

This type of problem tries to address “What best can happen” given the scenario and the data. This is also a type of data science, applying more advanced concepts to optimize the prediction further. It can be an algorithm that learns continuously to optimize (maximize or minimize) the solution for the business problem using the data.
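One possible illustration of an algorithm that keeps learning in order to maximize an outcome is an epsilon-greedy bandit. This is a technique chosen here only as an example, not something the note prescribes. The offer names and acceptance rates below are hypothetical, and the outcomes are simulated.

```python
import random

def epsilon_greedy(true_rates, rounds=5000, epsilon=0.1, seed=0):
    """Repeatedly pick an offer, observe a simulated accept/reject outcome,
    and update the running estimate of each offer's acceptance rate."""
    rng = random.Random(seed)
    arms = list(true_rates)
    pulls = {arm: 0 for arm in arms}
    estimates = {arm: 0.0 for arm in arms}
    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.choice(arms)                        # explore
        else:
            arm = max(arms, key=lambda a: estimates[a])   # exploit best so far
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        pulls[arm] += 1
        # Incremental update of the mean reward for this offer.
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
    return estimates, pulls

# Hypothetical retention offers with made-up acceptance rates.
offers = {"offer_a": 0.1, "offer_b": 0.3, "offer_c": 0.6}
estimates, pulls = epsilon_greedy(offers)
best_offer = max(estimates, key=estimates.get)
print(best_offer, pulls)
```

Unlike a one-off prediction, the estimates keep improving as more outcomes arrive, which is the sense of learning over time.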

The type of problem therefore has a real impact on the approach/methodology the analyst adopts in finding a solution.

My next note – Get the underlying data

Sunday, 13 July 2014

Finding a Solution to the Business Problem with Data (1/13)

In general, it all boils down to the approach/methodology adopted for finding an acceptable solution to the business problem. This is all the more true in analytics, where choosing the approach is the most important step and depends on the underlying data, the domain and the problem at hand.

Data Approach/Methodology for a solution to a business problem

1. Define Business Problem – Is it just deriving insights, predicting a future outcome, or optimizing a particular existing outcome?

2. Get the underlying data

3. Draft the first methodology to be adopted based upon the type of business problem and the underlying data

4. Data Pre-processing – Insights are here

5. Define Hypothesis and testing of Hypothesis

6. Iterate the methodology if needed → Data Pre-processing → Hypothesis Testing

7. Decide the Machine Learning /Data Mining Algorithm based upon the problem and data

8. Define Train, Test and Validation datasets

9. Run the performance statistics

10. Iterate the data with other algorithms (1 or 2 more) and compare the performance statistics

11. Select the best possible algorithm from the various models based upon the performance statistics

12. Present back the results to the business
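Steps 8 to 11 above can be sketched with nothing more than the Python standard library. Everything in the snippet is an assumption made for illustration: the data is synthetic, the 60/20/20 split is one common choice among many, and the two candidate "algorithms" are deliberately trivial stand-ins for real machine learning models.

```python
import random

rng = random.Random(42)

# Synthetic, hypothetical data: (monthly usage in minutes, churned?) pairs.
usage = [rng.randint(0, 200) for _ in range(200)]
data = [(u, u < 50) for u in usage]

# Step 8: train / validation / test split (60 / 20 / 20).
rng.shuffle(data)
train, validation, test = data[:120], data[120:160], data[160:]

def fit_majority(rows):
    """Baseline: always predict the most common label seen in training."""
    churned = sum(label for _, label in rows)
    majority = churned * 2 > len(rows)
    return lambda u: majority

def fit_threshold(rows):
    """Predict churn when usage falls below the best training threshold."""
    def training_hits(t):
        return sum((u < t) == label for u, label in rows)
    best = max(range(0, 201, 10), key=training_hits)
    return lambda u: u < best

def accuracy(model, rows):
    """Step 9: a simple performance statistic."""
    return sum(model(u) == label for u, label in rows) / len(rows)

# Steps 10 and 11: fit each candidate, compare on validation, keep the best.
candidates = {
    "majority": fit_majority(train),
    "threshold": fit_threshold(train),
}
validation_scores = {
    name: accuracy(model, validation) for name, model in candidates.items()
}
best_name = max(validation_scores, key=validation_scores.get)

# Report the chosen model once on the untouched test set.
test_accuracy = accuracy(candidates[best_name], test)
print(best_name, validation_scores, test_accuracy)
```

Keeping the test set out of the comparison is the design choice that matters here: the validation set picks the winner, and the test set gives the figure presented back to the business.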

In continuation, I will write a series of notes describing the role and importance of each step in the process of finding a solution to the problem using data.

Defining Business Problem – My next note