Tuesday, 23 September 2014

Data Pre-processing – 60-70% of the effort (4/13)


Data pre-processing is a mandatory and important step in the process of finding a solution to a business problem using data.

It is a well-known fact that real-world data is dirty. Data may be dirty for any of the following reasons:

a)    Incomplete Data – Data may be incomplete because attribute values are missing or because attributes of interest were never captured. It can also result from different conventions applied when the data was collected versus when it was analyzed, or from human/hardware/software problems. A typical example is a missing value for an important variable/feature/attribute in the dataset – Occupation = ””
b)   Noisy Data – Data may be noisy because of faulty data collection instruments, human, software, or hardware errors, or errors introduced while transmitting the data from one place to another. These errors mostly show up as outliers in the dataset, for example an age of 1000 in an age attribute/feature.
c)    Inconsistent Data – Data may be inconsistent because of dependency violations or because it is accumulated from various sources, which can generate duplicates. It may also happen that data initially stored as integers (1, 2, 3) is later converted to codes (A, B, C) after a policy change, leaving both representations in the dataset.
d)   Data Filtering – It is also necessary to filter out attributes/features that are irrelevant to the current business problem. A business data warehouse may hold a large number of attributes as part of the product, but often only a limited number of features is needed for the defined business problem (see the sketch after this list).
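
As a minimal sketch of spotting these four issues, assuming a small hypothetical pandas DataFrame (the column names and values below are illustrative assumptions, not from any real dataset):

```python
import pandas as pd

# Hypothetical sample data illustrating the four issues above
df = pd.DataFrame({
    "Name":       ["Ann", "Bob", "Bob", "Carl"],
    "Occupation": ["Engineer", None, None, ""],  # a) incomplete: None and ""
    "Age":        [34, 1000, 1000, 29],          # b) noisy: age of 1000
    "Grade":      [1, 2, 2, "C"],                # c) inconsistent: 1, 2 mixed with "C"
    "InternalId": [101, 102, 102, 104],          # d) irrelevant to this problem
})

# a) Incomplete data: count missing or empty values per column
print(df.replace("", None).isna().sum())

# b) Noisy data: flag implausible ages
print(df[(df["Age"] < 0) | (df["Age"] > 120)])

# c) Inconsistent data: duplicate rows accumulated from multiple sources
print(df[df.duplicated(keep=False)])

# d) Data filtering: drop attributes irrelevant to the business problem
df = df.drop(columns=["InternalId"])
```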

The quality of the data is integral to the quality of the solution to the business problem we are dealing with. There are various ways to convert dirty data into more meaningful, higher-quality data, and below are a few generalized measures of data quality.

1.     Is the data accurate?
2.     Is the data complete? – Data not recorded, unavailable, etc.
3.     Is the data consistent? – Some data points modified but others not, dangling references, etc.
4.     How timely is the data? – How frequently is it updated? Is each update a full refresh or only the changes? Are newly added variables updated retrospectively?
5.     Is the data believable? – How trustworthy is it? What is its source? Is there any scope for breaks in the data?
6.     Is the data interpretable? – What is its format? How easily can it be understood? (A small profiling sketch along these lines follows.)
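
As a rough sketch of answering some of these questions programmatically, assuming a generic pandas DataFrame (the checks below are illustrative assumptions, not a standard quality framework):

```python
import pandas as pd

def quality_profile(df: pd.DataFrame) -> None:
    """Print a rough data-quality profile; the checks are illustrative only."""
    # Completeness: share of non-missing values per column (question 2)
    print("Completeness per column:\n", df.notna().mean())
    # Consistency: duplicate rows, possibly from multiple sources (question 3)
    print("Duplicate rows:", df.duplicated().sum())
    # Interpretability: column types hint at mixed or unexpected formats (question 6)
    print("Column types:\n", df.dtypes)
    # 'object' columns often hide mixed representations (e.g. 1 vs "A")
    for col in df.select_dtypes(include="object"):
        print(col, "has", df[col].nunique(), "distinct values")
```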

A few major, generalized tasks for answering the questions above, which apply to more or less any dataset:

1.     Top-level understanding of the data
2.     Shortlisting the features/variables/attributes applicable to the current problem at hand
3.     Data Cleaning
a.     Identifying the missing values present in the features/variables and ascertaining ways to fill them
b.     Smoothing noisy data
c.      Identifying outliers and removing them from the dataset if the variable in question is not important; if the variable is very important to the problem and has many outliers, ascertaining a way to deal with them through statistical experiments (see the first sketch after this list)
d.     Identifying the inconsistencies present in the dataset
4.     Data Reduction
a.     Reducing the dimensions/attributes/features of the overall dataset. Depending on the problem, this can achieve the same results with an optimized/simplified set of features
b.     Sampling measures
5.     Data Transformation and Data Discretization
a.     Normalization of the data
b.     Concept hierarchy generation (see the second sketch after this list)
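
For the data cleaning tasks (3a–3c), a minimal sketch with pandas, using a hypothetical numeric feature whose values are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric feature with gaps and a suspicious value
values = pd.Series([23.0, 25.0, np.nan, 27.0, 30.0, 1000.0, 31.0, np.nan, 28.0])

# 3a) Fill missing values, here with the median (robust to the outlier 1000)
filled = values.fillna(values.median())

# 3b) Smooth noise by binning: replace each value with the mean of its bin
bins = pd.qcut(filled, q=3, duplicates="drop")
smoothed = filled.groupby(bins, observed=True).transform("mean")

# 3c) Flag outliers with the interquartile-range (IQR) rule
q1, q3 = filled.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (filled < q1 - 1.5 * iqr) | (filled > q3 + 1.5 * iqr)
print(filled[is_outlier])  # the value 1000 is flagged
```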
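
And for data reduction and transformation (4b, 5a, 5b), another small sketch; the age feature, the sample fraction, and the band edges are all illustrative assumptions (dimensionality-reduction techniques for 4a, such as PCA, are beyond this note):

```python
import pandas as pd

age = pd.Series(range(18, 78))  # hypothetical 'age' feature

# 4b) Data reduction by simple random sampling
sample = age.sample(frac=0.2, random_state=42)

# 5a) Min-max normalization of the sampled values to the [0, 1] range
normalized = (sample - sample.min()) / (sample.max() - sample.min())

# 5b) A simple concept hierarchy: raw age -> age band (discretization)
bands = pd.cut(sample, bins=[0, 30, 60, 120], labels=["young", "middle", "senior"])
print(pd.DataFrame({"age": sample, "normalized": normalized.round(2), "band": bands}))
```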

Once the data has been turned into a quality asset, it is up to the analyst's skill to unearth the diamond hidden in it. I will cover the insights part in my next note.