It is indeed a mandatory and important task in the process of finding a solution to a business problem using data. It is a well-known fact that real-world data is dirty. The data may be dirty for the following reasons:
a) Incomplete Data – Data may be incomplete because of missing attribute values or missing attributes of interest. It might also be a case of different considerations between the time the data was collected and the time it was analyzed, or it might occur due to human, hardware, or software problems. For example, a missing value for an important variable/feature/attribute in the dataset: Occupation = ""
b) Noisy Data – Data may be noisy because of faulty data collection instruments, human, software, or hardware errors, or errors introduced while transmitting the data from one place to another. Mostly, these errors show up as outliers in the dataset. For example, a value of 1000 in the age attribute/feature of a person.
c) Inconsistent Data – Data may be inconsistent because of dependency violations or because the data is accumulated from various sources. As a result, there is a possibility of duplicates getting generated. It might also be the case that data initially stored as integers (1, 2, 3) was later converted to A, B, C after a policy change.
d) Data Filtering – It is also required to filter out attributes/features that are irrelevant to the current business problem. A business data warehouse may have a large number of attributes as part of the product, but often only a limited number of features are required to work on the defined business problem.
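The four kinds of dirty data above can be spotted with simple checks. Here is a minimal sketch on a made-up list of records – all field names and values are illustrative assumptions, not from any real dataset:

```python
# Toy dataset illustrating the four kinds of "dirty" data described above.
records = [
    {"name": "A", "age": 34,   "occupation": "Engineer", "grade": "1"},
    {"name": "B", "age": 28,   "occupation": "",         "grade": "2"},  # incomplete
    {"name": "C", "age": 1000, "occupation": "Analyst",  "grade": "B"},  # noisy + inconsistent coding
    {"name": "A", "age": 34,   "occupation": "Engineer", "grade": "1"},  # duplicate
]

# a) Incomplete: empty occupation values
incomplete = [r for r in records if r["occupation"] == ""]

# b) Noisy: implausible ages caught by a simple sanity-range check
noisy = [r for r in records if not 0 <= r["age"] <= 120]

# c) Inconsistent: grade coded sometimes as digits, sometimes as letters
codings = {"digit" if r["grade"].isdigit() else "letter" for r in records}
inconsistent = len(codings) > 1

# (also c) Duplicates generated when merging multiple sources
seen, duplicates = set(), []
for r in records:
    key = tuple(sorted(r.items()))
    if key in seen:
        duplicates.append(r)
    else:
        seen.add(key)

print(len(incomplete), len(noisy), inconsistent, len(duplicates))
```

In practice a library such as pandas would do these checks more concisely, but the logic is the same.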
Quality of the data is an integral part of finding a qualitative solution to the business problem we are dealing with. There are various ways to convert dirty data into more meaningful, qualitative data, and below are a few generalized measures for the same.
1. Is the data accurate?
2. Is the data complete? – Data not recorded, unavailable, etc.
3. Is the data consistent? – A few data points modified but others not, dangling references, etc.
4. How timely is the data? – Frequency of the data updates, whether each update is a full refresh or only the delta, whether newly added variables are updated retrospectively, etc.
5. Is the data believable? – How trustworthy the data is, what the source of the data is, whether there is any scope for data breakages, etc.
6. Is the data interpretable? – What the format of the data is, how easily the data can be understood, etc.
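The completeness and consistency questions above can be answered with a small per-field report. This is a rough sketch on hand-made records – the field names and the `None`/empty-string conventions for "not recorded" are assumptions for illustration:

```python
# Minimal per-field data-quality report: completeness (question 2)
# and type consistency (question 3) for each field.
records = [
    {"age": 34,   "city": "Pune"},
    {"age": None, "city": "Pune"},  # not recorded
    {"age": 28,   "city": ""},      # unavailable
]

def quality_report(rows):
    report = {}
    for field in rows[0]:
        values = [r[field] for r in rows]
        missing = sum(1 for v in values if v in (None, ""))
        types = {type(v).__name__ for v in values if v not in (None, "")}
        report[field] = {
            "completeness": 1 - missing / len(values),  # fraction of usable values
            "consistent_type": len(types) <= 1,         # all recorded values share a type
        }
    return report

rep = quality_report(records)
print(rep)
```

Accuracy, timeliness, and believability usually cannot be computed from the data alone; they need metadata about the source and the update process.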
A few major generalized tasks, more or less applicable to any dataset, help answer the questions above:
1. Top-level understanding
2. Shortlisting the features/variables/attributes applicable to the current problem at hand
3. Data Cleaning
   a. Identifying and ascertaining ways to fill in the missing values present in the features/variables
   b. Smoothing of noisy data
   c. Identifying the outliers and removing them from the dataset if the variable in question is not important; if the variable is very important for the problem and there are many outliers, ascertaining a way to deal with them through statistical experiments
   d. Identifying the inconsistencies present in the dataset
4. Data Reduction
   a. Reducing the dimensions/attributes/features of the overall dataset. This can yield the same results with an optimized/simplified set of features, depending upon the problem
   b. Sampling measures
5. Data Transformation and Data Discretization
   a. Normalization of the data
   b. Concept hierarchy generation
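The cleaning and transformation steps above can be sketched on a toy numeric column. Median imputation (step 3a), the 1.5×IQR outlier rule (step 3c), and min–max normalization (step 5a) are just one common choice for each step, and all the values here are made up:

```python
from statistics import median

# Toy "age" column with a missing value and an obvious outlier.
ages = [25, 30, None, 28, 1000, 27, 26]

# 3a. Fill the missing value with the median of the observed values
observed = [a for a in ages if a is not None]
fill = median(observed)
filled = [a if a is not None else fill for a in ages]

# 3c. Drop outliers using the 1.5 * IQR rule (rough quartiles by index)
s = sorted(filled)
q1, q3 = s[len(s) // 4], s[(3 * len(s)) // 4]
iqr = q3 - q1
cleaned = [a for a in filled if q1 - 1.5 * iqr <= a <= q3 + 1.5 * iqr]

# 5a. Min-max normalization of the cleaned values to [0, 1]
lo, hi = min(cleaned), max(cleaned)
normalized = [(a - lo) / (hi - lo) for a in cleaned]

print(cleaned)
print([round(v, 2) for v in normalized])
```

The median is used for imputation rather than the mean because the mean would be pulled up badly by the 1000 outlier before it is removed; which statistic to use is exactly the kind of decision step 3c calls "statistical experiments".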
Once the data is turned into a qualitative asset, it is up to the smartness of the analyst to unearth the diamond present in it. I will cover the insights part in my next note.