Predictive data science is really both an art and a science of
problem solving. We use historical data to ascertain insights,
trends, and patterns, and build a likelihood function using mathematical
and statistical techniques. This function in turn takes new values, combined
with the fitted coefficients, to predict a likely value or to classify, say, a
likely good or bad customer.
This is really a process, and I thought of sharing the
insights on it that I have learnt over the years through my data modelling
experience across various problems. The idea is to present the tasks involved
in taking up a data problem, along with the likely number of days for each;
the actual number of days will differ with the complexity of the problem and the data.
Data Deep Dive
Usually, it takes about six days for an analyst to
complete this task. It is one of the first and most important steps in the
workflow. It typically involves collating the data into the workspace,
understanding the data dictionary, analysing every variable in the
tables to ascertain data availability, the time period of availability, and
scaling, removing the variables that have data quality issues, and retaining
the variables that are related and important for the next task.
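A minimal sketch of such a per-variable audit in pandas, using a hypothetical customer table and an assumed 30% missing-value threshold for flagging quality issues (both the column names and the threshold are illustrative, not from the note):

```python
import numpy as np
import pandas as pd

# Hypothetical customer table, purely for illustration
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "balance": [1000.0, np.nan, 250.0, np.nan, 800.0],
    "segment": ["A", "B", None, "A", "B"],
})

# Simple per-variable audit: type, share of missing values, distinct values
audit = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "pct_missing": df.isna().mean().round(2),
    "n_unique": df.nunique(),
})

# Flag variables with data quality issues (here: more than 30% missing)
to_drop = audit.index[audit["pct_missing"] > 0.3].tolist()
print(audit)
print("drop candidates:", to_drop)
```

The audit table gives a quick view of availability per variable; the drop list is only a starting point for the manual review described above.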
Problem Statement & Feature Engineering
This is the next task after the data deep dive, and it is
usually complete by day 14 from the project start date. Its output forms the base
dataset and becomes the input for any experimentation on the data. One of the
major tasks of the analyst here is to understand the business problem
inside out, because the whole process of feature engineering and
technique selection depends on it. This task typically involves understanding the
distribution of the variables in the data, identifying the outliers and
missing values, estimating the correlations between the variables, treating the
outliers and missing values based upon the correlation results, testing
hypotheses with respect to the variables, agreeing upon the variable scaling,
normalizing the variables using the agreed scaling, deriving more
representative variables for summarization, and so on, all guided by the business problem.
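A short sketch of three of those steps (outlier treatment, correlation, scaling) in pandas, on synthetic data; the 1st/99th-percentile capping and min-max normalization are illustrative choices, not prescribed by the note:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.normal(50_000, 10_000, 200),
    "spend": rng.normal(2_000, 500, 200),
})
df.loc[0, "income"] = 1_000_000  # inject an outlier for demonstration

# Treat outliers by capping at the 1st/99th percentiles (winsorizing)
for col in df.columns:
    lo, hi = df[col].quantile([0.01, 0.99])
    df[col] = df[col].clip(lo, hi)

# Estimate correlations between the candidate features
corr = df.corr()

# Normalize using the agreed scaling (here: min-max to [0, 1])
scaled = (df - df.min()) / (df.max() - df.min())
```

Whether capping, imputation, or dropping is appropriate depends on what the correlation analysis and the business problem say about each variable.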
Data Modelling & Data Validation
Once the feature engineering process is complete, it is
time to fit/train the chosen algorithm on the processed
data. This task is usually complete by day 23 from the project start date.
It typically involves deciding which statistical/data-mining
technique to apply for the given data and business problem, dividing the data
into training, test, and validation datasets, training the algorithm on the
training data, and testing it to check the results. If the results are good, the
validation set is used to validate the predictions; but if the results are not
great, then re-modelling, combining techniques, and using ensemble
methods come into play to improve the accuracy.
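The split-train-test-validate loop can be sketched with scikit-learn on synthetic data; the 60/20/20 split, logistic regression as the chosen technique, and the 0.8 accuracy bar are all assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic target

# Divide the data into training / test / validation (60 / 20 / 20)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

# Train the chosen algorithm on the training data and test it
model = LogisticRegression().fit(X_train, y_train)
test_acc = accuracy_score(y_test, model.predict(X_test))

# If the test results are good, confirm on the held-out validation set;
# otherwise this is where re-modelling or ensembling would begin
if test_acc > 0.8:
    val_acc = accuracy_score(y_val, model.predict(X_val))
```

Keeping the validation set untouched until the end is what makes it a fair check that the test-set performance was not just tuning luck.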
Report/PPT
Here is the last, and a very important, task in the whole workflow.
As I mentioned at the start of this note, data science is really art and
science. The art aspect comes into play here, in smartly articulating the results
from the modelling to the business user. The impact of any model, best or worst,
can be visualized smartly to educate the user easily and to convey its importance.
It usually takes 25 days from the project start date to complete the modelling
and present the results to the user.
The whole objective of any data analytics project is to
map it to the business requirement and to be able to compute the ROI of
implementing the developed model; it is with this report that we can
conceptualize the whole process and its return for the business.
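The ROI calculation itself is simple arithmetic once benefit and cost are agreed with the business; the figures below are entirely hypothetical:

```python
# Hypothetical figures, purely illustrative
model_benefit = 120_000.0  # e.g. incremental revenue or losses avoided per year
project_cost = 80_000.0    # build + deployment cost of the model

# ROI = (benefit - cost) / cost
roi = (model_benefit - project_cost) / project_cost
print(f"ROI: {roi:.0%}")  # → 50%
```

The hard part in practice is not this formula but agreeing with the business on what counts as the model's benefit.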
Lastly, throughout the process, it is essential to
have a regular dialogue with the business & the team, to bring domain knowledge
into the model, to link it to the end business output, and to get an outside
perspective for validating the results.
