Friday, 20 December 2013

Lifecycle of the Predictive Data Modelling Project/Task

Predictive data science is both an art and a science of problem solving. We use historical data to extract insights, trends and patterns, and from them build a likelihood function using mathematical and statistical techniques. This function then takes new values, combines them with the fitted coefficients, and predicts the likely value, or classifies a customer as likely good or bad.

This is really a process, and I thought of sharing the insights I have learnt over the years through my data modelling experience on various problems. The idea is to present the tasks involved in taking up a data problem and the likely number of days for each; the actual number of days may differ depending on the complexity of the problem and the data.



Data Deep Dive
Usually, it takes an analyst around six days to complete this task. It is the first and most important step in the workflow. It typically involves collating the data into the workspace, understanding the data dictionary, analysing every variable in the tables to ascertain data availability, time period of availability and scale, removing variables with data quality issues, and shortlisting the variables that are relevant and important for the next task.
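As a minimal sketch of the data-availability check described above, the snippet below audits each variable for missing values and flags those failing a completeness threshold. The column names, the toy records and the 20% cut-off are illustrative assumptions, not part of the original workflow.

```python
# Minimal data-quality audit: for each variable, compute the share of
# missing values and flag columns that fail a completeness threshold.
# The 0.2 threshold and the toy data are illustrative assumptions.

def audit_columns(rows, threshold=0.2):
    """Return ({column: missing_rate}, [columns to drop])."""
    if not rows:
        return {}, []
    columns = rows[0].keys()
    missing = {}
    for col in columns:
        n_missing = sum(1 for r in rows if r.get(col) in (None, ""))
        missing[col] = n_missing / len(rows)
    drop = [c for c, rate in missing.items() if rate > threshold]
    return missing, drop

# 'income' is missing in half the records, so it gets flagged.
rows = [
    {"age": 34, "income": 52000},
    {"age": 41, "income": None},
    {"age": 29, "income": 61000},
    {"age": 55, "income": None},
]
rates, to_drop = audit_columns(rows)
```

In practice the same audit would also record each variable's time coverage and scale, so the shortlist for the next stage is made on evidence rather than gut feel.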

Problem Statement & Feature Engineering
This is the next task after the data deep dive, and it usually takes until day 14 from the project start date. It builds the base dataset that becomes the input for any experimentation. One of the analyst's major tasks here is to understand the business problem inside out, because the whole process of feature engineering and technique selection depends on it. The task typically involves understanding the distribution of the variables, identifying outliers and missing values, estimating correlations between the variables, treating outliers and missing values in light of the correlation results, testing hypotheses about the variables, agreeing on variable scaling, normalizing the variables using the agreed scaling, and deriving more representative summary variables, all driven by the business problem.
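Two of the feature-engineering steps mentioned above, outlier treatment and normalization, can be sketched with stdlib Python. The percentile cut-offs and the toy values are illustrative assumptions; the right treatment always depends on the business problem.

```python
# Clipping outliers to empirical percentile bounds (winsorizing) and
# min-max scaling to [0, 1]. The 5th/95th percentile cut-offs are
# illustrative assumptions, not a universal recommendation.

def clip_outliers(values, lower_pct=0.05, upper_pct=0.95):
    """Cap values at the given empirical percentiles."""
    s = sorted(values)
    lo = s[int(lower_pct * (len(s) - 1))]
    hi = s[int(upper_pct * (len(s) - 1))]
    return [min(max(v, lo), hi) for v in values]

def min_max_scale(values):
    """Normalize values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

raw = [12, 15, 14, 13, 16, 400]   # 400 is an obvious outlier
clipped = clip_outliers(raw)      # 400 is capped at the upper bound
scaled = min_max_scale(clipped)   # all features now on one scale
```

Capping rather than dropping the outlier keeps the record available for modelling, which matters when, as the text notes, the decision is taken together with the correlation and missing-value analysis.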

Data Modelling & Data Validation
Once the feature engineering process is complete, it is time to fit (train) the chosen algorithm on the processed data. This task usually takes until day 23 from the project start date. It typically involves deciding which statistical or data mining technique to apply for the given data and business problem, dividing the data into training, test and validation sets, training the algorithm on the training data and testing it to check the results. If the results are good, the validation set is used to validate the predictions; if not, it is time for re-modelling, combining techniques or using ensemble methods to improve the accuracy.
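The split-train-test loop described above can be sketched as follows. The 60/20/20 split ratio, the toy labelled data and the trivial threshold "model" standing in for a trained algorithm are all illustrative assumptions.

```python
# Dividing data into training, test and validation sets, then
# measuring accuracy on the held-out sets. The 60/20/20 ratio and
# the stand-in threshold model are illustrative assumptions.
import random

def split_data(rows, train=0.6, test=0.2, seed=42):
    """Shuffle and divide rows into training, test and validation sets."""
    rows = rows[:]                     # do not mutate the caller's list
    random.Random(seed).shuffle(rows)
    n = len(rows)
    i, j = int(n * train), int(n * (train + test))
    return rows[:i], rows[i:j], rows[j:]

def accuracy(model, rows):
    """Share of rows where the model predicts the recorded label."""
    return sum(1 for x, y in rows if model(x) == y) / len(rows)

# Toy data: the label is 1 exactly when the feature exceeds 50.
data = [(x, int(x > 50)) for x in range(100)]
train_set, test_set, valid_set = split_data(data)
model = lambda x: int(x > 50)          # stand-in for a trained algorithm
test_acc = accuracy(model, test_set)
```

If `test_acc` were poor, one would go back and re-model as the text describes; only a model that passes the test set earns a final check against the untouched validation set.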

Report/PPT
This is the last, and a very important, task in the whole workflow. As I mentioned at the start of this note, data science is both art and science, and the art comes into play here: smartly articulating the modelling results to the business user. The impact of any model, best or worst, can be visualized cleverly to educate the user easily and to convey its importance. It usually takes until day 25 from the project start date to complete the modelling and present the results to the user.
The whole objective of any data analytics project is to map it to the business requirement and to compute the ROI of implementing the developed model; it is with this report that we can summarize the whole process and its return for the business.


Lastly, throughout the process, it is essential to have a regular dialogue with the business and the team to bring domain knowledge into the model, to link the work to the end business output, and to get an outside perspective for validating the results.

Monday, 16 December 2013

Data Science or Data Scientist

Every company in the world is eyeing ways to leverage the data that has been lying with it for ages, and as a result there is a lot of demand for professionals who work on the data side of things.

Many universities and institutes across the world have started offering courses in data science to meet the industry's demand for data analytics professionals.

Here comes the tricky question of how to identify the data science requirement in an organization, and it really maps to the organization's genesis. If the organization is a product development company, the need is really for a data scientist kind of role; if it is a services or solutions company, it is more of a data science role.

Let me explain the difference between these roles. A data scientist is a person whose expertise lies more in technology (a "techie"), and whose skill set is to tackle the engineering problems around the data. A data scientist's view of a problem will generally be how to automate it, how to process the data efficiently within time and quality constraints for the next task, or how to summarize the data based on its underlying characteristics. Mostly, the primary objective of the data scientist is to create scalable algorithms for data problems that can also be turned into a commercially viable product. Most data scientists in the world are technology driven, and as major tech groups, including HBR, have pointed out, it is the sexiest job of the 21st century; I would go further and say that if the data scientist also has thorough knowledge of data science, including data mining and statistical techniques, it would be the sexiest job not just of the 21st century but of a lifetime.

The other side of the coin is the data science professional, who is not a techie but has in-depth knowledge, exposure and experience in the purely data science side of things. A data science professional's skill set is always about finding a suitable approach or methodology for a data problem or prediction using various statistical and data mining techniques. The primary objective of this professional is to design experiments for prediction or optimization: training an algorithm on the data and testing the results against the validation and test datasets. Most of the time, these professionals spend the majority of their effort on getting the problem statement right and preparing the data so that the appropriate algorithm can be chosen. Data science professionals have rich knowledge of the concepts and techniques, and they apply them with experience to arrive at a viable solution for the problem.

Many tech and product companies are trying to replace the second side of the coin with a product: given a business problem and the data, the product should handle all the steps of data pre-processing, choosing the right algorithm, validating the results, coming up with the prediction equation, and so on, for the "business user".

I feel there is a long way to go to reach that level, and it is always better to have both roles in the organization, complementing each other to achieve the best possible result.