Data Science: December 2014

I had to take a short break from this series of posts due to critical work commitments. In the meantime I had the opportunity to work on some special problems in unsupervised and semi-supervised learning, and I will try to write a separate post on the learnings from those problems.

Coming back to my series - Hypothesis Testing

The next step after reviewing the summaries is to form a few hypotheses based on those outputs. Most of the assumptions will concern the distribution of the data, and in the majority of cases the data is not uniformly distributed. Hence it becomes necessary to make a few assumptions (guesses or hunches) about the data based on the problem statement, and then test each assumption with a suitable statistical test. The p value from the test, compared against a chosen level of significance, tells us whether the data gives enough evidence to reject the assumption.

This gives the analyst a direction to follow and prevents blind searching and indiscriminate gathering of data.

A few major parametric (standard) tests: Z-test, t-test, F-test, Chi-square test, etc.

A few non-parametric (distribution-free) tests:

One-sample tests: Kolmogorov-Smirnov one-sample test, Randomness Test, One-Sample Sign Test, etc.

Two-sample tests: Two-Sample Sign Test, Fisher-Irwin Test, the Median Test, etc.
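To make one of the distribution-free tests concrete, here is a minimal sketch of a two-sided one-sample sign test using only the Python standard library. The function name `sign_test` and the sample values are my own illustrative assumptions (made-up numbers, not the car data from this series). The idea is that under the null hypothesis, each observation is equally likely to fall above or below the hypothesised median, so the count of positive signs follows a Binomial(n, 0.5) distribution.

```python
from math import comb


def sign_test(sample, median0):
    """Two-sided one-sample sign test (illustrative sketch).

    Returns (k, n, p_value) where k is the number of observations above
    median0 and n is the number of observations not equal to median0.
    """
    # Observations equal to the hypothesised median carry no sign and are dropped.
    diffs = [x - median0 for x in sample if x != median0]
    n = len(diffs)
    k = sum(1 for d in diffs if d > 0)

    # Under H0 the number of positive signs is Binomial(n, 0.5).
    # Take the probability of the smaller tail and double it for a two-sided test.
    smaller = min(k, n - k)
    tail = sum(comb(n, i) for i in range(smaller + 1)) / 2 ** n
    p_value = min(1.0, 2 * tail)
    return k, n, p_value


# Hypothetical ages in months, made up purely for illustration.
ages = [52, 61, 58, 49, 63, 57, 60, 55, 54, 59]
k, n, p_value = sign_test(ages, median0=56)
print(f"positive signs: {k} of {n}, p value: {p_value:.4f}")
```

With these made-up numbers the signs are split 6 to 4, so the p value is large and there is no evidence against a median of 56. No assumption about the shape of the distribution was needed, which is the whole point of a distribution-free test.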

High Level Process Map/Steps

1. State Null Hypothesis (Ho) as well as Alternate Hypothesis (Ha)

2. Specify the level of significance (α)

3. Decide the correct sampling distribution

4. Draw a few random samples and compute the appropriate test statistic from the sample data

5. Calculate the probability of the sample data

6. Check whether the probability of the sample data (the p value) is equal to or smaller than the chosen level of significance

7. If the p value is equal to or smaller than the level of significance, reject the Null Hypothesis (Ho); otherwise, fail to reject (loosely, "accept") the Null Hypothesis (Ho)
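The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: I use a two-sided one-sample z-test, the population standard deviation `sigma` is assumed to be known, and the sample values are made up. The helper name `one_sample_z_test` is hypothetical and not from any library; only `statistics.NormalDist` from the standard library is used.

```python
from math import sqrt
from statistics import NormalDist, mean


def one_sample_z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided one-sample z-test, assuming the population sigma is known.

    Step 1: H0 says the population mean equals mu0, Ha says it differs.
    Step 2: alpha is the level of significance.
    Step 3: the sampling distribution is taken to be standard normal.
    """
    n = len(sample)
    # Step 4: compute the test statistic from the sample data.
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    # Step 5: probability of a statistic at least this extreme under H0.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Steps 6 and 7: compare the p value with alpha and decide.
    reject_null = p_value <= alpha
    return z, p_value, reject_null


# Made-up sample for illustration only; sigma=6 is an assumed value.
sample = [52, 61, 58, 49, 63, 57, 60, 55, 54, 59]
z, p_value, reject_null = one_sample_z_test(sample, mu0=56, sigma=6, alpha=0.05)
print(f"z = {z:.3f}, p value = {p_value:.3f}, reject H0: {reject_null}")
```

Here the sample mean (56.8) is close to the hypothesised 56, so the p value is well above 0.05 and the null hypothesis is not rejected. If sigma is not known and the sample is small, a t-test would be the more usual choice.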

For example, as pointed out in my earlier post, the mean age of the cars is 55.95 (roughly 56) months. The question now is whether that 56-month mean holds good for all car categories, or whether different car categories have different mean ages.

The Null Hypothesis would be that the mean is the same across all car categories

The Alternate Hypothesis would be that the mean is different for different car categories

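Now we test the hypothesis by taking the category-specific data and computing the z test statistic for the sample mean, using z = (x̄ − μ) / (σ / √n), where x̄ is the sample mean of the category, μ is the hypothesised mean (56 months), σ is the standard deviation and n is the sample size.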

Then we need to compare it with the critical value for the defined level of significance in order to reject, or fail to reject, the null hypothesis.
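As a rough sketch of the critical-value comparison for one car category, the snippet below uses only the Python standard library. The category ages and the standard deviation `sigma` are assumptions I made up for illustration; only the 56-month mean comes from the earlier post. The test is two-sided at a 5% level of significance.

```python
from math import sqrt
from statistics import NormalDist, mean

alpha = 0.05   # level of significance
mu0 = 56.0     # hypothesised mean age in months (overall mean from the earlier post)
sigma = 18.0   # ASSUMED population standard deviation, not taken from the data set

# Hypothetical ages (in months) for a single car category, made up for illustration.
category_ages = [68, 74, 59, 81, 66, 72, 77, 63, 70, 75, 69, 78]

# Critical value of the standard normal for a two-sided test (about 1.96 at alpha=0.05).
z_critical = NormalDist().inv_cdf(1 - alpha / 2)

# z statistic for the category's sample mean.
n = len(category_ages)
z = (mean(category_ages) - mu0) / (sigma / sqrt(n))

# Reject H0 when the statistic falls beyond the critical value in either tail.
reject_null = abs(z) > z_critical
print(f"z = {z:.3f}, critical value = {z_critical:.3f}, reject H0: {reject_null}")
```

With these made-up numbers the category mean is 71 months, the z statistic lands beyond the critical value, and the null hypothesis would be rejected for this category. Comparing several categories at once is more commonly done with an F-test (ANOVA) than with repeated z-tests, since repeating the test per category inflates the chance of a false rejection.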


Saturday, 20 December 2014

Define Hypothesis and testing of Hypothesis(5/13)