I had to take a short break from this series of
posts because of critical work commitments. During that time I had an
opportunity to work on some special problems pertaining to unsupervised and
semi-supervised learning. I will try to write a separate post on the
learnings from those problems.
Coming back to my series - Hypothesis Testing
The next step after getting to know the summaries is to
create or assume a few hypotheses based upon those outputs. Most of the
assumptions will be about the distribution of the data, and in the majority of
cases the data is not uniformly distributed. Hence, it becomes necessary to
make a few assumptions (guesses or hunches) about the data based upon the
problem statement, and then test each assumption with a suitable statistical
test to find out whether the data supports it, judged by whether the p value
is statistically significant.
This enables the analyst to establish a
direction to follow, and it prevents blind search and indiscriminate gathering
of data.
A few major parametric/standard tests: Z-test,
t-test, F-test, Chi-square test, etc.
A few non-parametric or distribution-free
tests:
One-sample tests: Kolmogorov-Smirnov one-sample
test, randomness test, one-sample sign test, etc.
Two-sample tests: two-sample sign test,
Fisher-Irwin test, the median test, etc.
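To give a feel for one of the distribution-free tests listed above, here is a minimal sketch of a two-sided one-sample sign test. The function name, the sample values and the hypothesized median of 56 months are all my own illustrative assumptions, not data from this series.

```python
from math import comb

def sign_test_p_value(sample, hypothesized_median):
    """Two-sided one-sample sign test, using the exact binomial distribution."""
    # Observations equal to the hypothesized median carry no sign, so drop them.
    diffs = [x - hypothesized_median for x in sample if x != hypothesized_median]
    n = len(diffs)
    if n == 0:
        raise ValueError("all observations equal the hypothesized median")
    plus = sum(1 for d in diffs if d > 0)
    k = min(plus, n - plus)
    # Under Ho each observation falls above or below the median with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical car ages in months, made up for illustration only.
ages = [48, 60, 52, 71, 66, 58, 63, 75, 69, 57]
p = sign_test_p_value(ages, 56)
print(f"sign test p value: {p:.4f}")
```

The test only looks at how many observations fall above or below the hypothesized median, which is why it needs no assumption about the shape of the distribution.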
High Level Process Map/Steps
1. State the Null Hypothesis (Ho) as well as the Alternate Hypothesis (Ha)
2. Specify the level of significance (the α value)
3. Decide the correct sampling distribution
4. Draw a random sample (or a few) and work out the appropriate value of the
test statistic from the sample data
5. Calculate the probability (p value) of getting a result at least as
extreme as the sample data if Ho is true
6. Check whether this probability is equal to or smaller than the level of
significance
7. If it is, reject the Null Hypothesis (Ho); otherwise accept (more
precisely, fail to reject) the Null Hypothesis (Ho)
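The steps above can be sketched in a few lines of Python. This is only an illustration using a two-sided one-sample z-test: the function name `z_test`, the sample values, the population standard deviation of 8 and the significance level of 0.05 are all assumptions of mine.

```python
from math import sqrt
from statistics import NormalDist, mean

def z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided one-sample z-test; returns (z, p_value, reject_h0)."""
    n = len(sample)
    # Step 4: work out the test statistic from the sample data.
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    # Step 5: probability of a result at least this extreme if Ho is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Steps 6 and 7: compare the p value with the level of significance.
    return z, p_value, p_value <= alpha

# Step 1: Ho says the mean is 56 months, Ha says it is not.
# Steps 2 and 3: alpha = 0.05 and a normal sampling distribution (sigma assumed known).
sample = [52, 68, 56, 64, 60, 60, 58, 62, 54, 66, 61, 59, 63, 57, 65, 55]  # hypothetical
z, p_value, reject = z_test(sample, mu0=56, sigma=8, alpha=0.05)
print(f"z = {z:.2f}, p = {p_value:.4f}, reject Ho: {reject}")
```

If the population standard deviation is not known, a t-test would normally be the more appropriate choice than the z-test used in this sketch.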
For example, as pointed out in my earlier post,
the mean age of the cars is 55.95 (roughly 56) months. The question now is
whether that 56 months holds good for all the car categories, or whether
different car categories have different mean ages.
Null Hypothesis (Ho): the mean is the
same across all the car categories
Alternate Hypothesis (Ha): the mean
is different for different car categories
Now we test the hypothesis by taking the
category-specific data and working out the z test statistic from the sample mean.
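For reference, the usual one-sample z statistic for a mean can be written as below. The notation is my own assumption: x̄ is the sample mean of the category, μ0 is the hypothesized mean (56 months here), σ is the population standard deviation (assumed known) and n is the sample size.

```latex
z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}
```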
Then we need to compare it with the critical
value for the defined level of significance to accept or reject the null
hypothesis.
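As a rough sketch of this comparison, the snippet below checks each category's z value against the two-sided critical value. The category names, the ages, the standard deviation of 18 months and the 0.05 level are hypothetical values chosen for illustration; only the overall mean of 56 months comes from the earlier post.

```python
from math import sqrt
from statistics import NormalDist, mean

ALPHA = 0.05        # assumed level of significance
OVERALL_MEAN = 56   # mean age in months from the earlier post
SIGMA = 18.0        # assumed (hypothetical) population standard deviation in months

# Hypothetical category-specific samples of car age in months.
samples = {
    "hatchback": [52, 64, 46, 70, 58, 55, 61, 49, 67],
    "sedan": [68, 80, 62, 86, 74, 71, 77, 65, 83],
}

# Two-sided critical value from the standard normal distribution.
z_crit = NormalDist().inv_cdf(1 - ALPHA / 2)

results = {}
for category, ages in samples.items():
    z = (mean(ages) - OVERALL_MEAN) / (SIGMA / sqrt(len(ages)))
    # Reject Ho for this category when |z| exceeds the critical value.
    results[category] = (z, abs(z) > z_crit)

for category, (z, reject) in results.items():
    decision = "reject Ho" if reject else "fail to reject Ho"
    print(f"{category}: z = {z:.2f}, critical = {z_crit:.2f} -> {decision}")
```

Comparing |z| with the critical value gives the same decision as comparing the p value with the level of significance; it is simply the other way of reading the same test.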