Tech Notes

My notes on Statistics, Big Data, Cloud Computing, Cyber Security

ANOVA

Analysis of variance

  • Used to compare differences of means among more than 2 groups.
  • Compares the amount of variation between groups with the amount of variation within groups.
  • Is a method to compare multiple linear models.

When we take samples from a population, we expect each sample mean to differ simply because we are taking a sample rather than measuring the whole population; this is called sampling error but is often referred to more informally as the effects of “chance”. Thus, we always expect there to be some differences in means among different groups. The question is: is the difference among groups greater than that expected to be caused by chance? In other words, is there likely to be a true (real) difference in the population mean?

Assumptions of ANOVA

  • The response is normally distributed
  • Variance is similar within different groups
  • The data points are independent (a quick way to check these assumptions in R is sketched below)
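
One option for checking these assumptions in R is a quick sketch like the following; it uses the built-in PlantGrowth data set (a numeric weight response and a group factor), but any similar data frame would do:

# Fit a one-way ANOVA so its residuals can be inspected
fit <- aov(weight ~ group, data = PlantGrowth)

# Normality of the response, checked via the model residuals
shapiro.test(residuals(fit))

# Similar variance within the different groups
bartlett.test(weight ~ group, data = PlantGrowth)

# Independence is a property of the study design and cannot be confirmed by a test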

One Way Analysis of Variance
The one way analysis of variance allows us to compare several groups of observations, all of which are independent but possibly with a different mean for each group.
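
A minimal one-way ANOVA in R, again using the built-in PlantGrowth data (plant weights under a control and two treatment conditions):

# One-way ANOVA: does mean plant weight differ between the three groups?
fit <- aov(weight ~ group, data = PlantGrowth)
summary(fit)   # the summary table reports the F ratio and its p-value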

Two Way Analysis of Variance
Two Way Analysis of Variance is a way of studying the effects of two factors separately (their main effects).
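
A sketch of a two-way ANOVA with main effects only, using the built-in ToothGrowth data (tooth length by supplement type and dose):

# Treat dose as a factor so it acts as a grouping variable
ToothGrowth$dose <- factor(ToothGrowth$dose)

# Two-way ANOVA with the two main effects (no interaction term)
fit2 <- aov(len ~ supp + dose, data = ToothGrowth)
summary(fit2)   # one F ratio and p-value per factor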

F Distribution

F is the ratio of two variances.

The F-distribution is most commonly used in Analysis of Variance (ANOVA) and in the F test (to determine whether two variances are equal). The F-distribution is the ratio of two chi-square distributions (each divided by its degrees of freedom), and hence is right skewed. It has a minimum of 0 but no maximum value (all values are positive), and its peak is not far from 0.

The basic idea of an F test is to look at the ratio of two variances calculated from your sampled data. The null hypothesis is that the two normal populations have the same variance; the test is about the variances of the populations, not those of the samples. If the ratio (called the F value) is too extreme, you reject the null hypothesis and conclude that there is a significant difference. This idea of taking a ratio of two quantities and checking whether it is extreme is a key concept in both ANOVA and regression.
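
A small illustration of this F test in R with simulated data (the sample sizes and standard deviations below are made up for illustration; var.test() computes the ratio of the two sample variances and compares it against the F-distribution):

set.seed(42)
x <- rnorm(30, mean = 0, sd = 1)      # sample from a population with sd = 1
y <- rnorm(30, mean = 0, sd = 1.5)    # sample from a population with sd = 1.5

# H0: the two populations have equal variances
var.test(x, y)   # reports the F value (ratio of sample variances) and a p-value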

Summary calculation of ANOVA:

Take samples and measure the response in each group (for example, at different points in time). Calculate the sum of squares both within groups and between groups.

Put all the groups together and calculate the total sum of squares.

Calculate the F ratio and see if it falls into the rejection region.

The whole procedure is described below.

Total sum of squares (SST) = sum of squares between groups (SSB) + sum of squares within groups (SSW).

How to get the sum of squares within groups?

Take each group separately and calculate its mean. Subtract the group mean from each observation in that group, square each difference, and add them all up. This is called SSW.

How to get the total sum of squares?

Put all of the groups together, calculate the grand mean, and repeat the same calculation against it. This is called SST.

Now it's easy to find the sum of squares between groups: SSB = SST - SSW.

Next

SS between groups / DF(numerator), where DF(numerator) = number of groups - 1                                  ->> (1)

SS within groups / DF(denominator), where DF(denominator) = total number of observations - number of groups    ->> (2)

F score = F ratio = (1) / (2)

This is denoted F(DF(numerator), DF(denominator)) = F score.

Look up this combination of degrees of freedom in the F table to get the critical value.

Based on where the F ratio falls, either reject the null hypothesis or fail to reject it.

If F score > critical value -> we reject the null hypothesis.
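
The whole procedure can be reproduced by hand in R. A minimal sketch with three small made-up groups (the values are purely illustrative):

# Three groups of observations
g1 <- c(3, 2, 1)
g2 <- c(5, 3, 4)
g3 <- c(5, 6, 7)
obs <- c(g1, g2, g3)

k <- 3              # number of groups
n <- length(obs)    # total number of observations

# SSW: deviations of each observation from its own group mean
ssw <- sum((g1 - mean(g1))^2) + sum((g2 - mean(g2))^2) + sum((g3 - mean(g3))^2)

# SST: deviations of every observation from the grand mean
sst <- sum((obs - mean(obs))^2)

# SSB: what is left over
ssb <- sst - ssw

# F ratio = (SSB / DF numerator) / (SSW / DF denominator)
df1 <- k - 1
df2 <- n - k
f   <- (ssb / df1) / (ssw / df2)

qf(0.95, df1, df2)   # critical value at the 5% level
1 - pf(f, df1, df2)  # p-value: reject the null hypothesis if it is below 0.05

The same F ratio comes straight out of summary(aov(obs ~ factor(rep(1:3, each = 3)))), which is usually the more convenient route.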

Comparing ANOVA  and t-Test

The t-test is used when comparing two groups; ANOVA is used for comparing more than two groups.

For two groups, the p-value from ANOVA is the same as the p-value from a pooled-variance t-test.
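
A quick check of this equivalence in R with two simulated groups (a pooled-variance t-test is used so the comparison is exact; the ANOVA F value is simply the square of the t value):

set.seed(1)
values <- c(rnorm(20, mean = 10), rnorm(20, mean = 11))
group  <- factor(rep(c("A", "B"), each = 20))

t.test(values ~ group, var.equal = TRUE)   # pooled-variance two-sample t-test
summary(aov(values ~ group))               # one-way ANOVA on the same two groups
# The two p-values match, and F = t^2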

Comparing ANOVA and linear regression

Regression and ANOVA are essentially the same analysis:

  • ANOVA calculates means and deviations of data from the means.
  • In linear regression, the best-fit line through the data is calculated, along with the deviations of the data from this line. The F ratio can be calculated in both (see the sketch below).
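
This can be seen directly in R: fitting the same model with aov() and with lm() gives the same F ratio (a sketch using the built-in PlantGrowth data):

# ANOVA view: compare the group means
summary(aov(weight ~ group, data = PlantGrowth))

# Regression view: a linear model with a categorical explanatory variable
fit_lm <- lm(weight ~ group, data = PlantGrowth)
anova(fit_lm)     # same F ratio and p-value as the ANOVA table above
summary(fit_lm)   # the overall F statistic also appears at the bottom of this output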

Disclaimer: These are my study notes – online instead of on paper – so that others can benefit. In the process I've used some pictures / content from other original authors. All sources / original content publishers are listed below and they deserve credit for their work. No copyright violation intended.

In R:


R functions such as aov(), lm(), and glm() use a formula interface to specify the variables to be included in the analysis. The formula determines the model that will be built (and tested) by the R procedure. The basic format of such a formula is

response variable ~ explanatory variables

A basic regression analysis would be formulated this way…

y ~ x

Additional explanatory variables would be added like this…

y ~ x + z
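
Putting this together, a minimal end-to-end example with simulated data (the variable names and coefficients are made up for illustration):

set.seed(7)
x <- rnorm(50)
z <- rnorm(50)
y <- 2 + 1.5 * x - 0.5 * z + rnorm(50)   # invented relationship plus noise

fit <- lm(y ~ x + z)   # regression with two explanatory variables
summary(fit)           # coefficients, R-squared, and the overall F statistic

The same formula interface drives aov() and glm() as well, e.g. aov(weight ~ group, data = PlantGrowth).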


References for these notes:

http://www.youtube.com/watch?v=-yQb_ZJnFXw&list=SP3A0F3CC5D48431B3.

PS. The channel “Statisticsfun” on YouTube is really good 🙂

http://www.edanzediting.com/blog/statistics_anova_explained#.UpSgfnfCtzU

http://www.statstodo.com/FTest_Exp.php (A detailed explanation of F Test)

http://www.stats.gla.ac.uk/steps/glossary/anova.html
