Tech Notes

My notes on Statistics, Big Data, Cloud Computing, Cyber Security

Tag Archives: Null hypothesis

Hypothesis Testing for Means, Matched Pairs, Independent Samples

Example 1 – Hypothesis test for small sample means.

  • Statement

Claim: the mean amount of waste recycled per day is more than 1 pound per person (over the population).
Sample – 12 people, found to be recycling an average of 1.46 pounds per day with SD = 0.58. Alpha = 0.05

  • Parameter statement – test the claim
  • Hypothesis
    • H0 – Mean is LT or EQ 1 pound per day
    • H1 – Mean is GT 1 pound per day
  • Assumption – Data follows normal distribution (parametric)

Why does the claim go under H1 and not under H0? That's because H0 always has an “equal to” in it (EQ, LT or EQ, GT or EQ), and the claim here is a strict “greater than”.

  • Choose test

Right tailed (because H1 has a GT in it), t-test (population SD is unknown and the sample size is less than 30)

  • Calculation
xbar = 1.46 # sample mean
mu0 = 1 # hypothesized value
s = 0.58 # sample standard deviation
n = 12 # sample size
t = (xbar-mu0)/(s/sqrt(n))
t
[1] 2.747391
pval = pt(t, df=n-1, lower.tail=FALSE)
pval # upper tail p-value
[1] 0.009489493

  • Decision

At alpha = 0.05, since the p-value (0.0095) is LT alpha, we have strong evidence to reject the null hypothesis and support the claim that the mean is more than 1 pound per day.
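The same decision can also be reached by comparing the test statistic with a critical value instead of using the p-value. This is a minimal sketch using the numbers from this example; the names t.stat, t.crit and reject are only illustrative.

```r
xbar  <- 1.46   # sample mean
mu0   <- 1      # hypothesized value
s     <- 0.58   # sample standard deviation
n     <- 12     # sample size
alpha <- 0.05   # significance level

t.stat <- (xbar - mu0) / (s / sqrt(n))
t.crit <- qt(1 - alpha, df = n - 1)   # upper tail critical value

# Right tailed test: reject H0 when the statistic exceeds the critical value
reject <- t.stat > t.crit
reject
```

If reject is TRUE, the statistic falls in the rejection region, which agrees with the p-value approach.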

Example 2 – Hypothesis testing for matched pairs

  • Statement

The built-in R data set immer (from the MASS package) records the barley yield of the same fields in 1931 and 1932 at various locations. The claim is that the yields are the same.

  • Parameter statement – test the claim
  • Hypothesis
    • H0 – The yields are the same. That is mu(Y1-Y2) = 0
    • H1 – The yields are different. That is mu(Y1-Y2) != 0
  • Choose test – t-test
  • Calculation
library(MASS)
head(immer)
t.test(immer$Y1, immer$Y2, paired=TRUE)

Paired t-test

data: immer$Y1 and immer$Y2
t = 3.324, df = 29, p-value = 0.002413
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
6.121954 25.704713
sample estimates:
mean of the differences
15.91333

  • Decision

Assuming an alpha level of 0.05, the p-value (0.002413) is LT alpha, so we have enough evidence to reject the null hypothesis

Alternative 1 : the detailed way of doing the calculation

library(MASS)
head(immer)
new.immer <- transform(immer, new.col=Y1-Y2) # difference per field, same order as t.test(Y1, Y2)
mean(new.immer$new.col)
[1] 15.91333
sd(new.immer$new.col)
[1] 26.2218
xbar = 15.91 # sample mean of the differences (rounded)
mu0 = 0 # hypothesized value
sigma = 26.2218 # sample standard deviation of the differences
n = 30 # sample size
t = (xbar-mu0)/(sigma/sqrt(n))
t
[1] 3.323291

Alternative 2 : Variant of Example 2. Suppose the above dataset is not given, but the mean and SD of each variable are provided, along with the correlation coefficient between Y1 and Y2. How do we proceed?

Y1 mean = 109.04, Y2 mean = 93.13

SD of Y1 = 28.67, SD of Y2 = 24.27

r = 0.52

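sd1 <- 28.67 # SD of Y1
sd2 <- 24.27 # SD of Y2
r <- 0.52 # correlation coefficient between Y1 and Y2
xbar <- 109.04 - 93.13 # difference of the two means
sigma <- sqrt(sd1^2 + sd2^2 - 2*r*sd1*sd2) # SD of the differences
mu0 = 0 # hypothesized value
n = 30 # sample size
t = (xbar-mu0)/(sigma/sqrt(n))
t # approximately 3.32, same as Alternative 1 up to rounding of the inputs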
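The formula for the SD of the differences used above can be checked against the actual data. A small sketch, assuming the MASS package is installed; the names sd.from.summary and sd.direct are made up for this check.

```r
library(MASS)   # provides the immer data set

sd1 <- sd(immer$Y1)
sd2 <- sd(immer$Y2)
r   <- cor(immer$Y1, immer$Y2)

# SD of the differences computed from the summary statistics only
sd.from.summary <- sqrt(sd1^2 + sd2^2 - 2 * r * sd1 * sd2)

# SD of the differences computed directly from the paired data
sd.direct <- sd(immer$Y1 - immer$Y2)

all.equal(sd.from.summary, sd.direct)
```

The two values agree because the variance of a difference is the sum of the two variances minus twice the covariance, and the covariance equals r times the two SDs.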

Example 3 – Hypothesis testing for independent samples

Using the data from Example 2, suppose the two samples were independent. The only change is to leave out “paired=TRUE”; t.test then runs the Welch two sample test, which does not assume equal variances.

library(MASS)
head(immer)
t.test(immer$Y1, immer$Y2)
Welch Two Sample t-test

data: immer$Y1 and immer$Y2
t = 2.32, df = 56.463, p-value = 0.02398
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
2.17493 29.65174
sample estimates:
mean of x mean of y
109.04667 93.13333 
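The Welch statistic and its degrees of freedom can also be worked out by hand from the summary statistics. This is a sketch under the assumption that the rounded means and SDs quoted in Example 2 are used, so the results match the t.test output only up to rounding.

```r
m1 <- 109.04; s1 <- 28.67; n1 <- 30   # Y1 mean, SD, sample size
m2 <- 93.13;  s2 <- 24.27; n2 <- 30   # Y2 mean, SD, sample size

v1 <- s1^2 / n1   # squared standard error of the first mean
v2 <- s2^2 / n2   # squared standard error of the second mean

t.welch  <- (m1 - m2) / sqrt(v1 + v2)
# Welch-Satterthwaite approximation for the degrees of freedom
df.welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
p.welch  <- 2 * pt(-abs(t.welch), df = df.welch)   # two tailed p-value

c(t = t.welch, df = df.welch, p.value = p.welch)
```

Note how ignoring the pairing makes the standard error larger, so the statistic is smaller than in the paired test.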

Disclaimer : These are my study notes – online – instead of on paper so that others can benefit. In the process I have used some pictures / content from other original authors. All sources / original content publishers are listed below and they deserve credit for their work. No copyright violation intended.

References for these notes :

The study material for the MOOC “Making sense of data” at Coursera.org

http://www.youtube.com/watch?v=jfUhKHX5S0E

http://www.r-tutor.com/elementary-statistics/hypothesis-testing/upper-tail-test-population-mean-unknown-variance


Hypothesis test for Proportions – 1 sample, 2 sample

Example 1 (Test for proportions)

Statement
Population – XYZ Intl claims that 45% of people in country ABC support banning cigarettes
Sample (real world) – 200 people are asked whether they support banning cigarettes.
49% say yes. Is there enough evidence to reject the claim?

  • Parameter statement – To test the claim
  • Hypothesis
    • Null Hypothesis – H0 – Proportion of people supporting p=0.45
    • Alternative Hypothesis – H1 – Proportion of people supporting p != 0.45
  • Assumption – Sample is large enough for the sample proportion to follow an approximately normal distribution (parametric)
  • Choose Test

Two tailed, Z-test, Significance level =.05

  • Calculations
pbar=0.49
p0=0.45
n=200
z = (pbar-p0)/sqrt(p0*(1-p0)/n)
z
[1] 1.13707

The critical values at .05 significance level are

alpha = .05
z.half.alpha = qnorm(1-alpha/2)
c(-z.half.alpha, z.half.alpha)
[1] -1.959964  1.959964


The test statistic 1.13707 lies between the critical values -1.9600 and 1.9600.


pvalue2sided=2*pnorm(-abs(z))
pvalue2sided

[1] 0.2555088
  • Decision

Hence, at .05 significance level, since the p-value 0.2555 is GT alpha, we do not have enough evidence to reject the null hypothesis

Example 2 (Test for proportions)

  • Statement

Population – XYZ Intl claims that less than 44% of people in country ABC support banning cigarettes
Sample (real world) – 1046 people are asked whether they support banning cigarettes.
42% say yes. Is there enough evidence to support the claim?

  • Parameter statement – To test the claim
  • Hypothesis
    • Null Hypothesis – H0 – Proportion of people supporting p=0.44
    • Alternative Hypothesis – H1 – Proportion of people supporting p < 0.44
  • Assumption – Sample is large enough for the sample proportion to follow an approximately normal distribution (parametric)
  • Choose Test

One tailed, Z-test, Significance level =.05

  • Calculations
pbar=0.42
p0=0.44
n=1046
z = (pbar-p0)/sqrt(p0*(1-p0)/n)
z
[1] -1.303093
alpha = .05
z.alpha = qnorm(alpha) # lower tail critical value for a left tailed test
z.alpha
[1] -1.644854
pvalue1sided=pnorm(z) # lower tail p-value
pvalue1sided
[1] 0.09627147
  • Decision

Hence, at .05 significance level, since the p-value 0.0963 is GT alpha, we do not have enough evidence to reject the null hypothesis, so the sample does not support the claim
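Both one sample proportion examples follow the same recipe, so they can be wrapped in a small function. The helper prop.z.test below is hypothetical (it is not part of base R) and is only a sketch of the calculations shown in Examples 1 and 2.

```r
# Hypothetical helper, not part of base R: one sample z-test for a proportion
prop.z.test <- function(pbar, p0, n,
                        alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  z <- (pbar - p0) / sqrt(p0 * (1 - p0) / n)
  p.value <- switch(alternative,
    two.sided = 2 * pnorm(-abs(z)),             # both tails
    less      = pnorm(z),                       # lower tail
    greater   = pnorm(z, lower.tail = FALSE)    # upper tail
  )
  list(z = z, p.value = p.value)
}

ex1 <- prop.z.test(0.49, 0.45, 200)                          # Example 1, two tailed
ex2 <- prop.z.test(0.42, 0.44, 1046, alternative = "less")   # Example 2, left tailed

ex1$p.value < 0.05   # FALSE, so do not reject H0
ex2$p.value < 0.05   # FALSE, so do not reject H0
```

The alternative argument mirrors the direction of H1, which decides which tail of the normal distribution the p-value comes from.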

Example 3 (Testing differences between proportions aka comparing proportions)

  • Statement

200 random adult females and 250 random adult males were asked if they shop online. 30% of the females and 38% of the males said yes. At alpha = 0.1, test the claim that there is a difference between the proportion of female users and the proportion of male users who shop online.

  • Parameter statement – To test the claim
  • Hypothesis
    • Null Hypothesis – H0 – Proportion of females = proportion of males, same as proportion of females - proportion of males = 0
    • Alternative Hypothesis – H1 – Proportion of females != proportion of males, same as proportion of females - proportion of males != 0
  • Assumption – Samples are large enough for the sample proportions to follow an approximately normal distribution (parametric)
  • Choose Test

Two sample, Z-test, alpha =0.1

Use the online calculator at http://www.socscistatistics.com/tests/ztest/Default2.aspx to calculate Z and P Values

The Z-Score is -1.7746. The p-value is 0.07672. Hence, at .1 significance level, we have evidence to reject the null hypothesis
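The same two sample z-test can be done in R without the online calculator. A minimal sketch using the pooled proportion; the variable names are illustrative, and the results should agree with the calculator up to rounding.

```r
n1 <- 200; p1 <- 0.30   # females: sample size and sample proportion
n2 <- 250; p2 <- 0.38   # males: sample size and sample proportion
alpha <- 0.1

# Pooled proportion under H0 (both groups share one proportion)
p.pooled <- (n1 * p1 + n2 * p2) / (n1 + n2)
se <- sqrt(p.pooled * (1 - p.pooled) * (1 / n1 + 1 / n2))

z <- (p1 - p2) / se
p.value <- 2 * pnorm(-abs(z))   # two tailed p-value

c(z = z, p.value = p.value)
p.value < alpha   # TRUE means reject H0
```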

Example 4 (Independent samples – 2 sample)

Poll1 – June 2011, n1 = 1050, phat1 = 57%
Poll2 – Sep 2011, n2 = 1046, phat2 = 42%

Claim: the support has changed between the two polls.

  • Hypothesis
    • H0 – support did not change, p1-p2 = 0
    • H1 – support changed, p1-p2 != 0
  • Calculation
n1 = 1050
n2 = 1046
phat1=0.57
phat2=0.42
# number of successes
x1=round(n1*phat1,0)
x1
[1] 598
x2=round(n2*phat2,0)
x2
[1] 439
prop.test(c(x1,x2), c(n1,n2), alternative='two.sided', correct=F)
2-sample test for equality of proportions without continuity
 correction
data: c(x1, x2) out of c(n1, n2)
X-squared = 47.058, df = 1, p-value = 6.892e-12
alternative hypothesis: two.sided
95 percent confidence interval:
 0.1075049 0.1921546
sample estimates:
 prop 1 prop 2 
0.5695238 0.4196941
  • Decision
The 95 percent confidence interval puts the true change in support between 10.8 and 19.2 percentage points. The p-value is very small, which is strong evidence to reject the null hypothesis.

PS – prop.test reports X-squared, which is not the test statistic we want. For this two sample test, X-squared is the square of the z-score. To calculate the z-score directly:

phat_pooled = (n1*phat1 + n2*phat2)/(n1+n2)

z=(phat1-phat2)/sqrt(phat_pooled * (1-phat_pooled)*(1/n1 + 1/n2))
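To see how the z-score relates to the prop.test output, the sketch below recomputes z from the rounded success counts (598 and 439, as computed above) and compares its square with X-squared. The names chisq and p.value are only illustrative.

```r
n1 <- 1050; n2 <- 1046
x1 <- 598;  x2 <- 439     # number of successes in each poll

phat1 <- x1 / n1
phat2 <- x2 / n2
phat_pooled <- (x1 + x2) / (n1 + n2)

z <- (phat1 - phat2) / sqrt(phat_pooled * (1 - phat_pooled) * (1 / n1 + 1 / n2))
p.value <- 2 * pnorm(-abs(z))   # two tailed p-value

# X-squared reported by prop.test without continuity correction
chisq <- prop.test(c(x1, x2), c(n1, n2), correct = FALSE)$statistic

c(z = z, z.squared = z^2, X.squared = unname(chisq), p.value = p.value)
```

Using phat1 = 0.57 and phat2 = 0.42 directly instead of the counts gives a slightly different z, because the counts are rounded to whole people.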


References for these notes :

The study material for the MOOC “Making sense of data” at Coursera.org

Hypothesis Test for Proportions – YouTube

http://www.youtube.com/watch?v=h2zyqRyoCfs

Hypothesis Testing

It is based on the idea that we can tell things about the population based on a sample taken from it.

5 Steps

  1. Hypothesis
  2. Significance
  3. Sample
  4. P-Value
  5. Decide

Inferential Statistics is based on the premise that you cannot prove something to be true, but you can disprove something by finding an exception.

You decide what you want to find evidence for (H1 – there is an effect), i.e. the alternative hypothesis, then set up the null hypothesis (H0 – there is no effect) and look for evidence to reject it.

This is a statistical method for testing whether the factor we are talking about has any effect on our observation

In other words, this helps us decide if

  • We should believe that the relationship we found in our sample is the same as the relationship we would find if we tested the population
  • OR We should believe that the relationship we found in our sample is a coincidence due to sampling error
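Sampling error can be made concrete with a small simulation. The sketch below is only an illustration under assumed values (normal population, mean 1, SD 0.58, samples of 12, an arbitrary seed): H0 is true by construction, yet some samples still look significant.

```r
set.seed(1)      # arbitrary seed so the sketch is reproducible
alpha <- 0.05    # significance level
mu0   <- 1       # true population mean, so H0 holds in this simulation
n     <- 12      # size of each sample
reps  <- 2000    # number of simulated samples

p.values <- replicate(reps, {
  x <- rnorm(n, mean = mu0, sd = 0.58)   # one sample from the population
  t.test(x, mu = mu0)$p.value            # two tailed one sample t-test
})

# Share of samples that look "significant" purely because of sampling error
rejection.rate <- mean(p.values < alpha)
rejection.rate
```

The share should be close to alpha, which is what the significance level means: the chance of rejecting H0 when it is actually true.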
