Tech Notes

My notes on Statistics, Big Data, Cloud Computing, Cyber Security

Multiple Factor Regression

Multiple regression aims to find a linear relationship between a response variable and several possible predictor variables.

The model describes the response in terms of more than one variable (the model explains the ‘Y’ part). It can be linear or non-linear, but is usually linear.

E.g. Y = a + b * Factor1 + c * Factor2 + d * Factor3.

If Factor2 doesn’t contribute much, the model reduces to Y = a + b * Factor1 + c * Factor3.

How to decide whether the model is linear or quadratic? In multiple regression we need to assume this up front; the fitting procedure does not choose the functional form for us.
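
For example, a quadratic term can be included explicitly and the two fits compared. A minimal sketch, using hypothetical variables Y and Factor1 in a data frame df (not from these notes):

# linear in Factor1
model.lin <- lm(Y ~ Factor1, data = df)

# quadratic in Factor1; I() protects ^2 inside a formula
model.quad <- lm(Y ~ Factor1 + I(Factor1^2), data = df)

# compare the nested fits; a significant F test favours the quadratic term
anova(model.lin, model.quad)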

How to select variables (factors/independent variables)?

It’s OK not to know the independent variables (factors) for the model at the beginning. We start with a selection anyway (this is called model selection or variable selection). In some cases we already know what we want to use; in other cases there is a choice between using all variables and using only some.

If we use all of them, the model may turn out to be too large, may hide some important factors, and is more difficult to interpret. The best model is one that explains the measurement with the smallest possible set of factors, where each variable has a clear meaning in the context of the model.

Methods

  • AIC (described below)
  • Cross Validation (read the link in the references; a sketch follows this list)
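
As a rough sketch of how cross validation can drive the choice (my own illustration, not from the reference): fit each candidate model on k-1 folds and score it on the held-out fold. Using the swiss data from the AIC example below:

# 5-fold cross validation comparing two candidate models on the swiss data
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(swiss)))

cv.mse <- function(formula) {
  errs <- sapply(1:k, function(i) {
    fit  <- lm(formula, data = swiss[folds != i, ])
    pred <- predict(fit, newdata = swiss[folds == i, ])
    mean((swiss$Fertility[folds == i] - pred)^2)
  })
  mean(errs)
}

cv.mse(Fertility ~ .)               # all factors
cv.mse(Fertility ~ . - Examination) # without Examination

The candidate with the lower cross-validated error is preferred.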

Here AIC (Akaike Information Criterion) can be used for backward model selection.

# fit the full model first (the swiss dataset is used in the output below)
model <- lm(Fertility ~ ., data = swiss)

# backward model selection driven by AIC
model2 <- step(model)

AIC values will be printed out.

Start:  AIC=190.69
Fertility ~ Agriculture + Examination + Education + Catholic + 
    Infant.Mortality

                   Df Sum of Sq    RSS    AIC
- Examination       1     53.03 2158.1 189.86
<none>                          2105.0 190.69
- Agriculture       1    307.72 2412.8 195.10
- Infant.Mortality  1    408.75 2513.8 197.03
- Catholic          1    447.71 2552.8 197.75
- Education         1   1162.56 3267.6 209.36

This means that if Examination is removed, the AIC will drop to 189.86. The lower the AIC, the better. step() continues removing factors until the AIC can’t be reduced any further; the factors that remain are the selected set of variables.
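
To confirm which variables survived the backward selection, inspect the selected model (a small usage sketch):

# the formula step() settled on
formula(model2)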

How much does each factor contribute to changes in the dependent variable?

  • The estimated coefficient for each factor, with its t statistic and p value
  • The F statistic and p value for each factor from an ANOVA (described in the link in the reference below; a sketch follows this list)
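
For the ANOVA route, a minimal sketch on the model selected above (note that anova() on a single lm fit gives sequential sums of squares, so the order of the factors matters):

# per-factor F statistics and p values for the selected model
anova(model2)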

Example: take the summary of the linear model.

summary(model2)

Call:
lm(formula = Fertility ~ Agriculture + Education + Catholic + 
    Infant.Mortality, data = swiss)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.6765  -6.0522   0.7514   3.1664  16.1422 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       62.10131    9.60489   6.466 8.49e-08 ***
Agriculture       -0.15462    0.06819  -2.267  0.02857 *  
Education         -0.98026    0.14814  -6.617 5.14e-08 ***
Catholic           0.12467    0.02889   4.315 9.50e-05 ***
Infant.Mortality   1.07844    0.38187   2.824  0.00722 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 7.168 on 42 degrees of freedom
Multiple R-squared: 0.6993,     Adjusted R-squared: 0.6707 
F-statistic: 24.42 on 4 and 42 DF,  p-value: 1.717e-10 

The adjusted R-squared is 0.67, which means that this model explains 67% of the variance in the data. You can also see the coefficients for the model in the “Estimate” column; this is the estimated coefficient for each factor. Thus, the best model we have found is:

Fertility = 62.1013 - 0.1546 * Agriculture - 0.9803 * Education + 0.1247 * Catholic + 1.0784 * Infant.Mortality.
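
As a small usage sketch, the fitted model can be applied to new data with predict(); the input values here are made up for illustration, not taken from the dataset:

# predict Fertility for a hypothetical province
newdata <- data.frame(Agriculture = 50, Education = 10,
                      Catholic = 60, Infant.Mortality = 20)
predict(model2, newdata = newdata)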

You can also calculate the confidence interval for the effect size (R-squared) as follows.


> library(MBESS)
> ci.R2(R2=0.6707, N=47, K=4)


$Lower.Conf.Limit.R2
[1] 0.4387208

$Prob.Less.Lower
[1] 0.025

$Upper.Conf.Limit.R2
[1] 0.7843012

$Prob.Greater.Upper
[1] 0.025

N is the sample size, and K is the number of factors.
Thus, the effect size is 0.67 with 95% CI = [0.44, 0.78].

P.S. For forward model selection, read the link in the references.

Note: if we fail to include important variables in the model, the model will be mis-specified. This is called omitted variable bias.

How to interpret the model

> library(arm)
> coefplot(model2)

This coefplot gives you a very good understanding of how much each factor affects the dependent variable. The thick line represents the 1 SD range, and the thin line represents the 2 SD range.
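
If you want the numbers behind the plot, confint() on the same model gives confidence intervals for each coefficient (a small sketch, not from the original notes):

# 95% confidence intervals for the coefficients of the selected model
confint(model2, level = 0.95)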

The summary of the model above also shows Multiple R-squared (0.6993), Adjusted R-squared (0.6707), the Estimates, and the p values. The first two indicate the goodness of fit. The other two indicate whether each coefficient is likely to be non-zero. For example, the above results show that all factors have significant effects (i.e., the coefficients of all factors are non-zero) with 95% confidence.

Disclaimer: these are my study notes, kept online instead of on paper so that others can benefit. In the process I’ve used some pictures / content from other original authors. All sources / original content publishers are listed below and they deserve credit for their work. No copyright violation intended.

References for these notes:

http://yatani.jp/HCIstats/MultipleRegression (very good explanation of the whole concept)
