Tech Notes

My notes on Statistics, Big Data, Cloud Computing, Cyber Security

Clustering, Hierarchical Clustering, K-Means Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.
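As a minimal sketch of the idea, here is k-means clustering in R on synthetic data (the two cluster centres at (0,0) and (3,3) are made-up values for illustration); base R's kmeans() does the grouping:

```r
set.seed(42)
# two synthetic 2-D groups of 25 points, centred at (0,0) and (3,3)
x <- rbind(matrix(rnorm(50), ncol = 2),
           matrix(rnorm(50, mean = 3), ncol = 2))
fit <- kmeans(x, centers = 2)
table(fit$cluster)   # roughly 25 points assigned to each cluster
fit$centers          # the two recovered centroids
```

Points within each recovered cluster are closer to their own centroid than to the other one, which is exactly the "more similar within than between" notion above.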

Prediction with Regression

Prediction with Linear Regression

Fit a linear regression, as described in the Linear Regression post, on a training set (a data frame) to obtain a linear model, then use that model to make predictions.
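A minimal sketch of fit-then-predict in R, using a made-up training frame (the true relationship y = 2x + 1 plus small noise is an assumption for illustration):

```r
set.seed(1)
# synthetic training set: y is roughly 2*x + 1
train <- data.frame(x = 1:10)
train$y <- 2 * train$x + 1 + rnorm(10, sd = 0.1)

model <- lm(y ~ x, data = train)              # fit on the training set
predict(model, newdata = data.frame(x = c(11, 12)))  # predict on new data
```

The predictions come out close to 23 and 25, as the underlying line would suggest.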

Data Analysis Landscape and MOOC courses

Here is the mind map of the landscape diagram and the relevant MOOCs (as of today, at least):
Data Analysis

Bootstrapping, Bagging, Boosting and Random Forest

To understand the bootstrap, suppose it were possible to draw repeated samples (of the same size) from the population of interest, a large number of times. Then one would get a fairly good idea about the sampling distribution of a particular statistic from the collection of its values arising from these repeated samples. But that does not make sense, as it would be too expensive and defeat the purpose of a sample study. The purpose of a sample study is to gather information cheaply in a timely fashion. The idea behind the bootstrap is to use the sample data at hand as a "surrogate population" for the purpose of approximating the sampling distribution of a statistic; i.e. to resample (with replacement) from the sample data at hand and create a large number of "phantom samples", known as bootstrap samples. The sample summary is then computed on each of the bootstrap samples (usually a few thousand). A histogram of the set of these computed values is referred to as the bootstrap distribution of the statistic.
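A minimal sketch of this in R, bootstrapping the median of a made-up exponential sample (the sample, the statistic, and B = 2000 are illustrative choices):

```r
set.seed(1)
x <- rexp(100)    # the sample at hand -- our "surrogate population"
B <- 2000         # number of bootstrap ("phantom") samples

# resample with replacement B times, computing the summary each time
boot_medians <- replicate(B, median(sample(x, replace = TRUE)))

# the bootstrap distribution of the median; a percentile interval from it
quantile(boot_medians, c(0.025, 0.975))
```

hist(boot_medians) would show the bootstrap distribution described above.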

Moving Average and Smoothing

A moving average (rolling average or running average) is a calculation used to analyse data points by creating a series of averages of different subsets of the full data set.

Given a series of numbers and a fixed subset size, the first element of the moving average is obtained by taking the average of the initial fixed subset of the number series. Then the subset is modified by “shifting forward”; that is, excluding the first number of the series and including the next number following the original subset in the series. This creates a new subset of numbers, which is averaged. This process is repeated over the entire data series. The plot line connecting all the (fixed) averages is the moving average.
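The shifting-subset process above can be sketched in R with stats::filter and equal weights (the series and the window size of 3 are made-up values):

```r
x <- c(4, 8, 15, 16, 23, 42)
k <- 3
# equal weights 1/k average each window; sides = 1 aligns each
# average with the end of its window (a trailing moving average)
ma <- stats::filter(x, rep(1/k, k), sides = 1)
as.numeric(ma)   # NA NA 9 13 18 27
```

The first k − 1 positions are NA because a full window of 3 values is not yet available; (4+8+15)/3 = 9 is the first complete average.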

Decision Trees and Prediction with trees

A decision tree is a kind of flowchart: a graphical representation of the process for making a decision or a series of decisions.
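A minimal sketch of growing and using a classification tree in R, assuming the rpart package (shipped with standard R installations as a recommended package) and the built-in iris data:

```r
library(rpart)
# grow a classification tree predicting species from the other columns
tree <- rpart(Species ~ ., data = iris, method = "class")
# follow the tree's decisions for one observation
predict(tree, iris[1, ], type = "class")   # setosa
```

printcp(tree) or plot(tree) would show the flowchart structure: each internal node is a yes/no question about a feature, each leaf a predicted class.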

Cross Validation

Cross-validation is primarily a way of measuring the predictive performance of a statistical model. Every statistician knows that the model fit statistics are not a good guide to how well a model will predict: a high R^2 does not necessarily mean a good model. It is easy to over-fit the data by including too many degrees of freedom and so inflate R^2 and other fit statistics. For example, in a simple polynomial regression I can just keep adding higher-order terms and so get better and better fits to the data. But the predictions from the model on new data will usually get worse as higher-order terms are added.
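A minimal k-fold cross-validation sketch in base R (the synthetic data, k = 5, and squared-error loss are illustrative choices):

```r
set.seed(7)
df <- data.frame(x = runif(100))
df$y <- 3 * df$x + rnorm(100)        # synthetic data with noise sd = 1

k <- 5
folds <- sample(rep(1:k, length.out = nrow(df)))  # random fold labels

cv_mse <- sapply(1:k, function(i) {
  fit  <- lm(y ~ x, data = df[folds != i, ])      # train on k-1 folds
  pred <- predict(fit, newdata = df[folds == i, ])
  mean((df$y[folds == i] - pred)^2)               # error on held-out fold
})
mean(cv_mse)   # cross-validated estimate of prediction error
```

Because every error is measured on data the model never saw, this estimate does not reward over-fitting the way in-sample R^2 does.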

Prediction Study Design

Steps in building a prediction model:
1. Find the right data
2. Define your error rate
3. Split data into:

  • Training
  • Testing
  • Validation (Optional)

4. On the training set, pick features
5. On the training set, pick the prediction function
6. On the training set, cross-validate
7. If there is no validation set: apply the model once to the test set
8. If there is a validation set: apply to the test set and refine
9. If there is a validation set: apply once to the validation set
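Step 3 (the three-way split) can be sketched in R; the 60/20/20 proportions and the use of the built-in iris data are illustrative assumptions:

```r
set.seed(123)
n   <- nrow(iris)          # 150 rows in the built-in iris data
idx <- sample(n)           # a random permutation of the row indices

train <- iris[idx[1:90], ]      # 60% training
test  <- iris[idx[91:120], ]    # 20% testing
valid <- iris[idx[121:150], ]   # 20% validation (optional)
```

Shuffling before splitting matters: it prevents any ordering in the original data (here, iris rows sorted by species) from leaking into the split.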

ANOVA

Analysis of variance

  • Used to compare differences of means among more than two groups.
  • Compares the amount of variation between groups with the amount of variation within groups.
  • Is a method to compare multiple linear models.
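A minimal one-way ANOVA sketch in R using the built-in iris data (the choice of petal length as response is illustrative):

```r
# does mean petal length differ across the three iris species?
fit <- aov(Petal.Length ~ Species, data = iris)
# the F statistic is the ratio of between-group to within-group variation
summary(fit)
```

The very large F (and tiny p-value) says the between-species variation dwarfs the within-species variation, so the group means differ.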

Confidence Interval Examples

Example 1: Confidence interval for proportions

Statement: In a poll of 200 people, 152 people had a computer. Estimate, with a 95% confidence interval, the proportion of people who have at least one computer.


Calculation:

prop.test(152,200,conf.level=0.95)
1-sample proportions test with continuity correction
data: 152 out of 200, null probability 0.5
X-squared = 53.045, df = 1, p-value = 3.26e-13
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.6936108 0.8161811
sample estimates:
 p 
0.76

Alternative 1:

For a 95% confidence interval:

zstar = 1.96
n = 200
phat = 152/200                    # sample proportion
SE = sqrt(phat * (1 - phat)/n)    # standard error
MOE = zstar * SE                  # margin of error
CI = phat + c(-MOE, MOE)
CI
[1] 0.7008093 0.8191907

So basically, the point estimate is 0.76 and the margin of error is ±0.059.

We are 95% confident that the proportion of people with at least one computer is between 70.1% and 81.9%.

Example 2: Sample size for estimating proportions

In the above example the MOE was 5.9%. What sample size do we need to get an MOE of 3%?

Using the MOE formula above and solving for n:

n = phat * (1 - phat) * (zstar/MOE)^2

Substituting MOE = 0.03 gives n ≈ 778.5, so a sample of at least 779 people is needed.
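The arithmetic as a quick R check (phat = 0.76 carried over from Example 1):

```r
phat  <- 0.76          # point estimate from Example 1
zstar <- qnorm(0.975)  # ~1.96
MOE   <- 0.03
n <- phat * (1 - phat) * (zstar / MOE)^2
ceiling(n)   # 779 -- a required sample size is always rounded up
```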

Example 3: Confidence Intervals about the Mean, Population Standard Deviation Unknown

25 test takers had a mean of 520 marks with an SD of 80. Construct a 95% confidence interval about the mean.


s = 80
n = 25
SE = s/sqrt(n)    # standard error of the mean
SE
[1] 16
MOE = qt(.975, df = n - 1) * SE
MOE
[1] 33.02238
xbar = 520
xbar + c(-MOE, MOE)
[1] 486.9776 553.0224

Example 4: Confidence Intervals about the Mean, Population Standard Deviation Known


The calculation is similar to Example 3; the only difference is the MOE, which uses the normal quantile because the population standard deviation is known:

MOE = qnorm(.975) * SE

Disclaimer: These are my study notes, kept online instead of on paper so that others can benefit. In the process I have used some pictures/content from other original authors. All sources/original content publishers are listed below and they deserve credit for their work. No copyright violation intended.

References for these notes:

The study material for the MOOC “Making sense of data” at Coursera.org

http://www.youtube.com/watch?v=vrod7OScpC4&list=PL568547ACA9211CCA&index=51