Tech Notes

My notes on Statistics, Big Data, Cloud Computing, Cyber Security

Category Archives: Statistics

Conditional Probability, Bayes Theorem, Naive Bayes Classifier

Both kNN and Naive Bayes are classification algorithms. Conceptually, kNN uses the idea of “nearness” to classify new entities. In kNN, “nearness” is modeled with measures such as Euclidean distance or cosine distance. By contrast, Naive Bayes uses the concept of “probability” to classify new entities.

Before someone can understand and appreciate the nuances of Naive Bayes, they need to know a couple of related concepts first, namely Conditional Probability and Bayes’ Rule. (If you are familiar with these concepts, skip to the section titled Getting to Naive Bayes’.)

Conditional Probability in plain English: what is the probability that something will happen, given that something else has already happened?
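As a rough illustration of the two ideas named above, here is a minimal Python sketch that computes a conditional probability from counts and then applies Bayes' Rule. The weather counts and all variable names are made up for this example; they are not from the original post.

```python
from fractions import Fraction

# Hypothetical counts, for illustration only: 100 days of weather records.
days_total = 100
days_cloudy = 40            # event B: the morning was cloudy
days_rain = 35              # event A: it rained later that day
days_cloudy_and_rain = 30   # both A and B happened


def conditional_probability(joint_count, given_count):
    """P(A | B) = P(A and B) / P(B), computed directly from counts."""
    return Fraction(joint_count, given_count)


# P(rain | cloudy): of the cloudy mornings, how many ended in rain?
p_rain_given_cloudy = conditional_probability(days_cloudy_and_rain, days_cloudy)

# Bayes' Rule turns the condition around:
# P(cloudy | rain) = P(rain | cloudy) * P(cloudy) / P(rain)
p_cloudy = Fraction(days_cloudy, days_total)
p_rain = Fraction(days_rain, days_total)
p_cloudy_given_rain = p_rain_given_cloudy * p_cloudy / p_rain

print(p_rain_given_cloudy)   # 3/4
print(p_cloudy_given_rain)   # 6/7
```

The Bayes' Rule result matches counting directly: 30 of the 35 rainy days had a cloudy morning, which is 6/7.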

Model Accuracy

— For Regression setting

Measuring Quality of Fit

For regression, Mean Squared Error (MSE) is the most commonly used measure. The MSE will be small if the predicted responses are very close to the true responses. The MSE computed on the data used to fit the model is the training MSE, but what we really care about is the MSE on previously unseen test data, called the test MSE. So the way to go is to evaluate the test MSE and select the learning method for which it is lowest.
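A small Python sketch of the MSE calculation, comparing the training MSE with the test MSE. The response values and predictions below are invented purely for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted responses."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical true responses and model predictions (not real data).
train_true = [1.0, 2.0, 3.0]
train_pred = [1.1, 1.9, 3.0]   # the model fits its own training data closely
test_true = [4.0, 5.0]
test_pred = [4.5, 4.0]         # predictions on unseen data are further off

train_mse = mean_squared_error(train_true, train_pred)
test_mse = mean_squared_error(test_true, test_pred)

print(train_mse)
print(test_mse)
```

In this made-up case the training MSE is much smaller than the test MSE, which is why the test MSE is the one to compare across learning methods.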


Today, your model, just like your jeans, seems to “hug” your sample data perfectly. But you want your jeans to fit a year or so down the road.

Supervised and Unsupervised Learning, Machine Learning

Machine Learning is a class of algorithms which is data-driven, i.e. unlike “normal” algorithms, it is the data that “tells” what the “good answer” is. Example: a hypothetical non-machine-learning algorithm for face recognition in images would try to define what a face is (a round, skin-colored disk, with dark areas where you expect the eyes, etc.). A machine learning algorithm would not have such a coded definition, but would “learn by examples”: you show it several images of faces and non-faces, and a good algorithm will eventually learn and be able to predict whether or not an unseen image is a face.

What is Statistical Learning

Example: If we determine that there is an association between advertising and sales, then we can adjust advertising budgets, thereby indirectly increasing sales. In other words, our goal is to develop an accurate model that can be used to predict sales on the basis of the three media budgets (TV, Newspaper, Radio).

So we try to model the relationship between Y (the output variable, sales) and X = (X1, X2, ..., Xp) (the predictor / input variables), which can be written in the very general form Y = f(X) + ε, where ε is the error term and f is some fixed but unknown function of X1, ..., Xp.
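To make the form Y = f(X) + ε concrete, here is a minimal simulation in Python. In practice f is unknown; the linear f below, the noise level, and the sample size are all assumptions chosen only for this sketch.

```python
import random


def f(x):
    # Assumed "true" function, chosen only for this illustration.
    # In a real problem f is unknown and has to be estimated from data.
    return 2.0 + 3.0 * x


rng = random.Random(0)  # fixed seed so the run is repeatable

xs = [rng.uniform(0, 10) for _ in range(1000)]     # predictor X
errors = [rng.gauss(0, 1) for _ in xs]             # error term, mean zero
ys = [f(x) + e for x, e in zip(xs, errors)]        # Y = f(X) + error

# Because the error term has mean zero, Y - f(X) averages out close to zero.
mean_error = sum(y - f(x) for x, y in zip(xs, ys)) / len(xs)
print(mean_error)
```

Statistical learning works in the opposite direction: given only the xs and ys, it tries to estimate f.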

Singular Value Decomposition (Also explains PCA)

Below is a reproduction of an answer from the Coursera discussion forum, responding to the complaint that SVD was too complicated to understand and that the material available on the web goes directly into the math instead of explaining what SVD and PCA really do.

I’ve reproduced it here because it is too good and no part needs to be edited. Also, once the course is archived, it will not be available anymore.

Full credit goes to Pete Kazmier:


After reading more and going back to the lectures, I think I finally understand the practical aspect of SVD/PCA when it comes to a data analysis. Most of the material I found online was focused on “how” these tools work and the math behind them, which is of little interest to me. I’m much more interested in the use of the tools. In short, I drive a car to work everyday, but I don’t care how its engine is built, only that it gets me from point A to point B. The following is my attempt to help others move past these lectures with some understanding of the material and how it relates to data analysis.


Principal Component Analysis

PCA is a way of identifying patterns in data and expressing the data in such a way as to highlight their similarities and differences. Since patterns can be hard to find in high-dimensional data, where the luxury of graphical representation is not available, PCA is a powerful tool for analysing data. The other main advantage of PCA is that once you have found these patterns, you can compress the data, i.e. reduce the number of dimensions, without much loss of information. This technique is used in image compression.
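The compression idea can be sketched in plain Python for the simplest case: reducing 2-D points to 1-D. The data points are invented for illustration, and the closed-form eigenvalue step below only works for a 2x2 covariance matrix; real analyses would use a linear algebra library.

```python
import math

# Hypothetical 2-D data lying close to the line y = x (illustration only).
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]

# Step 1: center the data on its mean.
n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
centered = [(x - mean_x, y - mean_y) for x, y in points]

# Step 2: sample covariance matrix [[sxx, sxy], [sxy, syy]].
sxx = sum(x * x for x, _ in centered) / (n - 1)
syy = sum(y * y for _, y in centered) / (n - 1)
sxy = sum(x * y for x, y in centered) / (n - 1)

# Step 3: eigenvalues of the symmetric 2x2 matrix, in closed form.
trace = sxx + syy
det = sxx * syy - sxy ** 2
gap = math.sqrt(max(trace ** 2 / 4 - det, 0.0))
lam1, lam2 = trace / 2 + gap, trace / 2 - gap

# Step 4: unit eigenvector for the largest eigenvalue (valid when sxy != 0).
vx, vy = sxy, lam1 - sxx
norm = math.hypot(vx, vy)
pc1 = (vx / norm, vy / norm)

# Step 5: compress each 2-D point to a single score along the first component.
scores = [x * pc1[0] + y * pc1[1] for x, y in centered]
explained = lam1 / (lam1 + lam2)

print(pc1)
print(scores)
print(explained)
```

For this made-up data, the first principal component accounts for more than 99% of the total variance, so dropping the second dimension loses very little information.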

Clustering, Hierarchical Clustering, K-Means Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.
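As one concrete instance of clustering, here is a minimal sketch of K-Means on one-dimensional data in Python. The function name, the data values, and the starting centers are all assumptions made for this example.

```python
def k_means_1d(values, centers, iterations=10):
    """Basic K-Means on 1-D data: assign each value to its nearest center,
    then move each center to the mean of the values assigned to it."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: group values by their nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: an empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical data with two obvious groups, and deliberately rough start centers.
values = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centers, clusters = k_means_1d(values, centers=[0.0, 5.0])

print(centers)    # [1.5, 9.5]
print(clusters)   # [[1.0, 1.5, 2.0], [9.0, 9.5, 10.0]]
```

Values in the same cluster end up closer to each other than to values in the other cluster, which is the sense of similarity used in this sketch.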

Prediction with Regression

Prediction with Linear Regression

Fit a linear regression on a training set (a data frame), as described in the post Linear Regression, to get a linear model, then use that model to make predictions.
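The fit-then-predict workflow can be sketched in plain Python with a single predictor. This is not the code from the referenced post; the function names and the training values are hypothetical, and plain lists stand in for a data frame.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares fit of y = intercept + slope * x on the training set."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, new_xs):
    """Apply the fitted model to new predictor values."""
    intercept, slope = model
    return [intercept + slope * x for x in new_xs]


# Hypothetical training set that follows y = 1 + 2x exactly.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]

model = fit_simple_linear_regression(train_x, train_y)
predictions = predict(model, [5.0, 6.0])

print(model)         # (1.0, 2.0)
print(predictions)   # [11.0, 13.0]
```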

Data Analysis Landscape and MOOC courses

Here is the mind map of the landscape diagram and the relevant MOOCs (as of today, at least).