Tech Notes

My notes on Statistics, Big Data, Cloud Computing, Cyber Security

Distributions, Variables, Relationship between Variables

Data Types :

  • Quantitative – Numerical values for which arithmetic makes sense. Can be discrete (counts) or continuous (measurements).

  • Categorical (Qualitative) – Values that place each record into one of several categories. Records can be grouped by category; for example, shoes in a cupboard can be sorted according to colour. Can be illustrated using
  1. Bar charts
  2. Relative frequencies
  3. Pie charts
  • Nominal – Categories with no natural order; any codes are only labels. In a data set males could be coded as 0 and females as 1; marital status of an individual could be coded as Y if married, N if single.
  • Ordinal – Categories that can be ranked (put in order) or have a rating scale attached.
    For example, each variety of biscuit could be classified on a rating scale of 1 to 5, representing strongly dislike, dislike, neutral, like, strongly like.
  • Interval Scale
    An interval scale is a scale of measurement where the distance between any two adjacent units of measurement (or ‘intervals’) is the same but the zero point is arbitrary.
    For example, the time interval between the starts of years 1981 and 1982 is the same as that between 1983 and 1984, namely 365 days.
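
A small sketch of what an arbitrary zero point means in practice. Temperature does not appear in these notes; it is my own assumed example of an interval scale, and `c_to_f` is only an illustrative helper. Equal intervals stay equal when the scale changes, but ratios do not carry over:

```python
def c_to_f(celsius):
    """Convert Celsius to Fahrenheit (both are interval scales)."""
    return celsius * 9 / 5 + 32


# Made-up readings, chosen only for illustration.
temps_c = [10.0, 20.0, 30.0]
temps_f = [c_to_f(t) for t in temps_c]

# Equal intervals on one scale are still equal on the other.
gaps_c = [b - a for a, b in zip(temps_c, temps_c[1:])]
gaps_f = [b - a for a, b in zip(temps_f, temps_f[1:])]

# Ratios depend on where zero sits, so "twice as hot" is not meaningful.
ratio_c = temps_c[1] / temps_c[0]
ratio_f = temps_f[1] / temps_f[0]

print("gaps:", gaps_c, gaps_f)
print("ratios:", ratio_c, ratio_f)
```

The two gap lists each contain a single repeated value, while the two ratios differ, which is the practical sign of an interval (rather than ratio) scale.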

Ways to Visualize Data

  • Frequency Table
  • Pie Chart
  • Bar Chart
  • Dot Plot
  • Histogram
  • Stem and Leaf Plot (Trees)
  • Box and Whisker Plot (or Boxplot)
  • Scatter Plot
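
As a rough illustration of the first few items, here is a minimal sketch that builds a frequency table with relative frequencies and prints a text bar chart. The shoe colours are made-up data and `frequency_table` is a hypothetical helper name, not something from the course material:

```python
from collections import Counter


def frequency_table(values):
    """Map each category to (count, relative frequency), most common first."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, n / total) for cat, n in counts.most_common()}


# Made-up categorical data: colours of shoes in a cupboard.
colours = ["black", "brown", "black", "white", "black", "brown"]
table = frequency_table(colours)

for colour, (n, rel) in table.items():
    # One '#' per observation gives a crude horizontal bar chart.
    print(f"{colour:<6} {n:>2} {rel:6.1%} {'#' * n}")
```

The relative frequencies are the same numbers a pie chart would show as slice sizes.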

Distributions : The pattern of values in the data

Characteristics of Data

  • Outlier
    An outlier is an observation in a data set which is far removed in value from the others in the data set. It is an unusually large or an unusually small value compared to the others.
  • Symmetry
    Symmetry is implied when data values are distributed in the same way above and below the middle of the sample.

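  • Skewness
    Skewness is a lack of symmetry: the values trail off further on one side of the middle than on the other. A longer right tail is positive (right) skew; a longer left tail is negative (left) skew.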

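These characteristics can be checked numerically. The sketch below makes two assumptions that are not in the notes above: it flags outliers with the common 1.5 × IQR rule (the one usually behind modified box plots), and it measures skewness with the moment coefficient m3 / m2^1.5. The function names and the sample data are made up for illustration:

```python
import statistics


def iqr_outliers(values):
    """Return values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


def skewness(values):
    """Moment coefficient of skewness: 0 for symmetric data,
    positive for a longer right tail, negative for a longer left tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


sample = [2, 3, 4, 5, 6, 7, 8, 40]   # made-up data with one large value
print("outliers:", iqr_outliers(sample))
print("skewness:", skewness(sample))
print("symmetric skewness:", skewness([1, 2, 3, 4, 5]))
```

Other quartile conventions and other skewness formulas exist, so exact values can differ slightly between tools.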

Variables : The relationship between a quantitative and a categorical variable is generally examined using box plots (or modified box plots), comparing the shapes across the categories.

Graphical Representations of the Pattern

Histogram

What can be read from a histogram

  1. Modes (unimodal, multimodal)
  2. Extent of spread of the data
  3. Symmetry of the data
  4. Outliers
  5. Skewness (i.e. a longer left tail means left skewed)
  6. Overall shape, which is comparable to that of a horizontal box plot
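
To make the list above concrete, here is a minimal sketch of how a histogram is built: values are counted into equal-width bins, and the shape is read off the counts. `histogram_counts`, the bin width and the data are all my own illustrative choices, and the helper assumes every value is at least `start`:

```python
def histogram_counts(values, bin_width, start):
    """Count values into equal-width bins beginning at `start`.

    Returns (left_edge, count) pairs, including empty bins, so that
    gaps in the data remain visible. Assumes min(values) >= start.
    """
    n_bins = int((max(values) - start) // bin_width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // bin_width)] += 1
    return [(start + i * bin_width, c) for i, c in enumerate(counts)]


data = [12, 15, 21, 22, 24, 25, 27, 31, 33, 58]   # made-up values
bins = histogram_counts(data, bin_width=10, start=10)

for left, count in bins:
    # One '*' per observation; the bar lengths show the shape.
    print(f"{left:>3}-{left + 9:<3} {'*' * count}")
```

With this sample there is a single tallest bin (one mode), an empty bin, and a lone value in the last bin that would be worth checking as a possible outlier.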

Bell Curve

The bell curve is the shape of the normal distribution: unimodal and symmetric about the mean.

Transformation to Normality
If there is evidence of marked non-normality then we may be able to remedy this by applying suitable transformations.

The more commonly used transformations for data which are skewed to the right (positive skew) are, in order of increasing strength, sqrt(x), log(x) and 1/x, where the x’s are the data values. All three assume positive data values.

The more commonly used transformations for data which are skewed to the left (negative skew) are, in order of increasing strength, squaring, cubing and exp(x).
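
A quick way to see the effect of these transformations is to compare a skewness measure before and after. This is only a sketch under my own assumptions: the data are made up, skewness is measured with the moment coefficient m3 / m2^1.5, and the reciprocal is applied as -1/x so that the ordering of the values is preserved:

```python
import math


def skewness(values):
    """Moment coefficient of skewness (positive = longer right tail)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 60]   # made-up right-skewed data

transforms = {
    "raw": lambda x: x,
    "sqrt(x)": math.sqrt,
    "log(x)": math.log,
    "-1/x": lambda x: -1 / x,   # negated so larger x stays larger
}

results = {
    name: skewness([f(x) for x in raw]) for name, f in transforms.items()
}
for name, g1 in results.items():
    print(f"{name:<8} skewness = {g1:+.2f}")
```

On this sample the coefficient drops at each step, and the reciprocal overshoots into negative skew, which suggests picking the mildest transformation that gets close to symmetry. The same comparison can be run with squaring, cubing and exp(x) on left-skewed data.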

Relationship between Variables

  1. Both variables are categorical. We analyze an association by comparing conditional probabilities and summarize the data in contingency tables. Examples of categorical variables are gender and class standing.
  2. Both variables are quantitative. To analyze this situation we consider how one variable, called the response variable, changes in relation to changes in the other variable, called the explanatory variable. Graphically we use scatterplots to display two quantitative variables. Examples are age, height and weight (i.e. things that are measured).
  3. One variable is categorical and the other is quantitative, for instance height and gender. These are best compared by using side-by-side boxplots to display any differences or similarities in the center and variability of the quantitative variable (e.g. height) across the categories (e.g. Male and Female).
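
The three cases above can each be summarized with a few lines of code. This is a minimal sketch with made-up data and hypothetical helper names. Pearson's correlation is my own choice of numerical summary for the quantitative pair (the notes only mention scatterplots), and the five-number summary is what a side-by-side boxplot would draw:

```python
import statistics
from collections import Counter


def conditional_proportions(pairs):
    """Case 1: proportion of each column category within each row category."""
    cells = Counter(pairs)
    row_totals = Counter(row for row, _ in pairs)
    return {(r, c): n / row_totals[r] for (r, c), n in cells.items()}


def pearson_r(xs, ys):
    """Case 2: strength of the linear relationship between two variables."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def five_number_summary(values):
    """Case 3: min, Q1, median, Q3, max for one group."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return (min(values), q1, median, q3, max(values))


# Case 1: gender and class standing (made-up observations).
pairs = [("F", "Freshman"), ("F", "Senior"), ("F", "Senior"), ("F", "Senior"),
         ("M", "Freshman"), ("M", "Freshman"), ("M", "Senior"), ("M", "Senior")]
props = conditional_proportions(pairs)
print(props)

# Case 2: height (explanatory) and weight (response), made-up values.
heights = [150, 160, 170, 180]
weights = [52, 58, 69, 77]
r = pearson_r(heights, weights)
print("r =", round(r, 3))

# Case 3: height by gender, made-up values.
heights_by_gender = {"F": [155, 160, 162, 165, 170],
                     "M": [168, 172, 175, 180, 185]}
summaries = {g: five_number_summary(h) for g, h in heights_by_gender.items()}
print(summaries)
```

Quartiles depend on the convention used; `statistics.quantiles` defaults to its "exclusive" method, so other tools may report slightly different Q1 and Q3 values.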

Representation (Visualization) examples

  • Single Categorical Variable
    • Frequency Table
    • Bar chart
    • Pie Chart
  • Two categorical variables
    • Two-way table
    • Side by side bar chart
    • Segmented Bar chart
  • Quantitative Variables
    • Dot plot
    • Histogram
    • Box Plot (5 Number summary)
    • Line graph
  • Multiple Quantitative Variables
    • Scatterplot

Disclaimer : These are my study notes – online – instead of on paper so that others can benefit. In the process I’ve used some pictures / content from other original authors. All sources / original content publishers are listed below and they deserve credit for their work. No copyright violation intended.

References for these notes :

The study material for the MOOC “Making sense of data” at Coursera.org

http://www.stats.gla.ac.uk/steps/glossary/index.html

http://sites.stat.psu.edu/~ajw13/stat200_upd/02_quantrel/01_quantrel_intro.html
