Association Analysis

Q1: Lift Analysis
Please calculate the following lift values for the table correlating burger and chips below:

◦ Lift(Burgers, Chips)
◦ Lift(Burgers, ^Chips)
◦ Lift(^Burgers, Chips)
◦ Lift(^Burgers, ^Chips)

Please also indicate if each of your answers would suggest independent, positive correlation, or negative correlation?

Lift(Burgers, Chips)
= s(B u C) / (s(B) x s(C))
s(B u C) = 600/1400 = 0.43
s(B) = 1000/1400 = 0.71
s(C) = 800/1400 = 0.57
Lift(B, C) = (600/1400) / ((1000/1400) x (800/1400))
= 1.05
= Positive correlation (lift is greater than 1)

Lift(Burgers, ^Chips)
= s(B u ^C) / (s(B) x s(^C))
s(B u ^C) = 400/1400 = 0.29
s(B) = 1000/1400 = 0.71
s(^C) = 600/1400 = 0.43
Lift(B, ^C) = (400/1400) / ((1000/1400) x (600/1400))
= 0.93
= Negative correlation (lift is less than 1)

Lift(^Burgers, Chips)
= s(^B u C) / (s(^B) x s(C))
s(^B u C) = 200/1400 = 0.14
s(^B) = 400/1400 = 0.29
s(C) = 800/1400 = 0.57
Lift(^B, C) = (200/1400) / ((400/1400) x (800/1400))
= 0.88
= Negative correlation (lift is less than 1)

Lift(^Burgers, ^Chips)
= s(^B u ^C) / (s(^B) x s(^C))
s(^B u ^C) = 200/1400 = 0.14
s(^B) = 400/1400 = 0.29
s(^C) = 600/1400 = 0.43
Lift(^B, ^C) = (200/1400) / ((400/1400) x (600/1400))
= 1.17
= Positive correlation (lift is greater than 1)
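
The four lift values can also be computed directly from the raw counts, which avoids rounding each support to two decimals before dividing. The following is a minimal Python sketch, not part of the original coursework; the function names `lift` and `interpret` are made up for illustration, and the cell counts are the ones read off the supports above.

```python
def lift(both, a_total, b_total, n):
    """Lift(A, B) = s(A u B) / (s(A) * s(B)), from raw counts.

    Algebraically this equals both * n / (a_total * b_total), which avoids
    rounding the three supports before dividing.
    """
    return both * n / (a_total * b_total)


def interpret(value, tol=1e-9):
    """Map a lift value to the three labels used in the answers."""
    if abs(value - 1.0) <= tol:
        return "independent"
    return "positive correlation" if value > 1.0 else "negative correlation"


# Cell counts taken from the supports used in Q1 (they sum to 1400).
cells = {("B", "C"): 600, ("B", "^C"): 400, ("^B", "C"): 200, ("^B", "^C"): 200}
n = sum(cells.values())

# Row and column totals are derived from the cells rather than typed in.
row_totals, col_totals = {}, {}
for (a, b), count in cells.items():
    row_totals[a] = row_totals.get(a, 0) + count
    col_totals[b] = col_totals.get(b, 0) + count

for (a, b), both in cells.items():
    value = lift(both, row_totals[a], col_totals[b], n)
    print(f"Lift({a}, {b}) = {value:.2f} -> {interpret(value)}")
```

Working from counts matters here because the lifts are all close to 1, so small rounding differences in the supports show up in the second decimal place of the result.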

Q2:
Please calculate the following lift values for the table correlating shampoo and ketchup below:

◦ Lift(Ketchup, Shampoo)
◦ Lift(Ketchup, ^Shampoo)
◦ Lift(^Ketchup, Shampoo)
◦ Lift(^Ketchup, ^Shampoo)

Please also indicate if each of your answers would suggest independent, positive correlation, or negative correlation?

Lift(Ketchup, Shampoo)
= s(K u S) / (s(K) x s(S))
s(K u S) = 100/900 = 0.11
s(K) = 300/900 = 0.33
s(S) = 300/900 = 0.33
Lift(K, S) = (100/900) / ((300/900) x (300/900))
= 1
= Independent (lift equals 1)

Lift(Ketchup, ^Shampoo)
= s(K u ^S) / (s(K) x s(^S))
s(K u ^S) = 200/900 = 0.22
s(K) = 300/900 = 0.33
s(^S) = 600/900 = 0.67
Lift(K, ^S) = (200/900) / ((300/900) x (600/900))
= 1
= Independent (lift equals 1)

Lift(^Ketchup, Shampoo)
= s(^K u S) / (s(^K) x s(S))
s(^K u S) = 200/900 = 0.22
s(^K) = 600/900 = 0.67
s(S) = 300/900 = 0.33
Lift(^K, S) = (200/900) / ((600/900) x (300/900))
= 1
= Independent (lift equals 1)

Lift(^Ketchup, ^Shampoo)
= s(^K u ^S) / (s(^K) x s(^S))
s(^K u ^S) = 400/900 = 0.44
s(^K) = 600/900 = 0.67
s(^S) = 600/900 = 0.67
Lift(^K, ^S) = (400/900) / ((600/900) x (600/900))
= 1
= Independent (lift equals 1)
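
A lift of exactly 1 is easy to miss when working with two-decimal supports, because a quotient such as .22/(.33 x .66) only comes out near 1. The sketch below, which is an illustration rather than part of the original answer, uses Python's standard `fractions` module to keep the arithmetic exact; the helper name `exact_lift` and the dictionary layout are assumptions made for this example.

```python
from fractions import Fraction


def exact_lift(both, a_total, b_total, n):
    """Lift from raw counts, using exact rational arithmetic."""
    s_ab = Fraction(both, n)
    s_a = Fraction(a_total, n)
    s_b = Fraction(b_total, n)
    return s_ab / (s_a * s_b)


# (joint count, first marginal count, second marginal count), with n = 900,
# taken from the supports listed in Q2.
q2 = {
    "Lift(K, S)": (100, 300, 300),
    "Lift(K, ^S)": (200, 300, 600),
    "Lift(^K, S)": (200, 600, 300),
    "Lift(^K, ^S)": (400, 600, 600),
}

exact = {name: exact_lift(both, a, b, 900) for name, (both, a, b) in q2.items()}
for name, value in exact.items():
    print(name, "=", value)

# The same calculation with supports rounded to two decimals drifts off 1.
rounded_example = 0.22 / (0.33 * 0.66)
print("rounded:", rounded_example)
```

With exact fractions, all four values can be compared to 1 with `==`, so the "independent" verdict does not depend on how the intermediate supports were rounded.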

Q3: Chi Squared Analysis
Please calculate the following chi squared values for the table correlating burger and chips below (Expected values in brackets).

◦ Burgers & Chips
◦ Burgers & Not Chips
◦ Chips & Not Burgers
◦ Not Burgers and Not Chips

For the above options, please also indicate if each of your answers would suggest independent, positive correlation, or negative correlation?

χ2 = Sum of (Actual - Expected)^2 / Expected

χ2 for the whole table
χ2 = (900-800)^2/800 + (100-200)^2/200 + (300-400)^2/400 + (200-100)^2/100
= 12.5 + 50 + 25 + 100 = 187.5
This is far above 3.84, the 5% critical value for 1 degree of freedom, so burgers and chips are not independent. The direction of the correlation for each cell comes from comparing its actual count with its expected count.

χ2 contribution, Burgers & Chips
= (900-800)^2/800 = 12.5
= Positive correlation (actual is greater than expected)

χ2 contribution, Burgers & Not Chips
= (100-200)^2/200 = 50
= Negative correlation (expected is greater than actual)

χ2 contribution, Chips & Not Burgers
= (300-400)^2/400 = 25
= Negative correlation (expected is greater than actual)

χ2 contribution, Not Burgers & Not Chips
= (200-100)^2/100 = 100
= Positive correlation (actual is greater than expected)
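
The expected counts in brackets follow from the row and column totals of the table, and each cell contributes one (Actual - Expected)^2 / Expected term to the statistic. The Python sketch below recomputes both from the four actual counts used in Q3; it is an illustrative sketch, and the function name `chi_squared_cells` and the row/column ordering are assumptions made for this example.

```python
def chi_squared_cells(observed):
    """Return (actual, expected, contribution) for every cell of a table.

    observed is a list of rows of counts. Expected counts assume the row
    and column variables are independent:
    expected = row total * column total / grand total.
    """
    total = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    cells = []
    for i, row in enumerate(observed):
        for j, actual in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            contribution = (actual - expected) ** 2 / expected
            cells.append((actual, expected, contribution))
    return cells


# Rows: Burgers, Not Burgers. Columns: Chips, Not Chips.
burgers_chips = [[900, 100], [300, 200]]
cells = chi_squared_cells(burgers_chips)
chi_squared = sum(contribution for _, _, contribution in cells)

for actual, expected, contribution in cells:
    direction = "positive" if actual > expected else "negative"
    if actual == expected:
        direction = "independent"
    print(actual, expected, contribution, direction)
print("chi-squared:", chi_squared)
```

The statistic itself is a single number for the whole table; the sign of the correlation for a given cell is read from whether its actual count sits above or below its expected count.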

Q4: Chi Squared Analysis
Please calculate the following chi squared values for the table correlating burger and sausages below (Expected values in brackets).

◦ Burgers & Sausages
◦ Burgers & Not Sausages
◦ Sausages & Not Burgers
◦ Not Burgers and Not Sausages

For the above options, please also indicate if each of your answers would suggest independent, positive correlation, or negative correlation?

χ2 for the whole table
χ2 = (800-800)^2/800 + (200-200)^2/200 + (400-400)^2/400 + (100-100)^2/100
= 0 + 0 + 0 + 0 = 0
= Independent

χ2 contribution, Burgers & Sausages
= (800-800)^2/800 = 0
= Independent (actual equals expected)

χ2 contribution, Burgers & Not Sausages
= (200-200)^2/200 = 0
= Independent (actual equals expected)

χ2 contribution, Sausages & Not Burgers
= (400-400)^2/400 = 0
= Independent (actual equals expected)

χ2 contribution, Not Burgers & Not Sausages
= (100-100)^2/100 = 0
= Independent (actual equals expected)

Q5:

Under what conditions would Lift and Chi Squared analysis prove to be a poor algorithm to evaluate correlation/dependency between two events?
Please suggest another algorithm that could be used to rectify the flaw in Lift and Chi Squared?

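Both are poor measures of correlation or dependency between two events when the data contains a large number of null transactions, i.e. transactions that contain neither of the two items. Neither measure is null-invariant, so the count of these unrelated transactions can distort the result.

Alternatively, one can use a null-invariant measure: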
AllConf(A, B)
Jaccard (A, B)
Cosine (A, B)
Kulczynski (A, B)
MaxConf(A, B)
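
To make the effect of null transactions concrete, the Python sketch below computes lift alongside the five measures listed above, first for the burger and chips counts from Q1 and then again after adding 100,000 transactions that contain neither item. It uses the usual textbook definitions of these measures; the function name `measures`, its argument names and the figure of 100,000 are assumptions chosen for illustration.

```python
import math


def measures(both, a_only, b_only, neither):
    """Lift plus five null-invariant measures from a 2x2 table of counts."""
    n = both + a_only + b_only + neither
    sup_a = both + a_only  # transactions containing A
    sup_b = both + b_only  # transactions containing B
    conf_a = both / sup_a  # P(B | A)
    conf_b = both / sup_b  # P(A | B)
    return {
        "lift": both * n / (sup_a * sup_b),
        "all_conf": both / max(sup_a, sup_b),
        "jaccard": both / (sup_a + sup_b - both),
        "cosine": both / math.sqrt(sup_a * sup_b),
        "kulczynski": 0.5 * (conf_a + conf_b),
        "max_conf": max(conf_a, conf_b),
    }


# Q1 counts: A = burgers, B = chips.
base = measures(both=600, a_only=400, b_only=200, neither=200)
# Same data with a made-up 100,000 extra null transactions.
inflated = measures(both=600, a_only=400, b_only=200, neither=100_200)

for name in base:
    print(f"{name:11s} {base[name]:8.3f} {inflated[name]:8.3f}")
```

Only the lift column moves, because it is the only formula here that depends on the total number of transactions; the other five use just the counts of transactions that contain A, B or both.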

R. you ready to learn data analytics?

At long last, the day has come when the data management and analytics course begins its analytics stream. The first step? An online, pirate-themed course laying down the basics of programming in R.

http://tryr.codeschool.com/levels/8/challenges/1

Having mastered the basics, it was time to take on my first real-world challenge: using an R dataset, cleaning the data and producing graphics to illustrate useful trends and information from the data.

I chose to use a data set from Kaggle with the votes and population data of the United States from its recent Democratic and Republican primaries. In looking at the data, I wanted to focus on the results from key battleground states in this year’s election, i.e.:

Arizona, Colorado, Florida, Iowa, Michigan, Nevada, New Hampshire, North Carolina, Ohio, Pennsylvania, Virginia and Wisconsin.

I calculated the winners of individual counties within these states then added in some demographic data which I thought would have a weight on the results in those counties:

Mean income, population density, white (non-Hispanic) population, Hispanic population, black population, Asian population, percentage of women and college degree attainment.

It produced a table like this for the Republican race:

VotesTableRepublican

I then created a table for each party in which an average county was built from the counties each candidate won.

DemocratsAverageCountyVotesTable
RepublicanAverageCountyWinVoteTable
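
The two steps described above, picking a winner for each county and then averaging a demographic figure over the counties each candidate won, can be sketched roughly as follows. This is a plain Python illustration and not the R code used for the post; the county names, candidate names, vote counts and college rates are all made-up placeholders, and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Placeholder rows (county, candidate, votes); not taken from the Kaggle data.
results = [
    ("County X", "Candidate A", 1200),
    ("County X", "Candidate B", 900),
    ("County Y", "Candidate A", 300),
    ("County Y", "Candidate B", 700),
    ("County Z", "Candidate A", 500),
    ("County Z", "Candidate B", 450),
]

# Placeholder demographic column: college degree attainment (%) per county.
college_rate = {"County X": 30.0, "County Y": 20.0, "County Z": 40.0}


def county_winners(rows):
    """Return {county: candidate with the most votes in that county}."""
    best = {}
    for county, candidate, votes in rows:
        if county not in best or votes > best[county][1]:
            best[county] = (candidate, votes)
    return {county: candidate for county, (candidate, _) in best.items()}


def average_by_winner(winners, values):
    """Average a per-county value over the counties each candidate won."""
    grouped = defaultdict(list)
    for county, candidate in winners.items():
        grouped[candidate].append(values[county])
    return {candidate: mean(vals) for candidate, vals in grouped.items()}


winners = county_winners(results)
averages = average_by_winner(winners, college_rate)
print(winners)
print(averages)
```

Repeating `average_by_winner` for each demographic column would give one "average county" row per candidate, which is the shape of the tables shown above.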

With this info it was possible to start plotting box plots and graphs to give a better overview of how the candidates in each race fared across a variety of factors. By looking at how the candidates fared with certain demographic groups in their primary races, we can hope to learn something about the strengths and weaknesses they possess going into the general election and see where both can improve across these states, which will hold the balance of this year’s election.

First I wanted to look at how the candidates fared with the largest electoral group, non-Hispanic whites, in relation to the educational attainment of this demographic.

RepublicanWhiteVsEducationalAttainmentVote
DemocratsWhiteVsEducationalAttainment

We can see here, even from this small graph, that Hillary Clinton and Donald Trump won similar counties, but that Clinton managed to take the more educated areas that voted for Rubio over Trump, which suggests she has an advantage over him in demographics with higher college graduation rates.

Next we can look at some box plots to see how much of the vote our candidates won among our key demographics.

RepublicanBlackVoteBoxPlot
DemocratsBlackVoteBoxplot

The information here would seem to suggest that Donald Trump enjoys a popularity among black voters somewhat similar to Hillary Clinton's. It is important to remember, however, that areas with large ethnic minorities also tend to have fewer registered Republicans, and that those Republicans are majority white, so they can secure wins for candidates in areas that are demographically opposed to their base.

RepublicanHispanicVoteBoxplot
DemocratsHsipanicVoteBoxplot

The same pinch of salt applies to our Hispanic demographics, though we note that Rubio picked up a great deal of the vote share. Now that he is out of the race, his base, while Republican in registration, has deep reservations about Trump as a candidate.

Our best measurement is to look at the candidates' shares of the vote as fractions of the overall numbers, and at how they play out among our big demographics, through fraction tables.

TrumpFractionTable
ClintonFractionTable

These tables show us that Clinton has far greater consistency across the demographic spectrum. Trump's strength lies in lower-income voters, whereas, as we can see, Clinton's support clearly rises as income increases. Her popularity with college-educated voters and in densely populated areas (cities) is also a distinct advantage. The number of voters in the Democratic primaries, relative to the Republican ones, is another strong factor for Clinton.

Trump battleground total votes: 3,997,874

Clinton battleground total votes: 5,204,921

A difference of 1.2 million votes. Though this is a tiny number compared with the number who will vote in the general election, it suggests greater enthusiasm among Democratic base supporters going into the election, and with Clinton’s slight demographic advantages she starts off with a distinct advantage.