Business Intelligence – Always intelligent decisions?

In order to survive, businesses need to know their customers. This goes all the way back to the local shopkeeper, who knew not only the regular shopping basket of their customers, but the status of their marriages and how many drinks they treated themselves to on the weekend.

The same is as true today as it was then, only now businesses are faceless corporations that focus on building up a workforce of faceless, replaceable employees who interact as briefly as possible with customers to move more of them through the shop faster. Instead of building up individual relationships with customers and suppliers, businesses can now understand their customers through the accumulation of data.

In an age where more and more transactions are carried out digitally, we are exchanging not only money for goods and services but data along with it. We divulge shopping habits, location data, and data on gender, race and sexuality every day, allowing businesses to build up profiles that give them a far greater insight into the behaviour of customers than the humble busybody shopkeeper could ever have imagined.

In essence, to make use of this data businesses must examine it and turn it from figures and words into usable information. From this information they gain knowledge that can give them a competitive edge in the business world.

One example of the clever implementation of business data is the work of the Cincinnati Zoo. Now famous for shooting a gorilla that has become the face of a movement, the Cincinnati Zoo was once famous for successfully applying a program of clever aggregation and management of data about its customers, using it to design programs, work practices, opening hours and specific membership packages to boost its business.

The Zoo’s new ability to make better decisions about how to optimize operations has led to dramatic improvements in sales. Comparing the six-month period immediately following the deployment of the IBM solution with the same period of the previous year, the Zoo achieved a 30.7 percent increase in food sales, and a 5.9 percent increase in retail sales. (Lauren 2013)

Here we see an exceptional use of smart data analysis to identify customer trends. By monitoring the behaviour of their customers, the Zoo could predict the actions they would take and alter its own behaviour to fit them. What some businesses seek, however, is to completely change the shopping habits of their customers. One such company is Target.

This tale of the shocking accuracy of Target’s business intelligence unit is chronicled in Charles Duhigg’s New York Times article “How Companies Learn Your Secrets”. The Target Corporation is the second-largest discount retailer in the United States, behind Walmart, and is a component of the S&P 500 Index. It generated $73.78 billion in revenue last year alone. But while Target sells nearly every type of consumer product imaginable, it found that for many female customers it was just a place to come and buy cleaning products. Target wanted to break this association with shoppers, and so hired analytics whizzkid Andrew Pole.

Shopping is a habit. There are times when it’s a treat to get a new pair of white sneakers or a fun new video game, but for the most part it is a regimented habit carried out for survival. That’s how the brain views it, and that’s why we become ingrained in the process of buying the same goods in the same place. This habit can be hard to break once formed. One time when we are vulnerable to such a shift, however, is when we have a baby.

At this point in life people are so flustered and vulnerable that they are open to suggestion, and can be swayed in where they shop by simple convenience. If you can get new parents into Target to buy a stroller, then they might as well pick up the diapers there, and the baby formula, and the beer, etc.

With this goal in mind, Pole developed a formula to analyse the shopping habits of expectant mothers, measuring customer purchasing habits against those of mothers who had registered with Target’s maternity department. Their products of choice? Cotton balls, unscented soap and multivitamins.

From examining these purchasing habits, Pole could predict which customers were entering their second trimester and send them a discount booklet with baby products mixed in among regular products: subtle enough not to arouse suspicion, but forward enough to entice the shopper.

The plan backfired, however, when a Target branch got a call from an irate father demanding to know why his teenage daughter had received vouchers for a baby carrier, stroller and formula. Were they encouraging her to get pregnant? The manager made a follow-up call several days later to apologise yet again, only to be met with an apology in return and something along the lines of “I need to learn more about what’s going on in my own house”.

Business intelligence can create smart business opportunities but can also make you look pretty dumb.


Duhigg (2012), “How Companies Learn Your Secrets”, The New York Times. Accessed 12/09/2016.

The Beauty of Master Data Management

In a world with ever more data being generated, and ever more functionality to be attained from that data, the importance of maintaining that data has never been so crucial. Maintaining uniformity across all machines in different countries, time zones and departments can be a nightmare for big businesses. To gain the maximum benefit from this data, and to be able to use it in the most efficient way, a business needs to implement a Master Data Management program.

Master data management (MDM) is a technology-enabled discipline in which business and IT work together to ensure the uniformity, accuracy, stewardship, semantic consistency and accountability of the enterprise’s official shared master data assets. 

Master data is the consistent and uniform set of identifiers and extended attributes that describes the core entities of the enterprise including customers, prospects, citizens, suppliers, sites, hierarchies and chart of accounts.(Gartner, 2013)

Knowing that we need to keep track of our data is a no brainer. But how do we define this data?

There are essentially five types of data in corporations:

Unstructured — This is data found in e-mail, white papers like this, magazine articles, corporate intranet portals, product specifications, marketing collateral, and PDF files.

Transactional — This is data related to sales, deliveries, invoices, trouble tickets, claims, and other monetary and non-monetary interactions.

Metadata — This is data about other data and may reside in a formal repository or in various other forms such as XML documents, report definitions, column descriptions in a database, log files, connections, and configuration files.

Hierarchical — Hierarchical data stores the relationships between other data. It may be stored as part of an accounting system or separately as descriptions of real-world relationships, such as company organizational structures or product lines. Hierarchical data is sometimes considered a super MDM domain, because it is critical to understanding and sometimes discovering the relationships between master data.

Master — Master data are the critical nouns of a business and fall generally into four groupings: people, things, places, and concepts. Further categorizations within those groupings are called subject areas, domain areas, or entity types. For example, within people, there are customer, employee, and salesperson. Within things, there are product, part, store, and asset. Within concepts, there are things like contract, warrantee, and licenses. Finally, within places, there are office locations and geographic divisions. Some of these domain areas may be further divided. Customer may be further segmented, based on incentives and history. A company may have normal customers, as well as premiere and executive customers. Product may be further segmented by sector and industry. The requirements, life cycle, and CRUD cycle for a product in the Consumer Packaged Goods (CPG) sector is likely very different from those of the clothing industry. The granularity of domains is essentially determined by the magnitude of differences between the attributes of the entities within them.(Walter and Haselden, 2006)

This master data is some of the most important data that a company can gather. But as more and more data is generated in the everyday business processes of a company, we need systems in place to process and store this data in an effective way: one that allows those in the company who can benefit from it to have access, and that protects the often sensitive information collected in the day-to-day interactions between companies and consumers.

In order to enable MDM we need to follow processes such as:

ETL: Extract, Transform, Load. A process in database usage, and especially in data warehousing, that extracts data from homogeneous or heterogeneous data sources, transforms it into the proper format, and loads it into the final target database. (Wikipedia 2016)

EAI: Enterprise application integration: the use of software and computer systems’ architectural principles to integrate a set of enterprise computer applications. (Wikipedia 2016)

EII: Enterprise information integration: the ability to support a unified view of data and information for an entire organisation. (Wikipedia 2016)

For a layman breakdown of these terms refer to this handy guide:

Applied in tandem with a strong data governance program, all of these processes come together to store information in a uniform manner: one that accounts for repetition and keeps data consistent, de-duplicated and accurate.
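To make the pipeline concrete, here is a toy ETL pass in Python. Everything in it (the CRM and web feeds, the field names, the master schema) is hypothetical; it is only a sketch of extract → transform → load, not a real MDM implementation.

```python
# Toy ETL pass: extract from two hypothetical feeds, transform both onto a
# uniform master schema, and load de-duplicated records into one store.

def extract():
    # Extract: pull raw records from heterogeneous sources (hypothetical data).
    crm = [{"CustName": "jane doe", "Country": "IE"}]
    web = [{"name": "JOHN SMITH", "country": "ie"},
           {"name": "jane doe", "country": "IE"}]  # duplicate of the CRM record
    return crm, web

def transform(crm, web):
    # Transform: map both feeds onto one consistent schema and format.
    uniform = lambda name, country: {"name": name.title(), "country": country.upper()}
    return ([uniform(r["CustName"], r["Country"]) for r in crm] +
            [uniform(r["name"], r["country"]) for r in web])

def load(records, store):
    # Load: keep one record per natural key, so the master data stays uniform.
    for r in records:
        store.setdefault((r["name"], r["country"]), r)
    return store

master = load(transform(*extract()), {})
print(master)  # two unique master records despite three source rows
```

The de-duplication on a natural key is the part a real data governance program would spend most of its effort on; here it is reduced to a dictionary lookup purely for illustration.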

Source: Gartner (2013). Accessed 26/05/2014.

Walter and Haselden (2006). Accessed 12/09/16.

Wikipedia (2016), “Extract, transform, load”. Accessed 12/09/16.

Wikipedia (2016), “Enterprise application integration”.

Wikipedia (2016), “Enterprise information integration”. Accessed 12/09/16.

Association Analysis

Q1: Lift Analysis
Please calculate the following lift values for the table correlating burger and chips below:

◦ Lift(Burger, Chips)
◦ Lift(Burgers, ^Chips)
◦ Lift(^Burgers, Chips)
◦ Lift(^Burgers, ^Chips)

Please also indicate whether each of your answers would suggest independence, positive correlation, or negative correlation.

Lift(Burger, Chips)
= s(B ∪ C) / (s(B) × s(C))
s(B ∪ C) = 600/1400 ≈ 0.43
s(B) = 1000/1400 ≈ 0.71
s(C) = 800/1400 ≈ 0.57
Lift(B, C) = (600/1400) / ((1000/1400) × (800/1400)) = 1.05
= positive correlation (lift > 1)

Lift(Burgers, ^Chips)
= s(B ∪ ^C) / (s(B) × s(^C))
s(B ∪ ^C) = 400/1400 ≈ 0.29
s(B) = 1000/1400 ≈ 0.71
s(^C) = 600/1400 ≈ 0.43
Lift(B, ^C) = (400/1400) / ((1000/1400) × (600/1400)) = 0.93
= negative correlation (lift < 1)

Lift(^Burgers, Chips)
= s(^B ∪ C) / (s(^B) × s(C))
s(^B ∪ C) = 200/1400 ≈ 0.14
s(^B) = 400/1400 ≈ 0.29
s(C) = 800/1400 ≈ 0.57
Lift(^B, C) = (200/1400) / ((400/1400) × (800/1400)) = 0.88
= negative correlation (lift < 1)

Lift(^Burgers, ^Chips)
= s(^B ∪ ^C) / (s(^B) × s(^C))
s(^B ∪ ^C) = 200/1400 ≈ 0.14
s(^B) = 400/1400 ≈ 0.29
s(^C) = 600/1400 ≈ 0.43
Lift(^B, ^C) = (200/1400) / ((400/1400) × (600/1400)) = 1.17
= positive correlation (lift > 1)
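As a sanity check, the four lift values above can be recomputed in a few lines of Python from the exercise’s raw counts (1,400 transactions in total):

```python
def lift(n_xy, n_x, n_y, n):
    # lift(X, Y) = s(X ∪ Y) / (s(X) × s(Y)); with supports expressed as
    # counts over n, this simplifies to (n_xy × n) / (n_x × n_y).
    return (n_xy * n) / (n_x * n_y)

N = 1400                                  # total transactions
print(lift(600, 1000, 800, N))            # Lift(B, C)   = 1.05   -> positive
print(lift(400, 1000, 600, N))            # Lift(B, ^C)  ≈ 0.93   -> negative
print(lift(200, 400, 800, N))             # Lift(^B, C)  ≈ 0.88   -> negative
print(lift(200, 400, 600, N))             # Lift(^B, ^C) ≈ 1.17   -> positive
```

Working from the raw counts rather than rounded supports avoids the rounding drift that creeps in when 0.43/0.71/0.57 are multiplied together by hand.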

Q2: Lift Analysis
Please calculate the following lift values for the table correlating shampoo and ketchup below:

◦ Lift(Ketchup, Shampoo)
◦ Lift(Ketchup, ^Shampoo)
◦ Lift(^Ketchup, Shampoo)
◦ Lift(^Ketchup, ^Shampoo)

Please also indicate whether each of your answers would suggest independence, positive correlation, or negative correlation.

Lift(Ketchup, Shampoo)
= s(K ∪ S) / (s(K) × s(S))
s(K ∪ S) = 100/900 ≈ 0.11
s(K) = 300/900 ≈ 0.33
s(S) = 300/900 ≈ 0.33
Lift(K, S) = (100/900) / ((300/900) × (300/900)) = 1
= independent (lift = 1)

Lift(Ketchup, ^Shampoo)
= s(K ∪ ^S) / (s(K) × s(^S))
s(K ∪ ^S) = 200/900 ≈ 0.22
s(K) = 300/900 ≈ 0.33
s(^S) = 600/900 ≈ 0.67
Lift(K, ^S) = (200/900) / ((300/900) × (600/900)) = 1
= independent (lift = 1)

Lift(^Ketchup, Shampoo)
= s(^K ∪ S) / (s(^K) × s(S))
s(^K ∪ S) = 200/900 ≈ 0.22
s(^K) = 600/900 ≈ 0.67
s(S) = 300/900 ≈ 0.33
Lift(^K, S) = (200/900) / ((600/900) × (300/900)) = 1
= independent (lift = 1)

Lift(^Ketchup, ^Shampoo)
= s(^K ∪ ^S) / (s(^K) × s(^S))
s(^K ∪ ^S) = 400/900 ≈ 0.44
s(^K) = 600/900 ≈ 0.67
s(^S) = 600/900 ≈ 0.67
Lift(^K, ^S) = (400/900) / ((600/900) × (600/900)) = 1
= independent (lift = 1)

Q3: Chi Squared Analysis
Please calculate the following chi squared values for the table correlating burger and chips below (Expected values in brackets).

◦ Burgers & Chips
◦ Burgers & Not Chips
◦ Chips & Not Burgers
◦ Not Burgers and Not Chips

For the above options, please also indicate whether each of your answers would suggest independence, positive correlation, or negative correlation.

χ² = Σ (Actual − Expected)² / Expected

χ² Burgers & Chips
χ² = (900−800)²/800 + (100−200)²/200 + (300−400)²/400 + (200−100)²/100
= 12.5 + 50 + 25 + 100 = 187.5
= positive correlation (actual 900 is greater than expected 800)

χ² Burgers & Not Chips
= (100−200)²/200 + (300−400)²/400 + (200−100)²/100
= 50 + 25 + 100 = 175
= negative correlation (expected 200 is greater than actual 100)

χ² Chips & Not Burgers
= (300−400)²/400 + (200−100)²/100
= 25 + 100 = 125
= negative correlation (expected 400 is greater than actual 300)

χ² Not Chips & Not Burgers
= (200−100)²/100
= 100
= positive correlation (actual 200 is greater than expected 100)
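The expected values in brackets, and the statistic for the full table, can be verified by deriving each expected count from the table’s row and column totals. A minimal Python sketch:

```python
def chi_squared(obs):
    # Pearson's χ²: Σ (actual − expected)² / expected, where each expected
    # count is (row total × column total) / grand total.
    total = sum(sum(row) for row in obs)
    row_sums = [sum(row) for row in obs]
    col_sums = [sum(col) for col in zip(*obs)]
    return sum((obs[i][j] - row_sums[i] * col_sums[j] / total) ** 2
               / (row_sums[i] * col_sums[j] / total)
               for i in range(len(obs)) for j in range(len(obs[0])))

# Observed burger/chips counts: rows are Burgers/^Burgers, columns Chips/^Chips.
print(chi_squared([[900, 100], [300, 200]]))  # 187.5 for the full table
```

The per-cell contributions are (900−800)²/800 = 12.5, (100−200)²/200 = 50, (300−400)²/400 = 25 and (200−100)²/100 = 100, and the sign of each cell’s (actual − expected) difference gives the direction of the correlation.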

Q4: Chi Squared Analysis
Please calculate the following chi squared values for the table correlating burger and sausages below (Expected values in brackets).

◦ Burgers & Sausages
◦ Burgers & Not Sausages
◦ Sausages & Not Burgers
◦ Not Burgers and Not Sausages

For the above options, please also indicate whether each of your answers would suggest independence, positive correlation, or negative correlation.

χ² Burgers & Sausages
= (800−800)²/800 + (200−200)²/200 + (400−400)²/400 + (100−100)²/100
= 0 + 0 + 0 + 0 = 0
= independent (actual equals expected in every cell)

χ² Burgers & Not Sausages
= (200−200)²/200 + (400−400)²/400 + (100−100)²/100
= 0 + 0 + 0 = 0
= independent (actual equals expected)

χ² Sausages & Not Burgers
= (400−400)²/400 + (100−100)²/100
= 0 + 0 = 0
= independent (actual equals expected)

χ² Not Burgers and Not Sausages
= (100−100)²/100 = 0
= independent (actual equals expected)


Under what conditions would Lift and Chi Squared analysis prove to be a poor algorithm to evaluate correlation/dependency between two events?
Please suggest another algorithm that could be used to rectify the flaw in Lift and Chi Squared?

Both lift and χ² prove to be poor measures of correlation or dependency between two events when there are a large number of null transactions (transactions containing neither of the items of interest), because both are distorted by the size of the null set even though the relationship between the two items has not changed.

Alternatively, one can use a null-invariant measure:
AllConf(A, B)
Jaccard(A, B)
Cosine(A, B)
Kulczynski(A, B)
MaxConf(A, B)
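What these measures have in common is that they are built only from s(A), s(B) and s(A ∪ B), never from the total transaction count, which is what makes them null-invariant. Two of them sketched in Python, using the counts from the ketchup/shampoo table:

```python
from math import sqrt

def kulczynski(n_ab, n_a, n_b):
    # Kulc(A, B) = ½ (P(B|A) + P(A|B)); the total transaction count never appears.
    return 0.5 * (n_ab / n_a + n_ab / n_b)

def cosine(n_ab, n_a, n_b):
    # cosine(A, B) = s(A ∪ B) / sqrt(s(A) × s(B)); the totals cancel, so this
    # is also unaffected by null transactions.
    return n_ab / sqrt(n_a * n_b)

# 100 baskets with both items, 300 with ketchup, 300 with shampoo: the result
# is the same whether the store logged 900 transactions or 9 million.
print(kulczynski(100, 300, 300))  # ≈ 0.33
print(cosine(100, 300, 300))      # ≈ 0.33
```

Adding a million empty baskets would change lift and χ² for this table, but leaves both of these values untouched, since the null count appears nowhere in either formula.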

R. you ready to learn data analytics?

At long last the day has come when the data management and analytics course begins its analytics stream. The first step? An online pirate-themed course laying down the basics of programming in R.

Having mastered the basics, it was time to take on my first real-world challenge: using an R dataset, cleaning the data and producing graphics to illustrate useful trends and information from it.

I chose to use a data set from Kaggle with the votes and population data of the United States from its recent Democratic and Republican primaries. In looking at the data I wanted to focus on the results from key battleground states in this year’s election, i.e.:

Arizona, Colorado, Florida, Iowa, Michigan, Nevada, New Hampshire, North Carolina, Ohio, Pennsylvania, Virginia and Wisconsin.

I calculated the winners of individual counties within these states, then added in some demographic data which I thought would have a bearing on the results in those counties:

mean income, population density, white (non-Hispanic) population, Hispanic population, black population, Asian population, percentage of women, and college degree attainment.

It produced a table like this for the Republican race:


I then created a table for both races in which an “average county” was constructed from each candidate’s victories.
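The underlying computation (pick each county’s winner, then average a demographic across the counties each candidate won) is simple enough to sketch outside R. A minimal Python version with entirely hypothetical counties and a single demographic column, standing in for the real Kaggle data:

```python
# Hypothetical county rows: primary votes per candidate plus one demographic.
counties = [
    {"county": "A", "Clinton": 5000, "Sanders": 3000, "college_pct": 40.0},
    {"county": "B", "Clinton": 2000, "Sanders": 4000, "college_pct": 25.0},
    {"county": "C", "Clinton": 6000, "Sanders": 1000, "college_pct": 45.0},
]

def winner(row, candidates=("Clinton", "Sanders")):
    # The county's winner is simply the candidate with the most votes there.
    return max(candidates, key=lambda c: row[c])

def average_county(rows, field):
    # "Average county" per candidate: mean of a demographic across their wins.
    totals, counts = {}, {}
    for row in rows:
        w = winner(row)
        totals[w] = totals.get(w, 0.0) + row[field]
        counts[w] = counts.get(w, 0) + 1
    return {c: totals[c] / counts[c] for c in totals}

print(average_county(counties, "college_pct"))
# {'Clinton': 42.5, 'Sanders': 25.0}
```

In the real analysis the same grouping is done once per demographic column, which is what produces the “average county” table for each race.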


With this info it was possible to start plotting box plots and graphs to give a better overview of how the candidates in each race fared across a variety of factors. From looking at how the candidates fared with certain demographic groups in their primary races, we can hope to learn something about the strengths and weaknesses they possess going into the general election, and see where both can improve across these states, which will hold the balance of this year’s election.

First I wanted to look at how the candidates fared with the largest electoral group, non-Hispanic whites, in relation to the education of this demographic.


We can see here, even from this small graph, that Hillary Clinton and Donald Trump won similar counties, but that Clinton managed to take those more educated areas that voted for Rubio over Trump, showing she has an advantage over him in demographics with higher college graduation rates.

Next we can look at some box plots to see how much of the vote our candidates procured among our key demographics.


The information here would seem to suggest that Donald Trump enjoys a somewhat similar popularity to Hillary Clinton among black voters. It is important to remember, though, that areas with large ethnic minorities also tend to have a lower number of registered Republicans, who are majority white and can secure wins for candidates in areas demographically opposed to their base.


The same pinch of salt applies to our Hispanic demographics. We should note, though, that Rubio picked up a great deal of the vote share; now that he is out of the race, he leaves behind a base that, while Republican in registration, has deep reservations about Trump as a candidate.

Our best measurement is to look at the candidates’ share of the votes as fractions of the overall numbers, and how these play out among our big demographics, through fraction tables.


These tables show us Clinton has far greater consistency across the demographic spectrum. Trump’s strength lies in lower-income voters, whereas, as we can see, as income increases so clearly does Clinton’s support. Her popularity with college-educated voters and in densely populated areas (cities) is also a distinct advantage. The turnout in the Democratic primaries relative to the Republican ones is another strong factor for Clinton.

Trump battleground total votes: 3,997,874

Clinton battleground total votes: 5,204,921

A difference of 1.2 million votes. Though a small number compared with the numbers who will vote in the general election, it shows there is greater enthusiasm among Democratic base supporters going into the election, and with Clinton’s slight demographic advantages she starts off with a distinct edge.

Irish Population Fusion Table (2011)

In order to create this fusion table it was first necessary to find the appropriate population data and the KML (Keyhole Markup Language) file for the topography of the Irish Republic. By constructing an Excel table with the 2011 Irish census data for each county and uploading it to a Google Fusion Table, it was possible to plot that data over a Google map to show physically where populations were distributed. Because the KML file held the polygonal data for the map, it could be used to create a heat map. By merging both tables in Google Fusion Tables it was possible to integrate the specific county population data with the physical graphic overlay on the map, producing an accurate and aesthetically pleasing heat map of the 2011 population data.

The distribution was broken down so as to capture an even spread of the population across the buckets it was streamed into. By looking at these variances in population density along with other topographical features, we can draw certain conclusions about Ireland and its population distribution.
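That even split can be produced with a simple quantile cut: sort the county populations and place the bucket boundaries so each bucket holds roughly the same number of counties. A Python sketch (the Leitrim, Cork, Dublin and Kildare figures are the census numbers discussed below; the other four counts are placeholders):

```python
def quantile_buckets(values, n_buckets):
    # Upper boundaries of the first n_buckets − 1 buckets, chosen so each
    # bucket holds roughly len(values) / n_buckets items.
    ordered = sorted(values)
    size = len(ordered) / n_buckets
    return [ordered[int(size * k) - 1] for k in range(1, n_buckets)]

# 2011 county populations: Leitrim, Cork, Dublin, Kildare (census figures),
# plus four placeholder counties to fill out the sample.
pops = [31798, 519032, 1273069, 210312, 122516, 145048, 160927, 179860]
print(quantile_buckets(pops, 4))  # [122516, 160927, 210312]
```

Bucketing by quantile rather than by equal-width population ranges is what keeps the heat map legible: with equal widths, Dublin’s outlier size would push nearly every other county into the bottom bucket.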

Dublin is Doublin’


A disproportionate amount of the population is centralised in Dublin. At 1,273,069 people, Dublin has more than twice the population of the next most populous county, Cork (519,032), and forty times that of the least populous, Leitrim (31,798). Cork in turn has over 300,000 more residents than the next largest county, Kildare (210,312).

Hell Over Connacht


The northwest is the least populated area of the country. Noticeably, these counties are also the most isolated from cities, by both proximity and transport links.

The Northern “Sitch-e-ation”


In looking at the counties neighbouring Northern Ireland we can see some interesting distributions.

Cavan is serviced not only by more direct links to Dublin, but also to Enniskillen, the closest large town in the North. Though Monaghan is closer in proximity to Northern Ireland, its proximity is to equally underpopulated areas, poorly serviced by transport links. Donegal, while technically the most isolated county in the Republic, with the worst transport links (no direct roads or rail links), has a significantly larger population than many others. What it lacks in access to the Republic it makes up for in access to Northern Ireland, with direct routes to Derry/Londonderry.

Room For Improvement

With greater, more detailed geographic data comes the ability to make even more detailed fusion tables that can illustrate more nuanced population distributions and take into account even more of the factors that influence them.

For example, by applying data from the 1911 Census we can see the population density and distribution one hundred years prior, and make certain hypotheses about the change in Ireland’s social landscape.

Ireland 1911

By applying a similar weighting system we can see clearly that the Ireland of 1911 had a far different distribution of its population. While still the most populous county, Dublin was nowhere near its current size, and its surrounding counties were at the lower end of the population scale. Conversely, the West, and the South-West in particular, held a greater density of the population.

While comparing these tables to one another provides interesting insights into the change of life in Ireland over the course of that century, with more complete data it would be possible to produce even more interesting information. By tracking internal and external migration of individuals in the intervening censuses, we could follow the movement of families across time, the introduction of new families and residents, and the end of certain bloodlines.

With greater KML information on transportation changes, we could see how these affected the changing populations of Irish counties, for better or worse.