Don’t let anybody fool you; a good bit of what a data scientist does is a glorified form of counting. And if you had a few billion fingers, you yourself could do data science on one hand. In my recent work with Cassandra I’m finding that Cassandra is quite good at counting. As a matter of fact, you can treat Cassandra as a giant, distributed, redundant, “infinitely” scalable counting framework. Thus, for analysis applications that rely heavily upon accumulating counts, Cassandra can be a useful tool. In this and the next blog post we’ll take a look at a couple of big data analytics applications that rely heavily upon counting and we will demonstrate how Cassandra can be put to good use.
Let’s say that you run an online ad company. When you started the company, your customers were happy enough to just have you host their advertising and handle the dirty work. But now, with ever increasing competition, you would provide your customers with feedback analytics so that you can retain your competitive edge. And the first thing that customers want to know is how their advertising is performing. The most straightforward metric for this: conversion rate. The conversion rate is portion of all ad clicks which lead to actual purchases. This number can be represented as a probability from 0 to 1 or as a percentage from 0 to 100%. And the easiest way to estimate the conversion rate, is to count! Basically, you need to keep track of how many times each ad has been clicked and how many times such a click leads to an eventual product purchase.
Here’s how I would put it together in Cassandra. First we create a table to hold all of our counts:
CREATE TABLE conversion_rate ( company text, ad text, product text, count counter, PRIMARY KEY (company, ad, product) )
Let’s say that Apple is one of your most important clients and you hold ads specifically for iPhones, iPads, and iPods. And right now Apple is running a new ad campaign with ads corresponding to each of the above products. We will refer to these ads as iPhoneAd, iPadAd, and iPodAd respectively. When someone clicks on the iPodAd, you account for it by updating Cassandra:
UPDATE conversion_rate SET count=count+1 WHERE company='Apple' AND ad='iPodAd' AND product='NO_PRODUCT';
You might be wondering what
product='NO_PRODUCT' is about. This is the way that we encode ad clicks, regardless of whether or not a purchase was eventually made. Of course, most of the time, a user who clicks an ad will look around for a bit and then leave. But in this case, let’s assume this user goes on to buy an iPod:
UPDATE conversion_rate SET count=count+1 WHERE company='Apple' AND ad='iPodAd' AND product='iPod';
And so it goes; many different ads are shown to many different users. Some users click the ads and of those users, some go on to purchase products. Eventually, the marketing execs of Apple come to us to see how things are going with their new ad campaign. Specifically, they want to know which ads are performing best for their various products. To make this determination, all we have to do is make a single query:
SELECT ad,product,count FROM conversion_rate WHERE company = 'apple'
And the results will look something like this.
ad | product | count ----------+---------------+--------- iPadAd | iPad | 210 iPadAd | iPhone | 3 iPadAd | iPod | 152 iPadAd | NO_PRODUCT | 17533 iPhoneAd | iPad | 7 iPhoneAd | iPhone | 1409 iPhoneAd | iPod | 4 iPhoneAd | NO_PRODUCT | 28131 iPodAd | iPad | 6 iPodAd | iPhone | 2 iPodAd | iPod | 15 iPodAd | NO_PRODUCT | 11299
The results here do not directly represent the conversion rate, but if you’re thinking ahead then I bet you can see that we’re close. Rather than giving the direct counts to the Apple, we first need to divide the count of ad clicks that lead to a purchase by the total number of ad clicks. So for instance, here we see that 17533 people clicked the iPadAd and this lead to 210 iPad purchases. This implies a conversion rate of 210/17533 or roughly 1.2%. If we assemble this into a nice table for the customer, it will like something like this:
Now, at a glance, the customer will be able to immediately see how their ads are performing across all products. For instance, we can see that the iPod ad is underperforming while the iPhone ad is working quite well. But there’s is other interesting information here as well; look how many iPods are being sold through the iPad ad. With some further research, there may be opportunity here to upsale customers to whichever product, iPod or iPad, has a higher margin.
The feedback presented here is invaluable to the customer, and for you, the online ad company, gathering this data was a piece of cake! But there are a couple of things that you should look out for. 1) Be aware that you will get lower throughput with counter updates in Cassandra than with normal updates. Why? Because for each counter increment there must be a corresponding background query to make sure that the counter state is consistent. Since this is done in the background, this has little effect upon latency. 2) Counters can be a little more difficult to maintain. For instance, if growing or shrinking the cluster, care must be taken to make sure that counters remain in a consistent state. However the details of counter table maintenance is another post for another time!
Next up – if you want to go one step further, check out our next post on using Cassandra to build a Naive Bayes customer classifier. Warning: I get pretty mathy with it!