# Detecting Reddit Voting Rings Using This Weird Little Data Trick

John Berryman — October 14, 2013 | 4 Comments | Filed in: Analytics, Big Data Tagged: , |

# A wolf in sheep’s clothing

Wow, so apparently the Reddit community has outed (and as of 15 minutes ago, ousted) colleague Doug Turnbull as the vile spammer that he is! Apparently he has been shamelessly promoting his own blog posts! And among his tactics, the use of voting rings. Yes, and even some of my own colleagues succumbed to his pressure to upvote his sales pitches – including this one masked as a post about test driven search relevancy. (Though I must admit… it is an interesting read… and Doug is an eloquent writer)

But not I. No, I’m as white as the wind driven snow, as pure as a mountain stream, as innocent as a newborn babe. And as a matter of fact, this has all got me thinking: How can I help prevent things like this from happening in the future? In particular, how can I help Reddit prevent voting rings from forming in the future? And then it occurred to me: the HyperLogLog counter.

# What is the HyperLogLog counter?

The HyperLogLog counter is a really neat probabilistic data structure that estimates the count of unique elements in a list. For instance, let’s say you have a really popular web page and you want to perfectly count the number of unique visitors. To keep a count of uniques, you have to store every IP address that you ever see. And upon receiving a new IP address, you have to first check that the new IP address has not been run across before, and only then do you increment the site counter. Under the best of situations, the storage and the computation probably scale as O(log(n)). (What would you use? Tries? Skip-lists?)

With the HyperLogLog counter it’s all different. The size of the data structure is determined a priori (thus O(1)) and is a miniscule fraction of whatever data structure you would use from in the example of the previous paragraph. Data insertion and count tallying are super fast and also scales as O(1). But here’s the catch, you can’t actually know the exact count of unique views. Rather, you may only have an estimate of the current count. I encourage you to read more. This article has been the best resource I’ve found. And make sure you try their cool JavaScript demo which really helps to understand how it all works.

# But how can the HyperLogLog counter help beat voting rings??

Let’s turn back to the sad example of disgraced colleague Doug Turnbull. Usually when Doug coerces others to upvote his posts he uses forms of extortion or even threats of physical harm. Naturally I’ve never succumbed to this, but those that do go to Doug’s most recent post (many of which can be found here) and upvote.

But consider what would happen if we created a HyperLogLog counter for every user on Reddit, and any time that a user receives an upvote, we update the corresponding HyperLogLog counter with the id of the user who did the upvoting. Given this setup, here’s how we detect the voting ring. First, for a given user, retrieve the estimated count of unique upvotes from that user’s HyperLogLog counter and let’s label this quantity as `estimated_unique_upvotes`. Next, tally up the total number of actual upvotes…

these things…
across all of that user’s posts. Let’s call this quantity as

`global_total_upvotes`. The probability that a given user is involved in a voting ring will correlate highly with the ratio of these numbers. In other words we can define

``````voting_honesty  =  estimated_unique_upvotes / global_total_upvotes
``````

Think about it. If a user is not in a voting ring, then their upvotes are going to be organically generated from anyone one the world-wide-web that thinks their links are interesting. Because the internet is a big place, this user’s `estimated_unique_upvotes` and `global_total_upvotes` will tend toward the same number and their `voting_honesty` “probability” will tend toward 1. On the other hand, a user wrapped up in seedy underworld of Reddit voting rings will have many more `global_total_upvotes` across all posts than they have `estimated_unique_upvotes`. For this user, the `voting_honesty` will tend toward 0.

# Conclusion

Now we’re talking 1’s and 0’s. Aren’t computer’s supposed to be good with those? Yes! So the method I’ve outlined above should be a scalable way to automatically detect and expel those voting charlatans! Are there issues with this strategy? Sure. But I do think that the `estimated_unique_upvotes` to `global_total_upvotes` ratio is interesting. Help me think; where could this be used outside of Reddit voting rings?

# Update Idea: Throwing Away 90% of Reddit’s Infrastructure and Using HLL Approximations Instead.

Hey, I got another idea! I wonder what reddit’s infrastructure looks like? It’s gotta be hell to scale everything up. The most obviously complex thing is tracking all those up and down votes. Furthermore, the most important function of the infrastructure is in making sure that no one votes for a post more than once – otherwise colleague, Doug Turnbull, would cramp to death setting at the computer and click-upvoting his own posts! Since it’s so hard to scale, why not just rip all that stuff out and replace it with HLL approximations instead.

Here’s how: Each post gets 2 HLL counters, one positive votes and one negative. When a user upvotes or downvotes a post, their id is submitted to the corresponding HLL counter. There is never a chance that a vote can be counted twice, and there’s no need for any of the infrastructure that ensures that no one votes twice.

But there are a couple of obvious drawbacks to this approach. One – the vote is now just an estimate – a guess, and while one average this would work out fine, there would certainly be crappy posts getting promoted and great posts getting penalized. Secondly, a side effect of all this click tracking infrastructure is that you lose the ability to trace your own history! And if you again built infrastructure so that you could see your history, you might as well return back to the way Reddit currently does it. But at the very least this idea is worth remembering when building and scaling a website that has, but doesn’t feature, votes.

## 4 comments on “Detecting Reddit Voting Rings Using This Weird Little Data Trick”

1. John,
Interesting idea indeed. I think you would have to look at the actual data for reddit upvotes to see how well this works in practice. The biggest issue I see is that for extremely random upvoters this value is 1 and for tight knit spam networks this value is 1/# of posts. My guess is that interesting posts on reddit are often upvoted by a smallish community of people, say /r/programming so that “good” posts don’t have a value close to 1 but more in the middle. We would have to see if this “middle of goodness” is close to the 1/# of posts value of spam networks. My guess is the error and area of detectability is small, but you never know. Cool idea!

2. Hey Matt, thanks for your feedback. Yeah… I basically agree. There are issues with the idea as stated. You might want to check out my addition at the bottom. It’s another really cool idea… that’s just not quite there. There’s gotta be something cool to do with this data structure!

3. One issue with this, any group of followers will look like a voting ring. That is, given any publisher’s unique focus, they will attract a set of followers who like to read their posts. That group won’t mimic the behavior of the global set of reddit users as they are following (and upvoting) a based on what they like… So they will appear to be a voting ring even though they aren’t organized in that fashion.

Worse yet, as a publisher you want to attract a set of passionate followers who follow you and enjoy the content you create. This algorithm would punish that by making that seem nefarious.

So, it’s a neat idea and would definitely show a “voting ring”. However, the difference between a voting ring and passionate followers is completely opaque to this algorithm…

4. @Joe I think your points are quite correct. Though it would be interesting to represent the `voting_honesty` ratio as a Cauchy distribution (which presumes that both the numerator and denominator our Gaussian). And then you could identify any outliers. … Granted, though, outliers might just be a crazy fan following. :-/