Data Science 101: Using Clusters to Understand Web Traffic

Many factors influence how much traffic a site receives and how engaged its users are. In this post we’ll take a look at one of the many techniques we use to help understand a site’s traffic: clustering.

What Is Clustering?

At a high level, clustering is a machine learning technique that puts similar things into the same bucket. This can be done in a supervised or unsupervised fashion. Supervised clustering (more commonly called classification) is like sorting coins by denomination: you already know exactly what your buckets are. In practice, you’re often dealing with dirty or damaged coins, so the denomination isn’t immediately obvious, which is why you need some machine learning. Unsupervised clustering lumps items together automatically based on how similar they are. Typically you have to specify how many clusters you want the algorithm to produce, and there’s always a possibility that the resulting clusters won’t match any obvious category (for example, your algorithm might say, “Hey, I found a bunch of coins covered in green mud!”).
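
To make the unsupervised case concrete, here is a minimal sketch of k-means, one common clustering algorithm. It is an illustration rather than the method we run in production, and the coin measurements are invented. Note that you tell it the number of clusters up front.

```python
from math import dist

def k_means(points, k, iterations=20):
    """Group points into k clusters; returns (centroids, labels)."""
    # Deterministic start: pick evenly spaced points from the sorted input.
    ordered = sorted(points)
    step = max(1, len(ordered) // k)
    centroids = [ordered[min(i * step, len(ordered) - 1)] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return centroids, labels

# Hypothetical coin measurements: (diameter in mm, weight in g)
coins = [(19.0, 2.5), (19.1, 2.4), (18.9, 2.6),
         (24.3, 5.7), (24.2, 5.6), (24.4, 5.8)]
centroids, labels = k_means(coins, k=2)
```

The algorithm never hears the word "denomination". It only sees that three coins are small and light and three are large and heavy, and it groups them accordingly.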

We use unsupervised clustering to help us figure out what the topic of a site is, and how that topic influences its traffic.

Understanding Internet Usage Patterns with Site Categories

For any site, we get a raw understanding of its traffic from our data panel, but we need to scale that up to the entire Internet-using population. To understand broader Internet usage patterns, it helps to know what kind of site we’re talking about. We figure out what sites are about by categorizing their topics. For instance, our data might tell us that sites about data science get 10 times as much traffic as sites about beanie babies (I wish). This means that if I see the same number of panelists visiting a data science-themed site as a beanie baby-themed site, I can confidently say there are many more data science fans in the wild.
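
One way to picture the scale-up is a per-category multiplier applied to the panel count. The sketch below assumes exactly that; the multipliers and the function name are invented for illustration and are not our actual figures.

```python
# Hypothetical scale-up factors: how many real-world visitors
# one panelist stands for in each category (made-up numbers).
SCALE_BY_CATEGORY = {
    "data science": 5000,
    "beanie babies": 500,
}

def estimate_visitors(panel_visitors, category):
    """Scale a raw panel count up to an estimated population count."""
    return panel_visitors * SCALE_BY_CATEGORY[category]

# Same number of panelists, very different estimated audiences.
data_science_fans = estimate_visitors(40, "data science")
beanie_baby_fans = estimate_visitors(40, "beanie babies")
```

With these made-up factors, the same 40 panelists translate into ten times as many data science fans as beanie baby fans, which is why knowing the category matters.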

While categorizing sites might sound easy (and it is, for a single site), things get difficult when we want to do this for every site on the web. There aren’t enough interns anywhere to do this before the sun burns out.

There are a couple of other things that make categorizing sites by topic tricky. The first is that there are a ton of possible topics. Every word in the dictionary could be a topic, but slicing sites so finely makes it harder to glean useful information (e.g., ‘sports’ is a more useful topic than ‘world series of underwater handstands, 1917’). One way we combat this is with clustering, which is the fancy machine learning term for lumping like things together. As an example, the cluster of sports topics would include things like baseball, football, soccer, and underwater handstands (maybe).

That example hints at the second difficulty with categorizing site topics: a site can have multiple topics and they may not be related in a meaningful way. Contrast espn.com with sportsauthority.com. They’re both about sports, but one is a news aggregator (among other things) and the other is a store. We deal with this issue by letting a site belong to multiple clusters. This is like saying sportsauthority.com looks like a store from one side, and like a sports site from another side.
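
A small sketch of what multiple membership can look like, assuming each site carries an affinity score per cluster and belongs to every cluster whose score clears a threshold. The scores, the threshold, and the helper name are all hypothetical.

```python
# Hypothetical affinity scores between sites and clusters (0 to 1).
affinity = {
    "espn.com": {"sports": 0.9, "news": 0.7, "shopping": 0.1},
    "sportsauthority.com": {"sports": 0.8, "news": 0.1, "shopping": 0.9},
}

def clusters_for(site, threshold=0.5):
    """Return every cluster whose score meets the threshold."""
    return {name for name, score in affinity[site].items()
            if score >= threshold}

# Both sites land in "sports", but each also has a second identity.
shared = clusters_for("espn.com") & clusters_for("sportsauthority.com")
```

Here the two sites overlap only on the sports cluster, while espn.com also sits with the news sites and sportsauthority.com with the stores.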

[Images: two optical illusions, the rabbit and duck and the old woman and young girl]

Identifying Clusters

Now let’s circle back to how we actually identify these topic clusters. We’re not necessarily interested in how you’d group sites based only on browsing their content. Instead, we’re interested in sites that have similar traffic patterns, which also tells us something about what those sites are about.

Take, for example, a site that we know nothing about: foobar.com. From my panel I might notice that people who visit foobar.com are much more likely to visit foo.com and bar.com than those who never go to foobar.com. This tells me two things: 1) foobar.com, foo.com and bar.com are probably about something similar, and 2) these sites probably receive comparable amounts and kinds of traffic. That second piece of information is really important. If I knew how much traffic foobar.com actually receives, I could use that information to better estimate how much traffic foo.com and bar.com receive. A similar statement can be made about links between sites (analyzing links is how Google got started years ago).
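
The "much more likely" comparison can be sketched as a simple ratio, often called lift: the share of foobar.com visitors who also visit a site, divided by the share of everyone else who visits it. The panel below is invented and the function name is hypothetical; a real panel is far larger and noisier.

```python
# Hypothetical panel: each panelist mapped to the sites they visited.
panel = {
    "p1": {"foobar.com", "foo.com", "bar.com"},
    "p2": {"foobar.com", "foo.com"},
    "p3": {"foobar.com", "foo.com", "bar.com"},
    "p4": {"example.org"},
    "p5": {"example.org", "foo.com"},
    "p6": {"example.org"},
}

def lift(target, anchor):
    """How much more likely anchor visitors are to visit target."""
    with_anchor = [s for s in panel.values() if anchor in s]
    without_anchor = [s for s in panel.values() if anchor not in s]
    rate_with = sum(target in s for s in with_anchor) / len(with_anchor)
    rate_without = sum(target in s for s in without_anchor) / len(without_anchor)
    if rate_without == 0:
        return float("inf")  # nobody outside the anchor group visits target
    return rate_with / rate_without

foo_lift = lift("foo.com", "foobar.com")
```

A lift well above 1 suggests the two sites share an audience, which makes them candidates for the same cluster. In this toy panel every foobar.com visitor also visits foo.com, against one in three of the other panelists.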

For our purposes, we generate a lot of clusters from our data sources and then let another layer of machine learning figure out which ones are actually useful. This means that these clusters are just one subset of the features (another fancy machine learning term for a variable or attribute or, typically, a column in a spreadsheet) we use to estimate various traffic metrics. How we use these features and how we let an algorithm pick which ones to use are topics for another time.
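
To show what "clusters as features" can mean in spreadsheet terms, here is a sketch that turns cluster memberships into one 0/1 column per cluster. The assignments and names are hypothetical, and the later step that decides which columns are useful is deliberately left out.

```python
# Hypothetical cluster assignments from an earlier clustering step.
site_clusters = {
    "espn.com": {"sports", "news"},
    "sportsauthority.com": {"sports", "shopping"},
}

def cluster_features(site, all_clusters):
    """Return one 0/1 value per cluster, in a fixed column order."""
    memberships = site_clusters.get(site, set())
    return [1 if name in memberships else 0 for name in all_clusters]

# Column order is alphabetical so every row lines up.
columns = sorted(set().union(*site_clusters.values()))
rows = {site: cluster_features(site, columns) for site in site_clusters}
```

Each row is one site and each column is one cluster, so these values can sit alongside any other features in the same table.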

In sum, topic clusters help us understand broad, Internet-wide usage patterns. They help us determine whether a site is the kind that people go to every day to keep up with the latest news or the kind they check out once a month. Despite being based solely on people’s browsing behavior, these clusters have distinct subjects like “sports” or “tech news”. We’ll dive into how these clusters and other features get incorporated into our models in future blog posts.

Until then, read more about what it’s like to be a data scientist in our post, “Understanding Data Science and Why It’s So Important.”
