December 12th, 2016
Over the years, I've been hard at work writing various algorithms to clean up and refine the Mainline Database in order to make it more useful and accurate. Some algorithms clean up the data directly and others act as an assistant to some sort of human judgement call. As I cleaned, I noticed that current events in the world had a large influence on what data seemed to be coming to the surface for cleaning.
As both the Mainline and our userbase grew, so did the amount of data being poured through the servers for analysis. This data, taken from a 10,000-foot view, started to give a decent picture of the current happenings and trends in the communities where our users hang out. I started to think about how I could quantify and detect interesting 'events' just from the data that we have pouring in.
"Trending" - What does it mean, exactly?
I started visualizing the information on the servers in interesting ways and would see stuff like this:
To a human, it's pretty clear something significant happened around 11/29. There's a clear uptick in the number of primary occurrences, pages and websites this particular Ultralink is found on. Upon manual investigation, I would usually find that some sort of newsworthy event had caused the increase in mentions.
The word "trending" often seems to be used to denote something that's gaining significant momentum and mindshare as the result of some event. Trending lists are a popular feature on many social networking sites and content aggregators which try to figure out what is the hot topic of the moment.
But how does an algorithm detect these kinds of events automatically? There are lots of different ways to perform anomaly detection on various kinds of data. But there isn't any one-size-fits-all detection solution because every data set is its own unique beast with quirks and unique patterns. After cutting the data many different ways and a lot of poking around, I found some methodologies that work pretty well for trend detection in Ultralink data.
Whatta we got?
First, let's describe the data set we're working with. As displayed in the above graph, we have the number of new primary occurrences of an Ultralink across all the content being washed through the Mainline Database, the number of unique web pages that those occurrences are recorded on and the number of unique websites that those occurrences are recorded on. Each Ultralink occurrence has a timestamp indicating when the occurrence was encountered. I decided to slice the data into buckets corresponding to individual days. I thought that initially, focusing on the data day-by-day would be the most useful. We only store the last 31 days of this kind of data at any given moment, so that's what we have to work with.
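The day-by-day slicing described above can be sketched roughly like this. It is only an illustration, not the actual Mainline code: the `(timestamp, page_url)` input shape, the `bucket_by_day` name, and the choice to treat a URL's hostname as its "website" are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone
from urllib.parse import urlparse


def bucket_by_day(occurrences):
    """Group occurrences of one Ultralink into per-day buckets.

    `occurrences` is an iterable of (unix_timestamp, page_url) pairs
    (a hypothetical shape chosen for this sketch).
    Returns {date: {"occurrences": int, "pages": set, "websites": set}}.
    """
    buckets = defaultdict(
        lambda: {"occurrences": 0, "pages": set(), "websites": set()}
    )
    for timestamp, page_url in occurrences:
        # Bucket by UTC calendar day (assumption: the real system may use
        # a different day boundary).
        day = datetime.fromtimestamp(timestamp, tz=timezone.utc).date()
        bucket = buckets[day]
        bucket["occurrences"] += 1
        bucket["pages"].add(page_url)
        # Assumption: a "website" is identified by the URL's hostname.
        bucket["websites"].add(urlparse(page_url).netloc)
    return dict(buckets)
```

Each bucket then yields the three numbers discussed above: the occurrence count, the number of unique pages and the number of unique websites for that day.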
So from that data, I wanted to figure out how to detect those big activity spikes, but there were a few pitfalls I had to be wary of:
- I needed to figure out how to detect a genuine event spike as opposed to noise.
- I needed to take scale into account so popular Ultralinks aren't overly weighted when we see an increase in their metrics.
- I needed to understand the nature of our three different kinds of data points and how they might affect the calculation overall.
What time is it?!
Next, I had to decide what time spans to focus on. Sometimes there are hot stories that flare up really quickly, but then die out as suddenly as they started. In addition, there are sometimes more significant stories that linger in the general consciousness and are mulled over for much longer. I eventually settled on examining time spans of the last 7, 3, 2 and 1 days. So as we examine the data, we will be looking at it in terms of those four 'inflection' points.
In the future, I want to look at much larger time spans and be able to quantify meta-trends over the course of months. But for now, this can give us a good perspective on what is happening at the moment.
I then calculated some additional data to help look into different ways of measuring increase, difference or some kind of change. For each Ultralink, I calculated these metrics for each combination of both data point type (primary occurrences, pages, websites) and timeSpan (1, 2, 3 or 7 days):
- A: Number of unique data points over the entire 31 day span.
- B: Number of unique data points after the inflection point.
- C: Number of unique data points before the inflection point.
- D: Number of unique data points before the inflection point within timeSpan away from the inflection point.
- E: Delta between after and before the inflection point within timeSpan away from the inflection point: B - D
- F: Number of unique data points only contributed after the inflection point: A - C (Note: not equal to B)
- G: Delta between after and the average before the inflection point: B - C/((31 - timeSpan)/timeSpan)
- H: Ratio of increase after the inflection point vs before the inflection point: B/D
- I: Ratio of increase of unique data points only contributed after the inflection point: A/C
- J: Ratio of increase after the inflection point vs the average before the inflection point: B/(C/((31 - timeSpan)/timeSpan))
- K: The historical, per-day average unique data points before the inflection point: C/(31 - timeSpan)
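To make the metric definitions concrete, here is a minimal sketch that computes A through K for one Ultralink, one data point type and one timeSpan. It assumes the 31 days are available as a list of sets (oldest day first), each holding that day's unique identifiers, such as website hostnames. The function names and the way division by zero is handled are assumptions for illustration; the post does not say how those cases are treated.

```python
import math

WINDOW_DAYS = 31  # the post keeps the last 31 days of data


def _ratio(numerator, denominator):
    # Assumption: growth from nothing counts as infinite, and 0/0 counts as 0.
    if denominator == 0:
        return math.inf if numerator > 0 else 0.0
    return numerator / denominator


def compute_metrics(daily_sets, time_span):
    """Compute metrics A-K from 31 per-day sets of unique identifiers."""
    if len(daily_sets) != WINDOW_DAYS:
        raise ValueError("expected exactly 31 daily sets, oldest first")
    if not 0 < time_span < WINDOW_DAYS:
        raise ValueError("time_span must be between 1 and 30 days")

    split = WINDOW_DAYS - time_span  # index of the inflection point
    before_days = daily_sets[:split]
    after_days = daily_sets[split:]
    # The timeSpan-sized window immediately before the inflection point.
    recent_before_days = daily_sets[max(0, split - time_span):split]

    a = len(set().union(*daily_sets))
    b = len(set().union(*after_days))
    c = len(set().union(*before_days))
    d = len(set().union(*recent_before_days))
    # Average "before" count per timeSpan-sized window.
    avg_before = c / (split / time_span)

    return {
        "A": a,
        "B": b,
        "C": c,
        "D": d,
        "E": b - d,
        "F": a - c,
        "G": b - avg_before,
        "H": _ratio(b, d),
        "I": _ratio(a, c),
        "J": _ratio(b, avg_before),
        "K": c / split,
    }
```

For example, if a website set contains 2 unique sites in the first 30 days and 3 on the last day (1 of them already seen), then with timeSpan = 1 we get A = 4, B = 3, C = 2 and F = 2, which shows why F is not the same as B.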
To get a handle on all this new data, I created a quick little visualization tool:
With this tool, I could sort the entire database by these various metrics and observe what kinds of Ultralinks are favored by any particular measurement over a given time span. Now I just needed to figure out how to combine and weight these different values in a way that would suss out real trending events.
Rank and File
I decided to create ranking tables for metrics E, F, G, H, I and J (6) capped to the top 200 for each kind of data point type (3) and time span (4) combination (6 X 3 X 4 ranking tables total). Using an Ultralink's position in individual ranking tables, I calculated a point value (200 - position in the table). After adding each ranking table's point contribution, the total point sums for every considered Ultralink are then sorted from largest to smallest. But not every ranking table should be considered equally and so I started to play around with the weighting of how much each ranking table's point value should contribute to the overall point sum. After some tweaking, I hoped this would produce a pretty good list of trending Ultralinks.
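The ranking and point-sum step might look something like the sketch below. It assumes each ranking table is a plain dict of metric values keyed by a hypothetical Ultralink ID, and that table positions are zero-based, so the top entry earns the full 200 points. The post does not specify either detail, and the weights used in any call are placeholders.

```python
from collections import defaultdict

TABLE_SIZE = 200  # each ranking table is capped to the top 200


def ranking_points(metric_values, table_size=TABLE_SIZE):
    """Return {ultralink_id: points} for a single ranking table.

    `metric_values` maps an Ultralink ID to its value for one
    (metric, data point type, time span) combination.
    """
    ranked = sorted(metric_values, key=metric_values.get, reverse=True)
    # Assumption: positions start at 0, so points run from 200 down to 1.
    return {
        uid: table_size - position
        for position, uid in enumerate(ranked[:table_size])
    }


def trending_scores(tables, weights):
    """Combine weighted ranking points into one sorted list.

    `tables` maps a key such as ("G", "websites", 7) to a metric_values dict;
    `weights` maps the same keys to a weight (missing keys count as 0).
    Returns [(ultralink_id, total_points), ...], largest total first.
    """
    totals = defaultdict(float)
    for key, metric_values in tables.items():
        weight = weights.get(key, 0.0)
        if weight == 0.0:
            continue  # e.g. primary occurrence tables left out of the sum
        for uid, points in ranking_points(metric_values).items():
            totals[uid] += weight * points
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Giving a table a weight of zero is one simple way to drop it from the point sum entirely.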
Of the different kinds of data points, it seemed most useful to focus on the number of unique websites that an Ultralink is found on and, to a lesser extent, the number of unique webpages. The number of primary occurrences of an Ultralink seemed too volatile and favored bigger or more common Ultralinks too heavily. So I decided to not consider primary occurrence numbers in the point sum at all. Even though we might see a huge spike in the number of primary occurrences of an Ultralink, it might be caused by a few pages (or even a single page) that have thousands of references, which shouldn't normally indicate any sort of event significance.
When I first started playing around with the different weighting values, I focused on the 7 day time span because it seemed to be easier to measure success. I reasoned that the 7 day span favors movement that is more permanent and significant than the smaller time spans. For 7 days, I eventually settled on weighting things this way (I'm not giving exact numbers because I expect them to change over time as I get a better feel for the data):
- It seemed like weighting G very heavily made sure that the focus of the result was real and large increases over the historical average.
- Weighting E significantly ensured that large slope increases after the inflection point helped detect if something was 'trending'.
- Weighting J a decent amount ensured that Ultralinks with a smaller scale of data were still considered.
- Lastly, I gave some consideration to H, again to lessen the overweight advantage of some Ultralinks over ones with less data.
But weighting the various ranking contributions wasn't enough to filter out some other interesting patterns that kept causing certain kinds of Ultralinks to appear as trending which, on closer examination, weren't really trending. So I added a few filters to remove Ultralinks that didn't pass some basic tests from consideration:
- The number of unique primary occurrence data points after the inflection point must be higher than the historical per-day average before the inflection point (B > K). This seemed to help filter out overly bumpy and noisy data which could look like a trending spike if not considering the full history.
- The number of unique website data points contributed only after the inflection point must be larger than 2 (F > 2). This ensures that you can't just keep creating pages on the same old websites to pump up the page data.
- The increase of website and web page data points over the average must be at least 200% (J > 2). This makes sure that there is a bare minimum slope that needs to happen to get us to start paying attention.
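The three filters above could be expressed as a single predicate along these lines. This is a sketch that assumes each data point type's metrics are available as a dict keyed by the metric letters; the `passes_filters` name and the argument layout are made up for the example.

```python
def passes_filters(occurrence_metrics, website_metrics, page_metrics):
    """Return True if an Ultralink survives the three basic tests.

    Each argument is a dict of metric letter -> value for one data point
    type at one time span (a hypothetical layout for this sketch).
    """
    return (
        # Primary occurrences after the inflection point beat the
        # historical per-day average (B > K).
        occurrence_metrics["B"] > occurrence_metrics["K"]
        # More than 2 websites contributed only after the inflection (F > 2).
        and website_metrics["F"] > 2
        # Websites and web pages both show the minimum slope (J > 2).
        and website_metrics["J"] > 2
        and page_metrics["J"] > 2
    )
```

An Ultralink that fails any one test is dropped before the point sums are compared.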
All together, this produced a really nice list of Ultralinks that did a good job of identifying what I as a human would judge to be trending based on the data given. I was pretty happy, but I noticed that applying the same weightings over a time span of just 1 day didn't give nearly as good results. I expected my weightings to perform equally well over any given time span, but after some more exploration and experimentation, I realized why different time spans need different weightings.
What a difference a day makes
Within a span of 24 hours, a major trending idea has usually not burned out completely yet. It still has to make all the rounds and people need to get reaction, analysis and acknowledgment out of their systems. I found that for the 1 day time span, focusing much more heavily on J, the ratio of increase over the average, seemed to do a better job of teasing out the ideas/topics/people/things/etc. that have just begun to ignite. G, E and H are all now given a moderate amount of influence and combine to keep things in check and defend against random spikes that don't represent true momentum. I also dialed down the amount of influence that the web page data points contribute to the entire calculation.
I now had a pretty good trending list for the past 24 hours, which turned out to be pretty fun to explore, as I could now see what interesting things have been happening while I had my head down in code all day. I modified the weightings for the 2 and 3 day time spans to be an approximate interpolation between the values I picked for the 1 and 7 day time spans.
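As a rough illustration of that interpolation, the sketch below blends two weight tables linearly by time span. Straight linear interpolation is an assumption (the real values are only described as approximate), the function name is hypothetical, and any numbers passed in are placeholders, since the actual weights are deliberately not published here.

```python
def interpolate_weights(weights_1_day, weights_7_day, time_span):
    """Linearly blend the 1 day and 7 day weights for a span in between.

    Both weight dicts must share the same metric keys (e.g. "E", "G", "H", "J").
    """
    if not 1 <= time_span <= 7:
        raise ValueError("time_span must be between 1 and 7 days")
    # 0.0 at 1 day, 1.0 at 7 days.
    fraction = (time_span - 1) / (7 - 1)
    return {
        metric: weights_1_day[metric]
        + fraction * (weights_7_day[metric] - weights_1_day[metric])
        for metric in weights_1_day
    }
```

With placeholder weights of J = 4 at 1 day and J = 1 at 7 days, the 3 day span would land a third of the way between them.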
In general, it seems like the larger the time span, the more accurate the results. This makes sense as there is much more data to work with in larger time spans. At the top though, the listings seem to be a pretty good bead on what we are seeing in the world from the perspective of our users as well as data we are now proactively crawling and analyzing. Which brings us now to a discussion of where we get our data as opposed to simply looking at what our data is.
Data Sources/I accidentally a search engine?
As mentioned before, I saw all this cached data in the Mainline Database and noticed that we could use it to make some interesting inferences. That data is produced from all the content that comes into the Mainline for analysis. This includes users of our browser extensions, site plugins and various API clients. Not included for analysis is content originating from porn sites (for obvious reasons), other sites which might not normally be publicly accessible, or sites that have a large potential to poison/overly influence results. This means that it isn't necessarily a representative slice of the internet at large, but just the main stuff our users are surfing to. However, for the last few months I have been rapidly expanding the scope and depth of what data comes into Mainline.
A while ago I wrote a web crawler that was meant to pre-cache Ultralink results for the websites our users surf the most. If we had a good hunch that our users were going to regularly visit articles on the New York Times website, then we could send the crawler out first to pre-calculate the results for new articles before our users came to read them. This made sure that the Ultralink results would already be ready and waiting for them. In practice, when the results of a fragment are pre-cached, it cuts down Ultralink load times from something like 1s or 2s (fragment size and load can make this vary a lot) down to less than 250ms.
I didn't actually put the crawler into use at that point, but when I started thinking about trending calculation I realized I could use the crawler to significantly increase the breadth of data that the trending calculations could work with in addition to improving performance on those specific sites.
Start spreading the news
So now I needed to figure out how to become aware of new content to crawl as it's being published. I started creating a list of news sites that would be useful to analyze for an English-speaking, American perspective (to start with). I initially investigated consuming RSS feeds because they have pretty simple to understand standards and decent adoption. Unfortunately, RSS has been gradually falling out of favor because Twitter has essentially eaten its lunch. Which brings me to Twitter.
Pretty much every content publisher these days has a presence on Twitter because it's an easy and instant conduit to a large swath of readers and fans. I thought I might be able to use Twitter as a one-stop shop for instant content notifications so I created a Twitter account called Ultralink News. This account follows other accounts of major news outlets which combine to produce an up-to-date feed of breaking news stories, deep analysis and everything in between. I then wrote a daemon that connects to Twitter's Firehose API and gets an instant notification every time a tweet hits the Ultralink News timeline. The daemon then queues that page up for the crawler to go out and analyze the content on the tweeted web page. This means that from the time the article author hits "publish" it's only a few seconds before the new content is analyzed and washed through our Mainline database.
Any time I find a new information source which looks like it might be useful to have analyzed and pre-cached, all I have to do is have the Ultralink News account follow the corresponding Twitter account and the content gets added to the incoming stream. As the daemon drinks from the firehose, this allows us to analyze content right as it's published and keeps the trending analysis up-to-date.
I then added an additional feature which causes the Ultralink News account to automatically tweet out the most recent top trending Ultralink for the past day and the past week. Please give it a follow and tweet at it if you have any suggestions of other good accounts that it should follow.
Where to go from here
One thing I would like to start looking into is much smaller trending timespans. I would like to be able to more quickly detect when something is starting to catch fire and be made aware right at the start of the upswing.
Would you like to take advantage of Ultralink's content analysis features for your own site? Would you like to be able to programmatically tap into our trending calculation results? Create an Ultralink Account and start using our REST API! It's free and doesn't have any usage caps on it. Want some functionality, but don't see how to do it? Hop into our Ultralink Community Slack instance and ask us! We're happy to create new API endpoints or optimize others to support any cool projects you might want to create using the Ultralink Mainline.