======================================================================
=                 Home : on reddit time travel                       =
======================================================================

So I don't bury the lede:

Here is the Reddit Time Machine

Reddit suffers from a problem that most social media platforms do: you can only see the community through the lens of today, never through the lens of last week or a year ago. It has "Top of (All Time|This Year|This Month|etc.)" views, but those don't go nearly far enough, because it's still "this" year and "this" month, not any year or any month.

For users, this isn't normally a problem: most people are only interested in the latest content, and most subreddits have a low enough signal-to-noise ratio that most of their content isn't worth looking back at anyway. But some subreddits are much smaller and have a more tight-knit community, where almost all of the content is worth looking at. Small hobbyist subreddits, for example, often have interesting content strewn throughout their lifetimes, and much of it gets lost among Reddit's "Top" variations.

Luckily, this problem is solvable thanks in large part to Jason Baumgartner and his work on pushshift.io [1]. He downloaded and made available every public submission to every public subreddit for the entirety of Reddit's history, which totals more than 538 million posts (according to my local copy of the data). With all this data, it's possible to reorganize it to answer the question: "what did this subreddit look like on this day?"

Of course, it's not entirely that easy. Reddit sorts its "Hot" view of a subreddit with a special algorithm whose output changes all the time. It factors in things like the number of upvotes a post has, how long it's been since it was posted, and more, which means it's impossible to pre-compute this view for all points in time. Instead, it's a lot easier to simulate the New page, which means the question we're really answering is "what did the /new/ page of this subreddit look like at this time?".
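
To see why, consider the classic "hot" formula from reddit's old open-sourced codebase (shown for illustration; the live site's ranking has since evolved, but the shape of the problem is the same). The score depends on a post's vote totals at the moment of viewing, and the archive doesn't record how votes accumulated over time. The /new/ page, by contrast, only needs the submission time.

  # the classic "hot" ranking from reddit's old open-sourced codebase;
  # shown for illustration, the live site's algorithm has since changed
  from math import log10

  def hot(ups, downs, created_utc):
      score = ups - downs
      order = log10(max(abs(score), 1))
      sign = 1 if score > 0 else (-1 if score < 0 else 0)
      # 1134028003 is reddit's epoch offset, taken from the original source
      return round(sign * order + (created_utc - 1134028003) / 45000, 7)

  # replaying Hot needs (ups, downs) as they stood at the chosen moment;
  # replaying /new/ only needs created_utc, which the archive does have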


----------------------------------------------------------------------
-                             the website                            -
----------------------------------------------------------------------

This leads to the work I've done on this problem. So without further delay, here's a link to what I dub the Reddit Time Machine.

Reddit Time Machine

The website works on a per-subreddit basis, so it doesn't currently support multireddits or joining subreddits with plus signs (like "AskReddit+Pics"). The name doesn't have to match capitalization (e.g. you can use "askreddit" instead of "AskReddit"). The date input uses the browser's native widget, so it might look different across browsers; in Chrome, it's "mm/dd/yyyy, hh:mm ss". Regardless, there is a "Random Date" button which picks a date between the start of Reddit and when I downloaded the data (it is not per-subreddit, so it's possible to land on a date before the subreddit was created and see no posts).

Once you select the subreddit and date and then "submit", you'll be taken to a page that looks like the very old Reddit design. Clicking on a self post will take you to that post on Reddit, while a link post (to an article or an image) will take you directly to the link. Clicking the user will take you to their profile, and clicking "X comments" will take you to the comments on Reddit. In other words, everything except the link aggregation is handled by Reddit itself and not by this website.

In addition to the Reddit links, there are also "+/- day/week" links at the top of the page. These modify the date that you gave in the beginning and facilitate navigating, not by pages, but by time.

As far as privacy goes, I've hidden the requested subreddits from the logs. I figure there won't be many people using it at a time, so to reduce the risk of deanonymizing people, it's best to hide that. Also, I just don't want to see what people are looking at.


----------------------------------------------------------------------
-                            how it works                            -
----------------------------------------------------------------------

The backend is a Python HTTP server, using the http.server module. The main view of the website (the Reddit-like page) is rendered with Jinja2 templates. The rest of the code is Python that goes from subreddit, date, limit, and offset to actual post contents, and the logic behind it is the technically interesting part.
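
As a rough illustration of the serving side, here's a minimal sketch, not the actual code: the query parameters, template name, and fetch_posts() are all assumptions.

  # a minimal http.server + Jinja2 setup; the query parameters,
  # template name, and fetch_posts() are assumptions, not the real code
  from http.server import HTTPServer, BaseHTTPRequestHandler
  from urllib.parse import urlparse, parse_qs
  from jinja2 import Environment, FileSystemLoader

  env = Environment(loader=FileSystemLoader("templates"))

  def fetch_posts(subreddit, date):
      return []  # stand-in for the sqlite/seek pipeline described below

  class Handler(BaseHTTPRequestHandler):
      def do_GET(self):
          args = parse_qs(urlparse(self.path).query)
          posts = fetch_posts(args.get("subreddit", [""])[0],
                              args.get("date", [""])[0])
          body = env.get_template("listing.html").render(posts=posts)
          self.send_response(200)
          self.send_header("Content-Type", "text/html; charset=utf-8")
          self.end_headers()
          self.wfile.write(body.encode("utf-8"))

  HTTPServer(("", 8080), Handler).serve_forever()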

The raw data used here is 134 GB compressed, 993 GB uncompressed. This is around 538 million posts, and each one records which subreddit, what time, which author, plus the text of self posts and the URLs of link posts. What this gets you is a set of newline-delimited JSON posts in ID order (so, ordered by the date each was posted to Reddit, but with all subreddits jumbled together).
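
For a sense of the shape, one line looks roughly like this (an abridged, made-up record; the real dumps carry many more fields):

  {"id": "abc123", "subreddit": "AskReddit", "author": "someuser",
   "created_utc": 1420070400, "title": "An example post", "is_self": true,
   "selftext": "...", "url": "...", "num_comments": 42, "score": 100}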

Just indexing this is a challenge, so I had to pick a specific subset of it to index. Thankfully, because the original intention was to "view Reddit from the perspective of this day", there were a few easy columns to deal with: the subreddit and when the post was made. We also need to keep track of which line in the raw data holds each post. To extract this information, I used the jq tool. Ideally, instead of which line holds the post, I'd keep track of which byte range holds it, but jq doesn't have that functionality. So in addition to the raw data, I also need an index file that records the starting byte of each line; by peeking at the next line's entry, we get the byte range.
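
Building that line index can be as simple as the following sketch (the file and table names here are assumptions, not the real schema):

  # build a line -> start-byte index for one month's NDJSON file;
  # the file and table names are assumptions, not the real schema
  import sqlite3, sys

  path = sys.argv[1]                     # e.g. an uncompressed month file
  db = sqlite3.connect(path + ".lineindex.db")
  db.execute("CREATE TABLE lineindex (line INTEGER PRIMARY KEY,"
             " start INTEGER)")

  with open(path, "rb") as f:
      offset = 0
      for lineno, line in enumerate(f, start=1):
          db.execute("INSERT INTO lineindex VALUES (?, ?)",
                     (lineno, offset))
          offset += len(line)
  db.commit()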

To summarize, looking up a particular post goes like this:

  (subreddit, date) --[sqlite query]--> (month, line number)

  (line number) --[sqlite query (month)]--> (start byte, end byte)

  (start byte, end byte, month) --[file seek (month)]--> (data)

First we take the subreddit and date and look them up in a "submissions" database (34 GB). This yields a month/line-number pair. We need both because the raw data is split into YYYY-MM files, so a line number is only unique within that particular file.

Then we need to take this line number and look it up in another "lineindex" database (35 GB). This yields our start and end byte.

Then it's just a matter of opening the month's file, seeking to the start byte, reading the right amount of data, and parsing it as JSON.
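
Putting the three steps together, the lookup ends up as something like this (again a sketch: the table, column, and file names are assumptions):

  # (subreddit, date) -> raw JSON post, following the three steps above;
  # table, column, and file names are assumptions, not the real schema
  import json, sqlite3

  def fetch_post(subreddit, ts):
      # step 1: (subreddit, date) -> (month, line number)
      sub_db = sqlite3.connect("submissions.db")
      month, lineno = sub_db.execute(
          "SELECT month, line FROM submissions"
          " WHERE subreddit = ? AND created <= ?"
          " ORDER BY created DESC LIMIT 1", (subreddit, ts)).fetchone()

      # step 2: (line number) -> (start byte, end byte), from this line's
      # start and the next line's start (a file's last line would need
      # the file size as its end; elided here)
      idx_db = sqlite3.connect("lineindex-%s.db" % month)
      (start,), (end,) = idx_db.execute(
          "SELECT start FROM lineindex WHERE line IN (?, ?) ORDER BY line",
          (lineno, lineno + 1)).fetchall()

      # step 3: seek into the raw month file and parse that one line
      with open("RS_" + month, "rb") as f:
          f.seek(start)
          return json.loads(f.read(end - start))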

There is one more database (122 MB) that converts any capitalization of a subreddit name to the canonical name used in the raw data. This helps when you type "askreddit" but mean "AskReddit". There is also some disabled functionality for fuzzy-matching subreddit names, but it was causing performance problems, so I cut it from the code.
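
That lookup is nothing more than a keyed select (the database, table, and column names here are assumed):

  # map any capitalization to the canonical subreddit name;
  # the database, table, and column names are assumptions
  import sqlite3

  def canonical_name(name):
      db = sqlite3.connect("subreddits.db")
      row = db.execute("SELECT canonical FROM names WHERE lowered = ?",
                       (name.lower(),)).fetchone()
      return row[0] if row else None

  print(canonical_name("askreddit"))  # "AskReddit", if present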


----------------------------------------------------------------------
-                             references                             -
----------------------------------------------------------------------

[1]: Jason Baumgartner runs the website pushshift.io, which collects and makes available lots of Reddit data. He also goes by /u/Stuck_In_The_Matrix on Reddit and has a few posts on the r/datasets subreddit about all the data he's collected.

https://pushshift.io