With all the Reddit drama and subreddits going dark (and other user-hostile experiments), it got me thinking — how hard would it be to build another site like Reddit? Basically: posts/link sharing, voting, comments, profiles, and a feed. Of course Reddit is a community, not just code, but I’ll focus mainly on the technical side.

Note
I’m of course well aware that the engineering is, in some ways, the "easy part" because it omits the entire network effect. But other sites in a similar space have been relatively well received; there, the main concerns were nitpicks: things like mobile support and whether monetizing creators is really a good idea. There are also good discussions of alternatives on HN. In any case, feel free to treat these articles as a fun "what if?" guide rather than a tutorial.

The goals are:

  • Built for scale — millions of users. (Reddit has something like 50 million daily active users.)

  • Cheap! Well, supporting millions of users is never cheap, but let’s be as frugal as possible.

  • True serverless, where it makes sense.

Neptune is "serverless", but still requires scaling nodes up and down, compared to DynamoDB, where you pay per request, not per node. So I don’t consider Neptune "true serverless".

  • Realtime; avoid batch jobs. Disregarding the actual frontend experience for now, there shouldn’t be batch jobs that run every minute/hour/day that impact the user experience.

Since it’s what I’m most familiar with, let’s build this using AWS with the goal of being as frugal as possible, while still scaling to a multi-million user community. That means no Aurora! But Fargate would still be within reason.

The rest of this article focuses on one of the most important parts: the database architecture and schema of this application.

Posts

The main form of content creation (besides comments) is posts: text, link, or image submissions attached to a particular community feed that people can comment and vote on. Handling scoring and sorting of posts is going to be complicated, but storing and loading a post itself should be easy.

Let’s design the DynamoDB table schema for a Post.

Table 1. Post schema

  • id (PK) (STRING, required): The unique identifier for a post. This could be lots of things (e.g. a UUID), but for now let’s use the first 64 characters of the post title in URL-friendly form. If there’s a conflict on creation, we’ll append a random 3-digit number and retry (see the sketch after this table). (Specifically, not a database sequence number, since those are inherently centralized.)

  • community_id (STRING, required): The community (née subreddit) that this post is associated with.

  • user_id (STRING, required): The user that created the post.

  • created_at (NUMBER, required): The time this post was created, in seconds since epoch.

  • title (STRING, required): The user-provided title that they want on their post.

  • text (STRING, optional, no default): If this is a text post, the user can write something here. For now, just plaintext with autolinking, but later this could support things like simplified Markdown.

  • image (STRING, optional, no default): If this is an image post, this is the URL of the image. For now, just the URL, not captions or galleries, but either of those could be added later.

  • link (MAP, optional, no default): If this is simply a link to another page, this represents the URL that the user specified. The reason this is a map is that we want to bundle other attributes in as well, like the fetched title and possibly microdata.

At least one of text, image, or link must be specified.
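
To make the ID scheme concrete, here’s a minimal Node.js (TypeScript) sketch of the creation flow. The table name Posts and the slugify helper are assumptions for illustration, and error handling is simplified.

```typescript
// Sketch: create a post with a title-derived ID, retrying on collisions.
import { DynamoDBClient, ConditionalCheckFailedException } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// First 64 characters of the title, in URL-friendly form.
function slugify(title: string): string {
  return title.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-|-$/g, "").slice(0, 64);
}

export async function createPost(post: {
  community_id: string; user_id: string; title: string;
  text?: string; image?: string; link?: Record<string, unknown>;
}): Promise<string> {
  let id = slugify(post.title);
  for (let attempt = 0; attempt < 5; attempt++) {
    try {
      await ddb.send(new PutCommand({
        TableName: "Posts", // assumed table name
        Item: { id, created_at: Math.floor(Date.now() / 1000), ...post },
        // Only create the item if this ID isn't taken yet.
        ConditionExpression: "attribute_not_exists(id)",
      }));
      return id;
    } catch (err) {
      if (!(err instanceof ConditionalCheckFailedException)) throw err;
      // Conflict: append a random 3-digit number and try again.
      id = `${slugify(post.title)}-${Math.floor(100 + Math.random() * 900)}`;
    }
  }
  throw new Error("could not find a free post ID");
}
```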

That just about covers a basic schema for the Post base table. But how do we query it?

Post Query Patterns

Get by ID

If the user arrives via Google, they’ll land directly on a particular post. Since the post ID is the partition key, we can trivially look up the record.
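
In code, that’s a single GetItem (reusing the ddb document client from the earlier sketch; the table name is again an assumption):

```typescript
// Sketch: direct lookup of a post by its ID.
import { GetCommand } from "@aws-sdk/lib-dynamodb";

const result = await ddb.send(new GetCommand({
  TableName: "Posts",
  Key: { id: "my-awesome-post-title" },
}));
const post = result.Item; // undefined if no such post exists
```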

List by New

If we want to list all posts for a community in chronological order, there’s no way to do this without a Scan, which naturally we want to avoid for performance/cost reasons.

If we create a Global Secondary Index (GSI) on community_id (PK) and created_at (SK), we can then query by community sorted by when the post was created.

Note
This does lead to hot partitions since some communities are much larger than others, but with instant adaptive capacity, this shouldn’t be a significant issue.
Note
Using a GSI forces us to use eventual consistency, but replication to the index normally completes within a second, which is acceptable for us.
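
A query against that GSI might look something like this (the index name byCommunityCreatedAt is an assumption):

```typescript
// Sketch: newest posts in a community, via the community_id / created_at GSI.
import { QueryCommand } from "@aws-sdk/lib-dynamodb";

const newest = await ddb.send(new QueryCommand({
  TableName: "Posts",
  IndexName: "byCommunityCreatedAt", // assumed GSI name
  KeyConditionExpression: "community_id = :c",
  ExpressionAttributeValues: { ":c": "ComputerScience" },
  ScanIndexForward: false, // newest first
  Limit: 25,
}));
```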

Voting

Let’s design a table with basic voting support for posts.

  • One vote per user

  • Votes can either be +1 or -1

  • All votes are equal

  • Users can switch their vote to +1, -1, or 0 (rescinded)

Table 2. Voting schema

  • entity_id (PK) (STRING, required): This is the post, comment, or other entity that the user is voting on.

  • user_id (SK) (STRING, required): This is the user that’s doing the voting.

  • value (NUMBER, required): This is the value of the vote — 1 for an upvote, or -1 for a downvote.

  • created_at (NUMBER, required): Time when the vote was created. Always good to have.

The composite primary key guarantees there’s at most one vote row per user per entity; if a user votes on the same post again, the new write simply replaces their existing vote instead of creating a second one, which is exactly what we want when someone changes their vote.
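
Casting (or changing) a vote is then a plain PutItem keyed on the entity and the user (a sketch, assuming a table named Voting):

```typescript
// Sketch: record an upvote. A later put with the same keys replaces this item,
// which is how a user switches or rescinds their vote.
import { PutCommand } from "@aws-sdk/lib-dynamodb";

await ddb.send(new PutCommand({
  TableName: "Voting",
  Item: {
    entity_id: "my-awesome-post-title",
    user_id: "user-123",
    value: 1, // +1 upvote, -1 downvote, 0 rescinded
    created_at: Math.floor(Date.now() / 1000),
  },
}));
```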

So, this is useful for tracking a user’s votes, but we mainly want to see the total number of votes on each post as a sum. If we were working with SQL, we could write some kind of aggregation query, but even that would be slow to execute on every request. Instead, we want some kind of table that’s dedicated to tracking sums.

Post Popularity

This table will keep track of the number of upvotes, downvotes, and net vote value for each post.

Table 3. Post Popularity schema

  • post_id (PK) (STRING, required): This is the post that is being voted on.

  • net_votes (NUMBER, required): This is the net number of votes — upvotes minus downvotes. If there are 500 upvotes and 100 downvotes, this would be 400.

  • upvotes (NUMBER, required): This is the count/sum of all the upvotes.

  • downvotes (NUMBER, required): Like upvotes, but for downvotes.

Calculating things like the upvote:downvote ratio can be left up to the application; we only need to store a value if we want to query on it (though I think "sort by controversial" almost certainly uses ratios). We will want to query on net_votes when we query by popularity.

Applying Votes

There are a couple different ways we could apply votes to the Post Popularity table, but my initial inclination is to attach a Lambda trigger to the DynamoDB stream for the Voting table (docs). Basically, every time any kind of write happens in Voting, our little Node.js Lambda function executes and applies the change to the Post Popularity table.

When configuring these triggers, it’s important to decide what information is written to the stream that the trigger reads from. In this case, we want "new and old images", which lets us see the record’s previous values as well as the ones that were just written. This, for example, lets us update the counts correctly whether the user is voting for the first time (no record → +1) or changing their vote (-1 → +1).
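
As a rough Node.js (TypeScript) sketch, the trigger just diffs the old and new vote values and hands the delta to a helper (applyDelta is a name I made up; it’s sketched a bit further down):

```typescript
// Sketch: Lambda trigger on the Voting table's DynamoDB stream.
// Requires the stream to be configured with NEW_AND_OLD_IMAGES.
import { unmarshall } from "@aws-sdk/util-dynamodb";
import type { DynamoDBStreamEvent } from "aws-lambda";
import { applyDelta } from "./applyDelta"; // hypothetical helper, sketched below

export async function handler(event: DynamoDBStreamEvent): Promise<void> {
  for (const record of event.Records) {
    const oldVote = record.dynamodb?.OldImage ? unmarshall(record.dynamodb.OldImage as any) : undefined;
    const newVote = record.dynamodb?.NewImage ? unmarshall(record.dynamodb.NewImage as any) : undefined;

    const postId = (newVote ?? oldVote)!.entity_id as string;
    const oldValue = (oldVote?.value as number | undefined) ?? 0; // no previous record counts as 0
    const newValue = (newVote?.value as number | undefined) ?? 0; // a deleted record counts as 0
    if (oldValue === newValue) continue;

    await applyDelta(postId, oldValue, newValue);
  }
}
```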

Note
Another way we could do this is to just have the application update the Post Popularity table whenever it updates the Voting table, but this has a couple of downsides. One is that if the application encounters an error after Voting is updated, but before/during the Post Popularity update, that update may be lost, whereas the Lambda trigger will automatically retry. Another downside is that we may want to modify the Voting table from multiple codepaths or even multiple applications, and each of them would have to remember to also update the Post Popularity table. Overall, using a trigger here just bolsters our data integrity.
Note
One drawback of using streams is that the data only sticks around for 24 hours. So if something is crashing your Lambda trigger, you have to act fast to fix it! If we were more worried about this scenario, we could introduce an SQS queue between the Voting table’s stream and the Lambda that does the real work.

So we run a Lambda trigger that updates the Post Popularity table. But if there are a lot of people upvoting a post simultaneously, how do we get atomicity? If we’re not careful, two simultaneous updates can cause the table to be updated only once.

DynamoDB supports the concept of atomic counters, which let you run a tiny atomic operation within the scope of a single item. In our case, for an upvote, we want to increment the upvotes field: upvotes = upvotes + 1. So that’s roughly the kind of DDB update that will run inside our Lambda trigger.
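
Here’s a sketch of that applyDelta helper. The delta math covers first-time votes, switched votes, and rescinded votes with a single atomic ADD; the table name PostPopularity is an assumption.

```typescript
// Sketch: apply a vote change to Post Popularity using atomic counters.
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, UpdateCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

export async function applyDelta(postId: string, oldValue: number, newValue: number): Promise<void> {
  // e.g. changing a vote from -1 to +1 gives upDelta = 1, downDelta = -1, net delta = 2
  const upDelta = (newValue === 1 ? 1 : 0) - (oldValue === 1 ? 1 : 0);
  const downDelta = (newValue === -1 ? 1 : 0) - (oldValue === -1 ? 1 : 0);

  await ddb.send(new UpdateCommand({
    TableName: "PostPopularity", // assumed table name
    Key: { post_id: postId },
    // ADD is atomic, so concurrent votes can't clobber each other's updates.
    UpdateExpression: "ADD upvotes :up, downvotes :down, net_votes :net",
    ExpressionAttributeValues: { ":up": upDelta, ":down": downDelta, ":net": upDelta - downDelta },
  }));
}
```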

This covers "what to run", "when to run it", and "how to run it", with regard to Voting table modifications.

Querying by Popularity

It’s reasonable to ask "What are the top posts from the last 7 days in this community?", where "top" means "sorted descending by `net_votes`".

This is a surprisingly tricky question to answer, but let’s get the obvious out of the way first: we need to know which votes correspond to which community. We could query Posts for the last 7 days’ worth of posts, and then do an in-application join by doing a batch lookup into Post Popularity for every post we find, but this will be slow if there are many posts, and it breaks down quickly with longer time ranges (including all time).

So let’s add this field to Post Popularity:

Table 4. Post Popularity addendum schema

  • community_id (STRING, required): This is the community associated with this post.

When the post is created initially, a record in Post Popularity will be created and this field will be populated then. Also, posts can’t change communities, so it’s fine to forget about updating it.

Back to the original problem. We essentially want two orderings — newest to oldest, AND highest to lowest. That’s not quite right — we want highest to lowest, but filtered to a particular time period. Reddit supports numerous time periods — past hour, past 24 hours, past week, past month, past year, and all time.

There are a couple (well, many) naive solutions here. One is to create a GSI on community_id, then do a DDB Query that pulls all results for a community and do the filtering/sorting in the application. That would probably work until a community reached a few thousand posts, and then it would start getting noticeably slower, or at least more expensive. This can be mitigated slightly by using a FilterExpression to drop results outside the requested time period, but that’s barely more efficient, since filtering happens after the items are read and you still pay for the reads.

We could instead create a GSI on the post create time (which we’d need to add to the Post Popularity table first) and then refine our query by that, but then this would cross-cut across all communities, which would be much worse, most likely.

If we had a table/index with columns community_id#date (PK), net_score, and post_id (where community_id#date might be something like ComputerScience#2023-06-12), that would give us an easy way to pull the top N posts from a given day. It wouldn’t let us efficiently derive the top posts for a week, though; since the most popular item on one day could be less popular than the least popular item on another day, we’d really need to pull every post from each of the last 7 days to get an accurate picture. We could add another table/index for by-week values instead of by-day values, but things like "past week" and "past month" are rolling windows; we’re not querying week 23, we’re querying a 168-hour range.


The best solution I’m able to come up with involves clever use of DynamoDB’s TTL feature. This lets you put an expiration timestamp on each record; sometime after that timestamp passes, DynamoDB deletes the record automatically.

Let’s focus on the "past week" use case. We’ll need a table per time window:

Table 5. TopPastWeek schema

  • post_id (PK) (STRING, required): We want to avoid posts showing up in this table twice, and also need a quick way to find this row in order to update the net_score.

  • community_id (STRING, required): We’ll be filtering by community, so we’ll need this.

  • net_score (NUMBER, required): The net score, or something like it, is the main way we’ll be sorting our results.

  • created_at (NUMBER, required): The creation time of the post will give us a little flexibility in how we can use this table.

  • expires_at (TTL) (NUMBER, required): The magic field. If this date is in the past, the record will be deleted soon.
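
For illustration, here’s roughly what seeding a row in this table might look like. Whether it happens at post-creation time or inside a trigger is an implementation detail; the key point is that expires_at is just the creation time plus seven days.

```typescript
// Sketch: seed a TopPastWeek row for a newly created post (names assumed).
import { PutCommand } from "@aws-sdk/lib-dynamodb";

const createdAt = Math.floor(Date.now() / 1000);
await ddb.send(new PutCommand({
  TableName: "TopPastWeek",
  Item: {
    post_id: "my-awesome-post-title",
    community_id: "ComputerScience",
    net_score: 0,
    created_at: createdAt,
    expires_at: createdAt + 7 * 24 * 60 * 60, // TTL: one week after creation
  },
}));
```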

In order to sort by net_score, we’ll want a GSI that looks like community_id (PK), net_score (SK), and then query for net_score in descending order by setting ScanIndexForward=false in our Query.

Caution
While the TTL field is useful for cleaning out old data, it can take days for the data to actually get purged. This is fine, but it means we need to check the field in the application, too! Ignore data that should already have expired. (In fact, we can again use a FilterExpression for this so that expired-but-not-yet-deleted rows never reach the application.)

Any time we update the net_score in Post Popularity, we want that update propagated to the Top tables too. We can implement that using Lambda triggers again, but I won’t cover it further here.

This table covers "past week". What about shorter time periods, like past hour and past 24 hours? I think most communities will be low enough volume that we can reuse this same table for that and just use a FilterExpression to screen out posts that are too old. If this turns out to be a problem, additional tables for these more specific time windows can be created, but think about it — a "Past Hour" table that has a TTL that may take days to be invoked? Not super helpful. (We could also implement our own "cron job" that does a table scan every few hours to clean up old records, but…​that’s definitely over-engineering at this point.)

For periods longer than a week, we can create more tables for those specific time periods. But tables like "TopAllTime" would eventually end up quite large, since every post ever created would end up in it. This could be mitigated by "archiving" a post after a long time (blocking new votes) and running a batch job periodically that purges posts from the TopAllTime table that are archived and not highly upvoted; that is, posts that were never upvoted much and never will be.

Finally, to query for the top posts in the past week for a community, we’ll use the GSI described above and (see the sketch after this list):

  • Provide the community_id we’re querying for

  • Omit the net_score part of the key (we’re not looking for a particular score)

  • Specify ScanIndexForward=false to get the highest-valued posts instead of the lowest-valued.

  • Specify a FilterExpression that rejects items that have expired but not been deleted yet.
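
Putting those four points together, the query might look something like this (the GSI name byCommunityNetScore is an assumption):

```typescript
// Sketch: top posts of the past week for a community.
// Assumes a GSI named "byCommunityNetScore" on community_id (PK) / net_score (SK).
import { QueryCommand } from "@aws-sdk/lib-dynamodb";

const now = Math.floor(Date.now() / 1000);
const top = await ddb.send(new QueryCommand({
  TableName: "TopPastWeek",
  IndexName: "byCommunityNetScore",
  KeyConditionExpression: "community_id = :c", // no net_score condition
  FilterExpression: "expires_at > :now",       // skip rows the TTL hasn't purged yet
  ExpressionAttributeValues: { ":c": "ComputerScience", ":now": now },
  ScanIndexForward: false,                     // highest net_score first
  Limit: 25,
}));
```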


In this article, I’ve covered the following use cases:

  • Writing and reading a particular post

  • Enabling a user to upvote or downvote a post (once!)

  • Being able to efficiently store and load the net score for a post

  • Being able to efficiently load the most recent posts for a community ("browse by New")

  • Being able to efficiently load the top posts for a community by week (and other common time ranges)

I think that’s plenty for now! In the next post I’ll cover comments, which are bound to be even trickier.