I’ve come up with some ideas for the database design, but I don’t know if any of them would be practical in production:
1. A `posts` table and a separate `post_edits` table holding a copy of every previous revision. Reading a single post is completely transparent, as if the `post_edits` table weren't there. Nothing wrong with this, but it looks amateurish to have two nearly identical tables and to simply copy rows between them. One benefit is that you could instantly drop `post_edits` if you ever needed to for some reason. (See the first sketch after this list.)
2. A single `posts` table with everything in a self-referencing chain. To read a post, you'd filter for all rows with a certain slug or something, then order by most recent datetime to find the "real" post. On first thought this seemed kinda clever, but on second thought it looks really messy: how would you get a list of actual posts? Wouldn't there need to be an extra column just to keep track of which rows are "real"? Pushing the "real" posts to Elasticsearch or tracking them in Redis could solve this, but it would be complicated to manage and performance wouldn't be optimal.
3. A `post_meta` table and a separate `post_data` table joined onto it. Whenever a revision is made, a new row is added to `post_data`. To get a list of posts: for each row from `post_meta`, select the latest referencing row from `post_data` and join it. On the surface it looks more professional than 1., but it'd be at least 2x slower, and you don't really get any advantages from it. (The second sketch after this list shows what that read query would look like.)
How does Stack Exchange implement this? What does their database design look like? If they store diffs instead of full copies, is there a Python implementation to encode/decode between two versions and their diff?
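For reference on that last point, the standard library's `difflib` can already round-trip a line-level diff, so the encode/decode step itself seems easy; whether this is what a production system would actually store is part of what I'm asking:

```python
import difflib

old = "First version of the post.\nAnother line.\n".splitlines(keepends=True)
new = "Edited version of the post.\nAnother line.\n".splitlines(keepends=True)

# Encode: store the diff instead of a full copy of the old revision.
diff = list(difflib.ndiff(old, new))

# Decode: difflib.restore rebuilds either side from an ndiff delta.
assert "".join(difflib.restore(diff, 1)) == "".join(old)
assert "".join(difflib.restore(diff, 2)) == "".join(new)
```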