When your data model is the bottleneck: lessons from Medium’s feature store

How Medium rebuilt its feature store data model on ScyllaDB for low-latency recommendations at 1M OPS, plus a DynamoDB benchmark.

Jun 9th, 2026 10:40am by Cynthia Dunlop

Featured image for: When your data model is the bottleneck: lessons from Medium’s feature store

Macude | Mariana Cuesta for Unsplash+

“Keep readers reading” is the not-so-simple goal of Medium’s recommendations system. To predict what’s most likely to appeal to a particular reader at any given time, Medium continuously processes user activity signals (stories read, recommendations shown, follows, likes, etc.). It then immediately correlates that with the steady stream of new articles, which is estimated at millions per month.

Smart models and good inference logic are required, but that’s not enough. The data must be stored and retrieved quickly enough to remain relevant while the user is browsing. That’s the job of Medium’s feature store. And getting the data model right started to matter a lot as they scaled to 1M operations per second.

Andréas Saudemont, Medium Principal Software Engineer, recently walked through how the team identified the problem and what they built to fix it. If you’d rather watch than read, you have two options: a short version from Monster Scale Summit or an extended follow-up webinar.

The feature store and its role in Medium’s recommendation system

The feature store ties it all together, ingesting user activity and internal events and feeding them to the ML models that power recommendations. It’s what enables customization like the “For You” feed that greets logged-in users.

A screenshot of Medium's "For you" page.

Each feature is a property of an entity, usually a user or a story. Some are simple and static, like whether a user holds a paid membership. Others capture interaction history: which stories a user has read, what content they’ve recently been shown, etc.

The following diagram shows a highly simplified view of the Medium feature store architecture:

A diagram showing a highly simplified view of the Medium feature store architecture.

The problem with a relational features data model

When they built their feature store years ago, Medium used relational features for cross-entity relationships. Unlike regular features, a relational feature can have multiple values for a given entity ID. Each value is defined by a relation ID (the ID of the related entity) and a timestamp recording when the event occurred.

For example, a “story users have read” feature is attached to the story entity type. It relates to the user entity type, and its values indicate whether/when a given user has read that story.

Andréas shared the following schema diagram to explain the concept:

A schema diagram explaining the relational features data model.

Features sit at the center, each attached to an entity type and defined by name, version, and data type. Non-relational features are simply a feature, an entity ID, and a value. Relational features add a relation ID mapping to another entity type, plus the value itself and a timestamp.

This approach proved suboptimal from a data modeling perspective. Since relational features link two entity types, the data ends up split between two tables: one for the entity IDs and one for the values. That means you can’t get both in a single query. The first query retrieves only entity IDs (not their associated values) and relies on ALLOW FILTERING. A second query then runs for each entity ID to fetch its value. “If we have 1000 entity IDs for which we want to fetch values, then we have to run 1000 queries to fetch these values,” Andréas said.
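The fan-out Andréas describes can be counted with a small sketch. The function name and the sample IDs are hypothetical; the snippet only illustrates the access pattern and says nothing about Medium's actual schema or code.

```python
def queries_for_relational_read(entity_ids: list[str]) -> int:
    """Count the queries issued under the two-table pattern described above."""
    lookup_queries = 1               # one filtered query returns entity IDs only
    value_queries = len(entity_ids)  # then one query per entity ID to fetch its value
    return lookup_queries + value_queries


entity_ids = [f"story{i}" for i in range(1000)]
print(queries_for_relational_read(entity_ids))  # 1001
```

The cost grows linearly with the number of entity IDs, which is why the pattern scales badly as lists get longer.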

Overrelying on ALLOW FILTERING made things worse. “This is bad,” Andréas said, referring to monitoring data showing that 90% of rows read via these queries were simply discarded. “This is just data that we don’t need. ALLOW FILTERING should be an escape hatch, not our design pattern.”

Chart showing that overreliance on ALLOW FILTERING led to 90.2% of rows read via these queries being discarded.

The list feature model

So they reinvented their data model and shifted to a list-based feature model. Instead of splitting data across two tables, everything for a given entity lives in one place and is retrieved in a single query.

Like other features, a list feature is defined by its entity type, name, and optional version. What’s different is the value. While a non-relational feature has a single value, such as true or false, a list feature’s value is a collection of items, each containing a value and a timestamp. Item values can be of any data type; the feature store doesn’t enforce consistency within a list.

Diagram explaining the list feature concept.

For example, consider a user’s reading history. The entity is user, the feature name is reading history, the TTL is 6 months. After that TTL is reached, the data is automatically dropped by the database (since older history isn’t useful for recommendations). The list for a given user is a collection of story IDs and the timestamps at which they were read. The same story can appear multiple times, and multiple items can share the same timestamp.
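As a rough illustration, the list feature concept could be modeled like this. The class names, field names, the integer version type, and the 182-day approximation of "6 months" are all assumptions made for the sketch; the article does not show Medium's actual types.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, Optional


@dataclass(frozen=True)
class ListItem:
    value: Any           # any data type; consistency within a list is not enforced
    timestamp: datetime  # when the event occurred


@dataclass(frozen=True)
class ListFeature:
    entity_type: str
    name: str
    ttl: timedelta
    version: Optional[int] = None  # optional, as for other features


# Hypothetical definition of the reading-history feature.
reading_history = ListFeature("user", "reading_history", ttl=timedelta(days=182))

# The list for one user: the same story may appear more than once,
# and several items may share a timestamp.
read_at = datetime(2026, 1, 15, 9, 30, tzinfo=timezone.utc)
user_123_history = [
    ListItem("storyA", read_at),
    ListItem("storyB", read_at),
    ListItem("storyA", read_at + timedelta(days=1)),
]
```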

Example list of a user's reading history, showing a collection of story IDs and the timestamps at which they were read.

The model has to support a range of operations. Create List and Delete List run at most a few times per day. Remove List Items with Value, which lets a reader scrub a specific story from their history so it stops influencing recommendations, runs at 1k-10k per second. Add List Items runs more often still, since every story read and every thumbnail shown to a user generates an event. Get List Items tops the list at 100k-1M operations per second.

Table showing the number of times various operations run per given timeframe.

“The Add List Items, and even more the Get List Items operations, are really the reasons why we need an efficient data store,” Andréas said.

Multiple items, one timestamp

Beyond raw efficiency, the new data model also had to support multiple items with the same timestamp. When Medium shows a user four story thumbnails simultaneously, all four presentation events share the same timestamp, but have distinct story IDs. If this isn’t handled correctly, primary key collisions occur.

The team’s solution was a single list_items table that stores everything.

Screenshot of the code for the list_items table which stores everything.

The partition key combines feature_key and entity_id, keeping all items for a given list together. All of user 123’s reading history is stored in one partition, retrieved in one query. The clustering key concatenates each item’s timestamp with an MD5 hash of its value. The hash is what makes same-timestamp items with distinct values possible.

Relying on MD5 hashes for uniqueness raises its own set of questions, but in practice, the team hasn’t seen collisions. “The values that we are storing are sufficiently distinct, especially when you add the timestamp into the equation,” Andréas said. The table’s clustering order is set to descending so ScyllaDB can optimize for the typical read pattern (most recent N items) rather than leaving the application to sort afterward.
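To make the key design concrete, here is a minimal sketch of such a table and clustering key. The actual schema appears only as a screenshot, so the column names, column types, and the key encoding (zero-padded epoch seconds, a colon, then the hex digest) are assumptions.

```python
import hashlib

# Hypothetical DDL: column names and types are assumptions.
LIST_ITEMS_DDL = """
CREATE TABLE list_items (
    feature_key text,
    entity_id   text,
    item_key    text,   -- "<timestamp>:<md5 of value>"
    value       text,
    PRIMARY KEY ((feature_key, entity_id), item_key)
) WITH CLUSTERING ORDER BY (item_key DESC);
"""


def item_key(timestamp_s: int, value: str) -> str:
    """Clustering key sketch: zero-padded timestamp, then MD5 of the value."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return f"{timestamp_s:010d}:{digest}"


# Four thumbnails shown at the same instant get four distinct keys.
shown_at = 1_780_000_000
keys = [item_key(shown_at, story) for story in ("storyA", "storyB", "storyC", "storyD")]
assert len(set(keys)) == 4

# Zero-padding makes string order match time order, so a descending sort
# (the table's clustering order) puts the most recent items first.
older = item_key(shown_at - 1, "storyA")
assert sorted(keys + [older], reverse=True)[-1] == older
```

Note that under this encoding, the same value at the same timestamp maps to the same key, so such a write would overwrite the existing row rather than create a duplicate.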

TTL to control storage costs

Storage cost is controlled entirely through ScyllaDB’s native TTL, with no cleanup logic required. Every row expires automatically based on its own timestamp plus the feature’s TTL duration. “We don’t have anything to do regarding that,” Andréas said. “Any row for which the TTL is expired will be considered deleted by ScyllaDB.”

At a steady write rate, storage plateaus because expirations offset new insertions. When a feature is retired, its data drains away on its own. “That’s super useful for controlling our storage and usage costs.”
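The per-row expiry arithmetic can be sketched as follows. The function name and the 182-day approximation of a six-month TTL are assumptions. The sketch returns None for items that are already past their lifetime, because a TTL of 0 in CQL means "never expire" and such items should be skipped instead of written.

```python
from typing import Optional

SIX_MONTHS_S = 182 * 24 * 3600  # assumption: "6 months" approximated as 182 days


def row_ttl_seconds(item_ts: int, feature_ttl_s: int, now: int) -> Optional[int]:
    """Remaining lifetime in seconds: item timestamp + feature TTL - now.

    All arguments are epoch seconds or durations in seconds.
    Returns None when the item has already expired.
    """
    remaining = item_ts + feature_ttl_s - now
    return remaining if remaining > 0 else None


now = 1_780_000_000
read_30_days_ago = now - 30 * 86400
print(row_ttl_seconds(read_30_days_ago, SIX_MONTHS_S, now) // 86400)  # 152
print(row_ttl_seconds(now - SIX_MONTHS_S, SIX_MONTHS_S, now))         # None
```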

Chart showing storage usage/costs and number of item insertions against time

Implementing the list operations

Add List Items is a logged batch of INSERTs with atomicity guaranteed: all items land or none do. Each row carries its own TTL calculated from its timestamp, so older items expire sooner. Since items almost always carry a current timestamp, new entries append to the top of the partition, which is exactly where reads will look first.
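A hedged sketch of how such a batch might be assembled is shown below. It only builds the CQL text and bind parameters and does not talk to a database. The column names, the key encoding, and the helper name `build_add_list_items` are assumptions; the `BEGIN BATCH ... APPLY BATCH` form (logged by default) and `USING TTL` are standard CQL.

```python
import hashlib
from typing import Optional

INSERT_CQL = (
    "INSERT INTO list_items (feature_key, entity_id, item_key, value) "
    "VALUES (?, ?, ?, ?) USING TTL ?"
)


def build_add_list_items(
    feature_key: str,
    entity_id: str,
    items: list[tuple[int, str]],  # (timestamp in epoch seconds, value)
    feature_ttl_s: int,
    now_s: int,
) -> Optional[tuple[str, list]]:
    """Return (cql, params) for one logged batch, or None if nothing is left to write."""
    statements, params = [], []
    for ts_s, value in items:
        ttl = ts_s + feature_ttl_s - now_s  # older items get a shorter TTL
        if ttl <= 0:
            continue  # already expired; TTL 0 would mean "never expire"
        digest = hashlib.md5(value.encode("utf-8")).hexdigest()
        statements.append(INSERT_CQL)
        params += [feature_key, entity_id, f"{ts_s:010d}:{digest}", value, ttl]
    if not statements:
        return None
    body = "\n".join(f"  {s};" for s in statements)
    return f"BEGIN BATCH\n{body}\nAPPLY BATCH;", params


now = 1_780_000_000
ttl = 182 * 86400  # assumed six-month TTL
batch = build_add_list_items(
    "user:reading_history", "123",
    [(now - 60, "storyA"), (now - 60, "storyB"), (now - ttl - 1, "tooOld")],
    ttl, now,
)
cql, params = batch
print(cql.count("INSERT"))  # 2
```

Because every statement targets the same partition, the batch stays a single-partition write.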

The code to "Add List Items" - a logged batch of INSERTs with atomicity guaranteed.

Table showing the "list_items" table partition before and after running the Add List Items function.

Get List Items runs as a single-partition SELECT with a minimum timestamp and a row limit. “We run the query on a single partition,” Andréas said. “That’s the maximum efficiency that we can have.” The clustering key handles filtering and ordering directly. Post-processing is not required.
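The read path can be approximated in a few lines. The CQL string reflects the shape described above (one partition, a lower bound on the clustering key, a row limit), but the column names and key encoding are assumptions, and the Python function merely emulates the query against an in-memory partition sorted newest-first.

```python
import hashlib
from itertools import islice, takewhile

GET_LIST_ITEMS_CQL = (
    "SELECT item_key, value FROM list_items "
    "WHERE feature_key = ? AND entity_id = ? AND item_key >= ? "
    "LIMIT ?"
)


def item_key(ts_s: int, value: str) -> str:
    """Assumed key encoding: zero-padded epoch seconds, then MD5 of the value."""
    return f"{ts_s:010d}:{hashlib.md5(value.encode('utf-8')).hexdigest()}"


def lower_bound(min_ts_s: int) -> str:
    """Smallest possible key for a timestamp: the prefix with an empty hash."""
    return f"{min_ts_s:010d}:"


def get_list_items(partition: list[tuple[str, str]], min_ts_s: int, limit: int) -> list[str]:
    """Emulate the SELECT on a partition already sorted in descending key order."""
    bound = lower_bound(min_ts_s)
    recent = takewhile(lambda row: row[0] >= bound, partition)
    return [value for _, value in islice(recent, limit)]


t = 1_780_000_000
rows = [
    (item_key(t - age, story), story)
    for age, story in [(30, "storyA"), (20, "storyB"), (10, "storyC"), (0, "storyD")]
]
partition = sorted(rows, reverse=True)  # mirrors the DESC clustering order

print(get_list_items(partition, min_ts_s=t - 20, limit=2))   # ['storyD', 'storyC']
print(get_list_items(partition, min_ts_s=t - 20, limit=10))  # ['storyD', 'storyC', 'storyB']
```

Since rows are stored newest-first, the scan can stop at the first key below the bound or as soon as the limit is reached.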

The code to "Get List Items" - a single-partition SELECT with a minimum timestamp and a row limit.

The "list_items" table partition before running the "Get List Items" function, and the response received from the function.

Remove List Items with Value is the one operation that couldn’t be reduced to a single query. Because value isn’t part of the primary key, a direct filter isn’t feasible.

Code for the "Remove List Items with Value" function.

A local secondary index built specifically for this case first finds the matching item keys, then a batch DELETE removes them by primary keys.

The code to create a local secondary index which lists items by value.

“Using an index is really faster than a scan because the query is highly selective,” Andréas explained. “We have very few items in a given list that have the same values compared to the total number of items in a list. And thanks to the current structure, using a local secondary index is faster than a global index.”
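The two-step removal can be sketched as follows. The CQL strings use ScyllaDB's local secondary index syntax (the partition key in its own parentheses, followed by the indexed column), but the index name and column names are assumptions. The `ListPartition` class is a hypothetical in-memory stand-in that shows the lookup-then-delete flow; it is not how ScyllaDB implements indexes.

```python
from collections import defaultdict

# Hypothetical statements; names are assumptions.
CREATE_INDEX_CQL = (
    "CREATE INDEX list_items_by_value "
    "ON list_items ((feature_key, entity_id), value);"
)
FIND_KEYS_CQL = (
    "SELECT item_key FROM list_items "
    "WHERE feature_key = ? AND entity_id = ? AND value = ?"
)
DELETE_ITEM_CQL = (
    "DELETE FROM list_items "
    "WHERE feature_key = ? AND entity_id = ? AND item_key = ?"
)


class ListPartition:
    """In-memory stand-in for one list partition plus its value index."""

    def __init__(self) -> None:
        self.rows: dict[str, str] = {}                        # item_key -> value
        self.by_value: dict[str, set[str]] = defaultdict(set)  # value -> item_keys

    def add(self, item_key: str, value: str) -> None:
        self.rows[item_key] = value
        self.by_value[value].add(item_key)

    def remove_items_with_value(self, value: str) -> int:
        keys = self.by_value.pop(value, set())  # step 1: index lookup
        for key in keys:                        # step 2: delete by primary key
            del self.rows[key]
        return len(keys)


partition = ListPartition()
for key, value in [("t4:c", "storyC"), ("t3:b", "storyB"), ("t2:c", "storyC"), ("t1:a", "storyA")]:
    partition.add(key, value)  # placeholder keys, not real hashes

removed = partition.remove_items_with_value("storyC")
print(removed, sorted(partition.rows.values()))  # 2 ['storyA', 'storyB']
```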

The "list_items" table partition before and after running the "Remove List Items with Value" function.

Andréas shared another example. Starting with the original table partition, the goal is to delete all items with the value “storyC.” Using the local secondary index, the system first identifies the two rows containing that value. It then issues two DELETE statements using the item keys from those rows, which removes them from the list. The final operation, removing all list items, is even more straightforward.

“We can just drop the partition,” Andréas said, “and ScyllaDB does its magic. It just deletes all the rows for that partition, which means that it deletes all the items for the given list. And bonus point: it’s atomic. It’s either completing successfully or not changing anything at all.”

The code for the "Remove All List Items" function.

The "list_items" table partition before and after running the "Remove All List Items" function.

ScyllaDB vs. DynamoDB performance

Medium implemented the list operations on top of both ScyllaDB and DynamoDB. The main goal was to benchmark how both databases compared on their actual production data. “Conceptually they are very close,” Andréas noted, “but they have significant differences in how they operate.”

For AddListItems, P50 latencies were low with both databases: ScyllaDB came in under 1.5 ms, DynamoDB under 5 ms. “DynamoDB is extremely fast, not as fast as ScyllaDB, but extremely fast at sub-5ms latency,” Andréas commented. Things got more interesting at the P95 and P99 latencies. ScyllaDB held steady at around 5-6 ms P95s, while DynamoDB ranged from 13-45 ms. ScyllaDB’s P99s were steady single-digit milliseconds, while DynamoDB’s ranged from 40-120 ms.

Graphs showing AddListItem latencies. The blue line is DynamoDB; the purple line is ScyllaDB.

It was a similar story for GetListItems. At P50, ScyllaDB clocked in at 1 ms, DynamoDB at around 3.5 ms. At P95, ScyllaDB held around 5-6 ms while DynamoDB spiked to 30-60 ms. And at P99, ScyllaDB remained at ~30 ms while DynamoDB ranged from 70 ms all the way up to 220 ms.

Graphs showing GetListItem latencies. The top blue line is DynamoDB; the lower purple line is ScyllaDB.

One caveat: DynamoDB was running without an extra caching layer. “We expect that could have a significant impact for DynamoDB because of the high cache hit rate that we are seeing on the list,” Andréas said. “But we don’t have the data yet, so we cannot compare them.” His verdict for now: “ScyllaDB is very fast, with very predictable performance, and that’s super important for us.”

Key takeaways

One pleasant side effect of getting the data model right: Medium is now eager to use ScyllaDB for additional feature store workloads. Before, they were holding back because they didn’t want to build on the shaky relational feature foundation.

Reflecting on the path to this point, Andréas left the audience with this parting advice:

“If you have a suboptimal data model, you will have queries that are slow, that will scale badly. And most likely, you won’t be able to optimize that data model. You will have to define a new data model that will be better. So take time to think about your data model before you start the implementation, because once you have production data using your suboptimal data model, it’s too late.”

Cynthia Dunlop has been writing about software development and testing for much longer than she cares to admit. She's currently senior director of content strategy at ScyllaDB.