In the lead-up to the busy holiday season, we were actively reducing the amount of write operations that go to our MySQL databases, particularly writes which occur frequently where alternate solutions are readily available. Ultimately this would improve the reliability of our databases under the heavy loads we were expecting over Black Friday, Cyber Monday and in the lead up to Christmas itself.
One particular case is that of store-level counters. Things like store visitors in a day, views of a product, et. al.
Since forever, we’ve been writing things like daily visitor counts and per-product views directly and immediately to the database. This works OK for light loads, but a busy site can do a lot of needless writing as visitors and crawlers navigate the storefront. This causes unwanted disk IO and replica sync traffic on the database servers, which under extreme circumstances can cause replicas to become uncomfortably behind on their syncs.
The most obvious replacement for this is Redis; an in-memory, optionally disk-backed, single-threaded, advanced key-value store. To the uninitiated, the single-threaded nature seems like a turn off, but it greatly simplifies Redis’ architecture and makes available many fast, atomic operations which can be used to safely and accurately represent the counter data we need.
Redis is essentially a data structure service.
There are many solutions for this dotted around the internet. Ranging from simply counting in Redis and reading that back, to cron-based syncs back into the database. None of the solutions we found fit our specific case.
As usual, our case is a little more complex than most:
While we don’t want to actively write the frequent hits we get to MySQL, we do eventually want to read that data back to show to the Merchant in their control panel. For unique visitors, they’re stored on a daily basis so that specific date ranges can be queried by the Merchant. Continuing this write to the database also reduced the amount of change needed in the rest of the app code since the Merchant data source is unchanged.
We wanted to maintain some semblance of “real-time” updates that writing directly to the database gives. That is, for Merchants which are just starting out, they should be able to visit their storefront once and see their visitor counter go up by one, rather than waiting a day. For well-established Merchants who receive a lot of traffic, they should be able to see the visitor count grow during the day. Both of these give the Merchant a feeling that their Store is “alive” somewhat.
We didn’t want to continuously fill Redis with historical data, because Redis keeps all of its data in memory at all times, meaning capacity on a single Redis node is always limited to how much memory you can give it. Keeping visitor counts from weeks, months or years ago in Redis is a waste of relatively expensive and limited RAM resources.
Our current MySQL and Redis clustering setups have an inherent level of isolation between stores—for a mix of historical and security reasons—making it impractical to write all this to one central location for batch processing later (rather, in our case, the batch process would be awkward to implement and difficult to maintain).
There’s no guarantee that a daily sync job would execute at the right time and within the expected time frame to accurately capture exactly 24 hours of metrics. Maybe the job fails, or starts late, or takes longer than expected, so the ‘counter’ data is skewed for a given time frame because counts kept coming in without getting synced back.
All this means we needed some sort of solution which supports precise daily buckets, doesn’t suffer from concurrency issues (like two concurrent requests resulting in a double-sync of the same counts), can reflect updates more frequently than every 24 hours, doesn’t leave data behind in Redis over time, works within store-level isolation and is able to recover from faults or delays.
The general solution we found using Redis was:
Write counters to a hash for each store, using the date as the hash key (e.g. 20140920) and the count as the value. Specifically, using the
HINCRBYcommand to atomically increment by 1.
Use the return value from
HINCRBY(since it returns the new value) to trigger a “sync every x hits” behaviour.
Set a second, expiring lock key (
SETEX) whenever a sync happens.
Use that expiring lock key to power a “sync every y seconds” behaviour. Specifically, use
SETNX, which will only work if the key did not exist, and we use that fact to trigger a sync, which will
SETEXthe key to make it expire since
SETNXdoes not set an expiration.
Sync by renaming the hash to a random name before processing it to isolate it from new data. Use HSCAN to sync all fields of the hash, which will pick up all data regardless of the date of entry.
The rename can fail if another process caused a sync concurrently, in which case we do nothing because there’s no data.
The rename can also fail in the unlikely event that we collide with another random rename, in which case we try the rename again until it works.
For each date in the renamed hash, merge the numbers into the database (e.g., using
INSERT ON DUPLICATE UPDATEin MySQL)
The lock key, the hash and the rename are the real tricks to this, in tandem they give us correctness and consistency in both low and high frequency situations.
At Bigcommerce, we have two separate clusters of Redis servers; one completely volatile for cache usage, one disk-backed “persistent” redis for data we care about or can’t rebuild. Both are still limited by Redis core limitations such as always-in-memory datasets, but they are managed and accessed differently because of their different uses. In this case, we are using the disk-backed Redis.
In terms of Redis commands, it looks something like the following, where our (adjustable) hit threshold is 50 hits and our time threshold is 60 seconds ( for example’s sake).
Note: An “X” on the left indicates a “visit” to the public-facing app, where multiple commands may be sent to redis during the handling of this visit.
SETEX could be done via a single
EVAL, but is shown in expanded form.
First visitor ever (or first visitor of the day, of the hour, of the minute, etc.)
X HINCRBY s12345:unique_visitors:daily 20140920 1 // returns the current hit count SETNX s12345:unique_visitors:sync 1 // succeeds because no sync has happened SETEX s12345:unique_visitors:sync 60 1 // makes the sync lock key expire RENAME s12345:unique_visitors:daily s12345:unique_visitors:daily_<RANDOM> // isolates data for this sync HSCAN s12345:unique_visitors:daily_<RANDOM> ... // repeat until we have all the data * add results of hscan to MySQL * DEL s12345:unique_visitors:daily_<RANDOM>
A relatively busy store
X HINCRBY s12345:unique_visitors:daily 20140920 1 // returns, say, "48" SETNX s12345:unique_visitors:sync 1 // fails, maybe a sync happened 20 seconds ago X HINCRBY s12345:unique_visitors:daily 20140920 1 // "49" SETNX s12345:unique_visitors:sync 1 // still fails X HINCRBY s12345:unique_visitors:daily 20140920 1 // "50" SETEX s12345:unique_visitors:sync 60 1 // set sync lock, because we reached hit threshold of 50 * sync commands as seen in (1) * X HINCRBY s12345:unique_visitors:daily 20140920 1 // "1" SETNX s12345:unique_visitors:sync 1 // fails ...
Hits across day boundaries get synced eventually
X HINCRBY s12345:unique_visitors:daily 20140920 1 // returns, say, "10" SETNX s12345:unique_visitors:sync 1 // fails, maybe a sync happened 55 seconds ago X HINCRBY s12345:unique_visitors:daily 20140920 1 // "11" SETNX s12345:unique_visitors:sync 1 // still fails X HINCRBY s12345:unique_visitors:daily 20140920 1 // "12" SETNX s12345:unique_visitors:sync 1 // still fails X HINCRBY s12345:unique_visitors:daily 20140921 1 // "1" SETNX s12345:unique_visitors:sync 1 // still fails X HINCRBY s12345:unique_visitors:daily 20140921 1 // "2" SETNX s12345:unique_visitors:sync 1 // succeeds, because let's say the lock just expired * sync commands as seen in (1) results in finding data for both 20140920 and 20140921 *
A store with no hits on a given day still works
X HINCRBY s12345:unique_visitors:daily 20140919 1 SETNX s12345:unique_visitors:sync 1 // fails, for argument's sake * store is idle * X HINCRBY s12345:unique_visitors:daily 20140921 1 SETNX s12345:unique_visitors:sync 1 // succeeds, lock expired during the idle period * sync commands as seen in (1) result in finding data for both 20140919 and 20140921 *
P.S. Since first writing this, we found the
SET command supports flags that could simplify this, but I’m leaving these examples as-is.
So this worked for our case of unique, daily visitors. What about total per-product views?
The same solution works, we just replace the day stamp with a product ID and it all fits in! All the renaming and syncing works just the same.
X HINCRBY s12345:product_views PRODUCT_ID 1 ...