In September 2017, we hosted the SydPHP Laravel meetup at our BigCommerce offices in Sydney. Here are the slides for the talk I gave on how we have been scaling our infrastructure.
The slides are not very useful by themselves, so below you'll find more context.
PHP from an Operational Point of View
The diagram on this slide provides a very simplistic overview of some of the various architectural layers that make up our service. The rest of the slides go into more detail on each layer.
This is a simplistic view of one of our application servers. Requests are received by nginx for routing to the applications.
- Request received by nginx
- Lua: look up the details of a store in redis (cached)
- Nginx routes requests to one of our PHP-FPM instances
- BigCommerce App - The bulk of the code for our service (currently...)
- Config Service - Manages store configuration
- Shipping Service - Go and get quotes from shipping providers: FedEx, DHL, etc… (This is currently in the process of being migrated to our new containers infrastructure… hopefully covered in a future talk/blog post)
- Twemproxy is used to balance queries across our redis caching infrastructure.
- PHP-Resque workers run background tasks for the BigCommerce App
- BCServer is internally produced software we use to manage the lifecycle of our stores.
- Consul Agent monitors the status of the various services and registers this with the Consul cluster for that datacenter
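As a rough illustration, a Consul agent service definition for one of these services might look like the following (the service name, port and check intervals here are hypothetical, not our production values):

```json
{
  "service": {
    "name": "php-fpm-app",
    "port": 9000,
    "check": {
      "tcp": "127.0.0.1:9000",
      "interval": "10s",
      "timeout": "2s"
    }
  }
}
```

The agent runs the check locally and reports the result to the Consul cluster, so downstream consumers (DNS, consul-template) only ever see healthy instances.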
The relational database of choice at BigCommerce is MySQL. There are many reasons for this that are well outside of the scope of this post, so let’s avoid the politics and concentrate on the infrastructure at hand.
A few years ago we experimented with Percona XtraDB Cluster. This worked great for the most part and allowed us to build out a big shiny cluster. It was, however, early-release software at the time, and major issues ultimately arose, mainly because we were still actively using MyISAM tables for fulltext search (since rectified!). After multiple issues that we were unable to resolve with Percona support, we opted for this simpler approach.
The Active and Backup master are in a Master-Master replicated pair. Failover between these hosts is facilitated using DNS names in Consul. Initially we had been using a floating IP address to route requests to the master, but that restricted us to a single VLAN, which in turn restricted us to a single datacenter for the masters.
The slave replicates directly from the Active master and is used for creating snapshots, backups, and running out-of-band queries for data. We are currently in the process of implementing Orchestrator to manage the automated failover process for these clusters.
We have many database clusters based on this design, each serving many thousands of stores. Our sharding topology is simple because each store has its own database in MySQL. When a new store is provisioned, a database cluster is selected from the pool, a user and a database are created on the cluster, and these details are added to the store's configuration.
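Provisioning a new store on the selected cluster boils down to a few statements along these lines (the naming scheme and connection limit here are illustrative, not our actual scheme):

```sql
-- Create a dedicated database and user for the new store
CREATE DATABASE store_12345;
CREATE USER 'store_12345'@'%' IDENTIFIED BY '<generated-password>';

-- The store's user can only see its own database
GRANT ALL PRIVILEGES ON store_12345.* TO 'store_12345'@'%';

-- Limit connections to reduce the impact of noisy neighbours
GRANT USAGE ON *.* TO 'store_12345'@'%' WITH MAX_USER_CONNECTIONS 20;
```

The credentials and cluster address then go into the store's configuration, and the application connects directly with that user.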
There are some great advantages to this approach:
- Really easy to shard and grow capacity: we just build a new cluster and add it to the provisioning pool.
- Ability to limit connections per store. This reduces the impact of noisy neighbours.
- Noisy stores can be quickly relocated to other clusters to minimize impact to the platform.
- Excellent separation of privileges, creating an additional layer of security. A store can only access its own data in the DB, based on MySQL permissions. This reduces the likelihood of one store being able to access the data of another.
- No need to implement complex sharding decisions into the application or add middleware.
We have dealt with a number of scaling challenges during our growth. The main one is that while MySQL is well optimised for tables with millions of rows, it was not built with huge numbers of tables in mind. We have innodb_file_per_table enabled so we can reclaim disk space when a store is cancelled (which happens quite often for trial stores). This results in some clusters with well over 1.5 million tables on disk, producing lengthy startup times for MySQL. To compensate, we've got servers with huge amounts of RAM. Once the InnoDB buffer pool is warmed up and the data dictionary is populated, it is highly performant. Combined with our well-distributed and redundant servers, these startup times do not impact the performance of our service.
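The relevant pieces of that trade-off show up directly in the MySQL configuration. A simplified my.cnf fragment might look like this (the values are illustrative, not our production settings):

```ini
[mysqld]
# One .ibd file per table, so dropping a cancelled store's
# database actually returns disk space to the OS
innodb_file_per_table = 1

# Large buffer pool to keep the data dictionary and hot pages
# in memory once warmed up
innodb_buffer_pool_size = 192G

# Raise open file/table limits to cope with 1.5M+ tables
open_files_limit = 1000000
table_open_cache = 400000
```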
Our initial foray into using object storage was fraught with issues. We had a large number of teething issues with our initial provider. They had scaling problems trying to cope with the sudden high demands of multiple large customers using their platform. For BigCommerce, this resulted in some prolonged service outages where templates or assets could not be loaded, meaning that storefronts were either totally unavailable or rendered incorrectly. This was NOT ok!
To work around this issue we started by implementing some large HTTP caches in front of the object storage platform. This worked for the most part, but it was a pain to maintain and had its own problems when the underlying services became slow or unavailable. And while it helped with reads, it did not fix the problem for writes.
After some lengthy internal discussions we came to the conclusion that storing our content in multiple providers and locations would give us the best possible resiliency and reduce our reliance on a single provider.
Thus, the Mule project was born (Asset Storage Service). This is internally developed software that allows us to store assets with multiple remote or local providers.
Mule features pluggable frontends and backends with support for HTTP, Swift and S3. It is designed around a horizontally scalable architecture and centers on a metadata database implemented in MySQL.
Here’s (roughly) how it works:
Asset Storage (writes):
- Applications hit one of the frontend endpoints to store a file (via HTTP(S) load balancers)
- Mule generates an MD5 hash of the content and registers it in the asset database
  - Registers against the desired namespace (e.g. www.mystore.com)
  - Registers with the requested object path (e.g. '/foo/bar/file.txt')
- Mule stores the content in the primary storage backend
  - If the primary backend is unavailable, it works through a prioritised list of other backends
  - Content is stored on the backend with an object name based on the MD5 hash
- Returns success to the calling request
- Background processes constantly run integrity checks
  - Ensure assets exist in all configured backends
  - Ensure the content in each backend matches the MD5 hash
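The write path above can be sketched roughly as follows. This is a simplified illustration, not Mule's actual code; the `db` and `backend` interfaces are hypothetical stand-ins for the metadata database and storage clients:

```python
import hashlib


def store_asset(namespace, path, content, db, backends):
    """Store `content`, register it in the metadata DB, and write it
    to the first available backend in priority order."""
    # Content-addressed object name based on the MD5 hash
    md5 = hashlib.md5(content).hexdigest()

    # Register namespace + object path -> hash in the asset database
    db.register(namespace=namespace, path=path, md5=md5)

    # Try the primary backend first, then fall back in priority order
    for backend in backends:
        try:
            backend.put(md5, content)
            return md5  # success is returned to the calling request
        except ConnectionError:
            continue  # backend unavailable; try the next one
    raise RuntimeError("no storage backend available")
```

Because objects are named by content hash, the background integrity checkers can verify any backend independently: re-hash the stored bytes and compare against the object name.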
Asset retrieval (reads):
- Applications hit one of the frontend endpoints with a request for the object path (via HTTP(S) load balancers)
- Mule does a database lookup to discover the MD5 hash of the object based on the path
- Mule requests and returns the content from the primary backend
  - If the primary backend is unavailable, it works through a prioritised list of other backends
  - After a 200 ms (configurable) timeout, requests to the other backends are started in parallel
  - Circuit breakers are created for any unavailable backends to reduce impact when one of them is down
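The read path with its circuit breakers can be sketched like this. Note this is a simplified, sequential illustration (the real system races the other backends in parallel after the 200 ms timeout), and the `db`/`backend` interfaces are hypothetical:

```python
import time


class CircuitBreaker:
    """Skip a backend for `cooldown` seconds after a failure."""
    def __init__(self, cooldown=30):
        self.cooldown = cooldown
        self.tripped_at = {}

    def available(self, name):
        tripped = self.tripped_at.get(name)
        return tripped is None or time.monotonic() - tripped > self.cooldown

    def trip(self, name):
        self.tripped_at[name] = time.monotonic()


def fetch_asset(path, db, backends, breaker):
    """Look up the object's MD5 by path, then read it from the first
    healthy backend in priority order."""
    md5 = db.lookup(path)
    for name, backend in backends:
        if not breaker.available(name):
            continue  # breaker open: skip without waiting on a timeout
        try:
            return backend.get(md5)
        except ConnectionError:
            breaker.trip(name)  # remember the failure for a while
    raise RuntimeError("asset unavailable on all backends")
```

The breaker is what keeps a dead primary from adding its full connection timeout to every single read while it is down.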
Since implementing this system we’ve been able to mitigate failures in a single asset storage backend while keeping the service online. The only downside is the added latency due to round-trip times to other datacenters. Using this system, we’ve been able to seamlessly add and remove other providers to improve our resiliency to failure with zero downtime!
Our volatile redis infrastructure is used exclusively for caching and storing transient information that can be retrieved from elsewhere (such as the database, elasticsearch, asset storage or another remote service). Data stored in this environment is subject to destruction from LRU expiration or when the underlying redis services stop running (upgrades, reboots, failures).
Twemproxy (nutcracker) is running on all of our application servers, giving a unified logical view of the entire cluster to the client. This allows us to horizontally scale the infrastructure across many instances on many hosts.
- Redis is single threaded, so if you are running on dedicated hardware you should run a single instance per CPU.
- In this configuration the datastore exists only in memory. When upgrading redis or performing maintenance you’ll need to splay the restarts to avoid thundering herds hitting your backends.
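A twemproxy pool for this kind of setup might look like the following (the instance addresses and weights are illustrative):

```yaml
cache:
  listen: 127.0.0.1:22121
  redis: true
  hash: fnv1a_64
  distribution: ketama        # consistent hashing across instances
  auto_eject_hosts: true      # drop dead instances from the ring
  server_retry_timeout: 30000
  server_failure_limit: 3
  servers:
    - 10.0.0.11:6379:1
    - 10.0.0.11:6380:1
    - 10.0.0.12:6379:1
    - 10.0.0.12:6380:1
```

With `auto_eject_hosts` enabled, a failed instance is transparently removed and its keys redistribute across the remaining ring, which is acceptable here precisely because this cluster only holds re-creatable cache data.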
Our persistent redis infrastructure is used for storing long-lived data we don't want to lose, such as store configuration and other key-value data essential to a running store.
Each single threaded redis instance is configured with a slave running on a second node. Redis Sentinel is deployed across multiple hosts (to form a quorum) and monitors the 2 instances. This can be repeated for many instances or with many slave instances to a single master. If the current master process fails or is unavailable, Redis Sentinel will automagically promote the slave to the master and (when available) will reconfigure the old master as a slave.
Traditionally you'd access the redis instances via an IP or hostname. With Redis Sentinel, however, you'll first need to make a connection to one of the Redis Sentinel hosts via its TCP port and request the details of the current master. This will return (among other things) the hostname/IP and TCP port of the instance which is currently the master. The application can then connect to the master as usual.
The PHP snippet provided in the slide abstracts away this process by utilising Redis Sentinel support in the Predis library.
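Since the slide isn't reproduced here, the discovery flow that Predis handles for you looks roughly like this sketch (the `ask` callback standing in for a `SENTINEL get-master-addr-by-name` call is hypothetical, injected so the flow is self-contained):

```python
def discover_master(sentinels, ask):
    """Ask each Sentinel in turn for the current master's address.

    `ask(host, port)` performs `SENTINEL get-master-addr-by-name <name>`
    against one Sentinel and returns (master_host, master_port), or
    raises ConnectionError if that Sentinel is unreachable.
    """
    for host, port in sentinels:
        try:
            master_host, master_port = ask(host, port)
            return master_host, int(master_port)
        except ConnectionError:
            continue  # this Sentinel is down; try the next quorum member
    raise RuntimeError("no Sentinel reachable")
```

The important property is that the client never hardcodes the master's address: as long as one Sentinel in the quorum is reachable, the application finds the current master, even right after a failover.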
L4 Load Balancing
Our Layer 4 Load Balancers are where the TCP connections for all storefront and control panel requests are terminated.
Client requests (HTTP, HTTPS) are received by nginx and passed to a backend application server for processing.
We run in excess of 100,000 stores, including trial, demo and sandbox stores. A large number of those stores require their own SSL certificate. Traditionally, this has meant an IP address per certificate. However, now that widespread client support for SNI is available (assuming you are no longer using Windows XP!), we've been able to simplify the process, though we still have to manage a large number of legacy IP addresses and a huge number of SSL certificates.
Initially we managed these certificates on disk and had to replicate them out to all the L4 load balancers. This was clunky and introduced delays when clients installed their SSL certificates. We recently implemented storage of these certificates in our Hashicorp Vault service. Nginx does not have native support for Vault, so we've implemented this with Lua in nginx, utilizing OpenResty's ssl_certificate_by_lua feature to look up certificates.
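The shape of that integration, heavily simplified, looks something like this (the `our.vault` helper module is a hypothetical stand-in for our Vault lookup code, and caching/error handling is omitted):

```nginx
server {
    listen 443 ssl;

    # Placeholder cert; replaced per-handshake by the Lua handler
    ssl_certificate     /etc/nginx/ssl/fallback.crt;
    ssl_certificate_key /etc/nginx/ssl/fallback.key;

    ssl_certificate_by_lua_block {
        local ssl = require "ngx.ssl"
        local vault = require "our.vault"   -- hypothetical helper

        ssl.clear_certs()
        local name = ssl.server_name()      -- SNI hostname from the client
        local cert, key = vault.lookup(name)
        ssl.set_der_cert(cert)
        ssl.set_der_priv_key(key)
    }
}
```

Because the certificate is selected per TLS handshake, a newly installed certificate is live as soon as it lands in Vault, with no replication step and no nginx reload.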
Configurable Dynamic Backends
In certain circumstances it can be desirable to direct traffic for a store to a different set of backend hosts. We use Lua to run a lookup against redis to point specific stores at another pre-configured backend. This allows us to test new groups of servers, OS upgrades or hardware changes on specific stores to ensure they are performant before rolling out changes to the rest of the production infrastructure.
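Conceptually the lookup works along these lines (a simplified sketch; the redis key scheme, pool names and addresses are hypothetical):

```nginx
location / {
    set $backend "pool_default";

    access_by_lua_block {
        local redis = require "resty.redis"
        local red = redis:new()
        red:set_timeout(50)  -- fail fast; fall back to the default pool
        if red:connect("127.0.0.1", 6379) then
            -- hypothetical key scheme: backend:<store hostname> -> pool name
            local pool = red:get("backend:" .. ngx.var.host)
            if pool and pool ~= ngx.null then
                ngx.var.backend = pool
            end
        end
    }

    proxy_pass http://$backend;
}
```

Moving a store onto a canary pool is then just a redis write, with no configuration change or reload on the load balancers.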
The configuration for the backends is created via consul-template. This binds to the consul service and responds to changes by re-writing the nginx configuration and re-loading the nginx daemon. This allows us to safely and seamlessly remove failed nodes automatically, and gives us a quick and easy option to remove a node from service when maintenance is required.
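A consul-template source for one of these upstream pools might look like this (the service and pool names are illustrative):

```nginx
# Rendered by consul-template; rewritten as nodes come and go
upstream pool_default {
{{ range service "app-servers" }}
    server {{ .Address }}:{{ .Port }};
{{ end }}
}
```

consul-template watches the `app-servers` service in Consul and, on any membership change, re-renders the file and runs a reload command (e.g. `consul-template -template "upstreams.ctmpl:upstreams.conf:nginx -s reload"`), so only healthy nodes ever appear in the upstream block.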
CDN Asset Cache
To speed up access to assets in the CDN, each L4 Load Balancer has a large amount of local disk storage to cache any requests for assets (mostly CDN traffic).
Layer 3 (IP) Load Balancing
Our Layer 3 load balancers offer three main features:
- Firewalling with IPTables
- IP Load Balancing using LVS
- BIRD for BGP announcements
While IPTables and LVS are reasonably self-explanatory, there is more to be said about the BGP announcements. We manage multiple IP ranges and many thousands of IP addresses; BGP allows any of our L3 load balancers to advertise which addresses it is able to respond on. This allows us to balance load and route around outages by migrating any number of IP addresses or ranges to another firewall.
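To give a flavour of what that announcement looks like, here is a hypothetical BIRD fragment (the prefix, ASNs and neighbor address are documentation/private-range placeholders, not our real values):

```conf
# Announce a range this load balancer is currently serving
protocol static announced {
    route 203.0.113.0/24 via "lo";
}

protocol bgp core_router {
    local as 64512;
    neighbor 198.51.100.1 as 64512;
    export where proto = "announced";
}
```

Withdrawing the static route on one box and adding it on another is enough to migrate the whole range, since the upstream routers simply follow the announcements.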
Tech I haven't even mentioned yet!
We use many varied technologies to provide the BigCommerce stack. While this post described one part of our infrastructure, the core service requires other systems and infrastructure to provide all of the features our platform provides.
While we iterate quickly on technologies in our development environments, we need to grow those services to be scalable, highly available and performant, often under heavy load.
This is just a small part of a snapshot about where we are right now… the future contains many possibilities!
Did you find this interesting? Did I mention yet that we are hiring? https://www.bigcommerce.com.au/careers/