How do you keep your developers happily developing on a meshed, microservice solution that requires dozens of services and underlying executables, without them losing their ever-loving minds when they can’t reach Reddit for a quick read because their workstation has run out of memory and CPU? You give them a clustered development environment on which to work.
Instead of running everything on a container orchestrator on the workstation, you spread the load out onto the cloud, on a few virtual machine instances (which we’ll call a cluster) as well as onto the local workstation. You then use a bit of dnsmasq magic and a service registry (Consul) to make all references to running services trivial. It wouldn’t be complete if you weren’t able to code on any one component, so you add a way to bypass any running service with a local, in-development copy on the local workstation. Using any of their favorite IDEs, developers can write and test code running in the mesh without overloading the local workstation with everything at once.
The solution itself was a work of art as much as it was a herculean effort to implement. Each developer now has her workstation as well as her cluster to work with. But think about that for a second: instead of just having one workstation, paid for and depreciating in value, a developer now has essentially 1 + n machines, where n is the number of VMs you’ve allocated for them. These additional n machines are not depreciating assets; they come with a sometimes steep monthly cost, whether you are using them or not. This additional monthly expense adds up quickly. So do you continue paying for the n instances that are still up and running on the cloud doing nothing if the developer is done for the day? Or do you rely on the engineer to stop their instances and restart them when they next need them? How about neither?
Google Cloud Platform (GCP) is more than happy to keep taking your money while you are asleep and the instances you’ve provisioned are not. There is a way to resolve this dilemma and recapture some of that excess spend.
Developer Average Work Day
Your average developer works around 8 hours a day. That’s eight hours of coding, testing, deploying, rinsing and repeating, sprinkled with trips to get coffee. GCP bills instances for every second they run, whether or not anyone is using them. So do you leave the instances up and running when the day’s over to avoid any loss of productivity when you next pick up where you left off? No, you don’t. You try to leverage GCP’s APIs to minimize your costs, as any responsible business should. As a pleasant side effect, and as my experience has shown me, when you strive to save some money, you inevitably make your infrastructure leaner and more efficient. You find ways to use the fewest resources while balancing the need to maintain a responsive, reproducible system.
With the intention to save a few infrastructure dollars, we set out to implement a solution: enter Suspend and Resume.
Suspend and Resume
Until recently, GCP only offered you the ability to stop and start your instances. Just like any machine, stopping an instance shuts it down, gracefully where possible, and clears its memory. Starting an instance works similarly, though you are responsible for loading back into memory any process you need to have running. This may entail using the init.d directory to run a script or any number of other monitoring tools. It’s up to you to make sure a VM start puts everything back where it needs to be. GCP has no clue what processes you want running.
You could say that perhaps the right approach is to stop and start instances not in use. That is a solution. But it’s a solution that will stop any running process an individual developer may have launched off-script. That is, if your mesh requires services a through n to run and is tooled to restart them, a developer who wants to launch service n+1 will also have to make sure it can be restarted. Perhaps they aren’t that far along in their implementation yet. Perhaps they are running something completely unrelated to the running services, like a monitoring tool, or some ETL process. Stopping their instance will stop them in their tracks and possibly force them to recreate a “snowflake” they’d been working on when their own time and energy ran out. Let’s not do that to them. As much as we want to create an atmosphere of homogeneity among our developer class, they are as individual in what they are working on, and the way they work, as they are in their own person.
So where do we turn? As luck would have it, GCP released the ability to suspend and resume an instance in an alpha update to its Compute API. This single enhancement changes the game.
Instead of relying on developers to self-regulate and turn their instances off during non-working hours, we could do this for them. A suspend pauses the running processes and snapshots the memory, so that a subsequent resume continues exactly where it left off. You don’t have to gracefully quit running processes and you don’t have to build tooling to get services back to where they were before a suspend. More importantly, we reap the same reward as stopping an instance altogether. That is, we incur no cost while a machine is suspended except storage of the snapshot, which is chump change compared to keeping the instances running.
If your developers only actively use their clusters of instances 8 hours a day, we can save up to 16 hours of cost per day by suspending the underlying instances. Win, win!
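As a back-of-the-envelope check on that figure, here is a minimal sketch of the arithmetic. The hourly rate and instance count are made-up illustrative values, not our actual numbers, and the sketch ignores the (small) cost of storing the suspended snapshot.

```python
def daily_savings(active_hours: float, hourly_rate: float, instances: int = 1) -> dict:
    """Estimate what suspending idle instances saves per day.

    Assumption: a suspended instance costs nothing (snapshot storage is ignored).
    """
    if not 0 <= active_hours <= 24:
        raise ValueError("active_hours must be between 0 and 24")
    idle_hours = 24 - active_hours
    return {
        "idle_hours": idle_hours,
        "fraction_saved": idle_hours / 24,
        "dollars_saved": idle_hours * hourly_rate * instances,
    }


# Hypothetical figures: a 3-instance cluster at $0.25 per instance-hour, used 8 hours a day.
result = daily_savings(active_hours=8, hourly_rate=0.25, instances=3)
print(result)
```

With 8 active hours, two thirds of each day’s instance-hours are idle, which is the upper bound on what suspending can save.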
So let’s expound on the part about putting things to sleep for our developers while they are off the clock.
The interval between automatically suspending and resuming an instance is what we term “Automatic Sleep”. This is a period of time bracketed by a suspend and a resume that, hopefully, goes unnoticed by the owning developer and harmlessly saves us a wad of cash. In an organization with hundreds of developers and contractors, spread across multiple time zones, keeping track of who owns what instances and what state those instances are in is a job for a piece of software we wrote that we affectionately call Maestro. Like the conductor of a well-coordinated orchestra composed of wind, percussion and string instruments, a maestro coordinates who plays and at what time.
Similarly, our Maestro keeps track of our developer clusters. It knows when to automatically put them to sleep and when to wake them, always with the owner’s time zone in mind. We typically sleep the clusters automatically during the early hours of the morning, when we presume most people will be asleep themselves. Additionally, if developers are on extended leaves of absence, they can put their clusters to sleep indefinitely, waking them upon their return. Using a CLI tool we built, developers can ask for their clusters to be awoken, suspended, rebuilt, etc. The tool also integrates with Slack to give some useful notifications.
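To make the time zone point concrete, here is a minimal sketch of the kind of check a scheduler could run. It is not Maestro’s actual code: the function name, the 01:00 to 06:00 window, and the fixed UTC offsets are all assumptions for illustration. In practice you would pass a DST-aware zone such as `zoneinfo.ZoneInfo("America/Los_Angeles")` instead of a fixed offset.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo

# Hypothetical sleep window, expressed in the owner's local time.
SLEEP_START = time(1, 0)
SLEEP_END = time(6, 0)


def should_be_asleep(now_utc: datetime, owner_tz: tzinfo) -> bool:
    """Return True if now_utc falls inside the owner's local sleep window."""
    local_time = now_utc.astimezone(owner_tz).time()
    return SLEEP_START <= local_time < SLEEP_END


now = datetime(2019, 4, 2, 9, 30, tzinfo=timezone.utc)
# Fixed offsets stand in for real zones here (assumption for the example).
pacific = timezone(timedelta(hours=-7))
berlin = timezone(timedelta(hours=2))

print(should_be_asleep(now, pacific))  # owner's local time is 02:30
print(should_be_asleep(now, berlin))   # owner's local time is 11:30
```

The same instant lands inside the window for one owner and outside it for the other, which is why the schedule has to be evaluated per owner rather than once globally.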
Suspend and resume can be done asynchronously if you wish to fire and forget. However, developers who manually resume a cluster really can’t continue working until it is up and awake, so we like to report back to them when it is available. To do that, we use GCP’s operations API to poll, with a predetermined timeout period, until the suspend or resume operation has completed. Polling for completion also allows us to update our object model with the current state of the underlying instances.
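The polling pattern itself is simple. Below is a minimal sketch, not our production code: `get_status` is a hypothetical callable standing in for whatever fetches the operation’s status from the operations API, and the timeout and interval defaults are arbitrary. It relies on the fact that Compute Engine operations report a status of `PENDING`, `RUNNING`, or `DONE`.

```python
import time
from typing import Callable


def wait_for_operation(get_status: Callable[[], str],
                       timeout_s: float = 600.0,
                       interval_s: float = 5.0) -> bool:
    """Poll get_status until it returns "DONE" or the timeout elapses.

    Returns True if the operation finished in time, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if get_status() == "DONE":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Simulated operation that completes on the third poll (illustration only).
statuses = iter(["PENDING", "RUNNING", "DONE"])
finished = wait_for_operation(lambda: next(statuses), timeout_s=1.0, interval_s=0.01)
print(finished)
```

A `False` return is the signal to tell the developer the resume is taking longer than expected, rather than leaving them waiting silently.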
What we found is that suspends take about four times as long as resumes. Resuming a single instance still takes under 4 minutes. Your mileage may vary. Our instances typically have 2 vCPUs, 24 GB of RAM and 200 GB of storage. If you plan to suspend hundreds of instances at once, it’s best to use multiple threads or work queues to perform the operations.
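One way to fan the work out is a thread pool, since each suspend spends most of its time waiting on the API. This is a minimal sketch under stated assumptions: `suspend_instance` is a hypothetical placeholder for the real blocking call (issue the suspend, then poll until done), and the worker count and instance names are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict, Iterable


def suspend_instance(name: str) -> str:
    """Hypothetical placeholder: the real version would call the API and wait."""
    return f"{name}: suspended"


def suspend_all(names: Iterable[str], max_workers: int = 16) -> Dict[str, str]:
    """Suspend many instances concurrently and collect a per-instance outcome."""
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(suspend_instance, n): n for n in names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:  # one failure should not abort the batch
                results[name] = f"failed: {exc}"
    return results


outcome = suspend_all([f"dev-vm-{i}" for i in range(5)])
print(outcome)
```

Capturing failures per instance keeps one stuck suspend from hiding the outcome of the rest of the batch.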
Where From Here?
It’s not uncommon for some developers to be night owls. I’m one of them. I work best from 8pm to 1am. Having a predetermined automatic sleep cycle that turns off my instances when I’m most productive is not the best use of my patience. Going forward, we intend to give users the ability to set their own optimal work hours. This would free the instances to be put to sleep during the complement of that period, that is, any time outside the optimal work hours. Also, for that off chance when a fire drill keeps us up late, even the day-walkers, the ability to anti-snooze would be good. It would work like the snooze button on your alarm, but instead of delaying when it next tries to wake you, our anti-snooze button would delay when automatic sleep kicks in, giving you a day or so before your instances are next put to sleep.
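Since this is still a plan, the following is only a sketch of how such a rule might look. The function and parameter names are hypothetical. It treats work hours as a daily window that may wrap past midnight, and treats an anti-snooze as a timestamp before which automatic sleep is not allowed.

```python
from datetime import datetime, time, timedelta
from typing import Optional


def may_sleep(now_local: datetime, work_start: time, work_end: time,
              snoozed_until: Optional[datetime] = None) -> bool:
    """Return True if automatic sleep is allowed to kick in at now_local."""
    # An active anti-snooze always wins.
    if snoozed_until is not None and now_local < snoozed_until:
        return False
    t = now_local.time()
    if work_start <= work_end:
        working = work_start <= t < work_end
    else:
        # Window wraps past midnight, e.g. 20:00 to 01:00.
        working = t >= work_start or t < work_end
    return not working


owl_start, owl_end = time(20, 0), time(1, 0)  # the 8pm to 1am night owl
late = datetime(2019, 4, 2, 23, 0)
early = datetime(2019, 4, 3, 3, 0)

print(may_sleep(late, owl_start, owl_end))    # inside work hours
print(may_sleep(early, owl_start, owl_end))   # outside work hours
print(may_sleep(early, owl_start, owl_end,
                snoozed_until=early + timedelta(days=1)))  # anti-snoozed for a day
```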
Originally published 4/2/2019 on Medium.