Original article was published on Artificial Intelligence on Medium
Skiff starts by utilizing containers, something that (as noted) was already well established at AI2. This is really the only requirement we ask of our users: if their application can be packaged up with Docker, we can run it.
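In practice that contract is tiny. A hypothetical application only needs a Dockerfile along these lines (the base image, port, and entrypoint here are illustrative, not Skiff's actual requirements):

```dockerfile
# Illustrative minimal image for a Python web app. Skiff's only
# requirement is that the application builds into a container.
FROM python:3.8-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .

# The platform routes traffic to whatever port the app listens on.
EXPOSE 8000
CMD ["gunicorn", "-b", "0.0.0.0:8000", "app:app"]
```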
This was an immediate benefit: we were able to take many applications that were already containerized and quickly transition them to the new solution. Just like that, a number of EC2 nodes were turned off with the flick of a wrist. I could picture Bezos shaking his fist at us as those cobweb-laden, low-utilization VMs were shut down.
For new applications we help people get started by offering a template that includes a Flask API, a TypeScript and React-based UI, and an NGINX reverse proxy. The proxy serves the UI in production from disk and routes traffic to
webpack's development server in local environments. It's also a nice way to avoid the
Access-Control-Allow-Origin: * header that we see all too often.
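The proxy's routing boils down to something like the following sketch (ports and upstream names are hypothetical, not our actual config):

```nginx
# Sketch of the local-development proxy: API requests go to the Flask
# backend, everything else to webpack's dev server. Because the browser
# only ever talks to one origin, no CORS headers are needed.
server {
    listen 8080;

    # API traffic is forwarded to the Flask app.
    location /api/ {
        proxy_pass http://api:5000;
    }

    # UI traffic goes to webpack-dev-server; in production this block
    # instead serves the built assets from disk.
    location / {
        proxy_pass http://ui:3000;
    }
}
```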
The next piece of the puzzle involves building images and getting them to Kubernetes. We chose and continue to use Google Cloud Build. As users push changes to
master we take their code, package things up in Docker, and push it to Google’s container registry. Our system then uses Jsonnet to generate the Kubernetes config required for running their application and
kubectl apply's it. Shortly thereafter the user's workload is humming away in the cloud.
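The appeal of the Jsonnet layer is that users declare a handful of values and the Kubernetes boilerplate is generated for them. A simplified, hypothetical sketch of the idea (our real templates also emit a Service, an Ingress, and more):

```jsonnet
// Hypothetical sketch: expand an app name and image into a
// Kubernetes Deployment. The function hides the repetitive
// label/selector plumbing from the user.
local deployment(name, image, replicas=2) = {
  apiVersion: 'apps/v1',
  kind: 'Deployment',
  metadata: { name: name },
  spec: {
    replicas: replicas,
    selector: { matchLabels: { app: name } },
    template: {
      metadata: { labels: { app: name } },
      spec: { containers: [{ name: name, image: image }] },
    },
  },
};

deployment('demo', 'gcr.io/example/demo:abc123')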
Google Cloud Build has done the job we've asked of it for the last year, but it could definitely be better. Their build triggers still show up as opaque hashes in GitHub's UI, they lack support for
git features like
submodules, and the YAML they require for declaring one's build is clunky and verbose. We're actively looking at porting to GitHub Actions; if that solution had existed prior to Skiff's conception we probably would've jumped on the bandwagon.
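For a flavor of that verbosity, even a basic build-and-push pipeline spells out every step as its own builder invocation (an illustrative cloudbuild.yaml, not our actual one):

```yaml
# Illustrative cloudbuild.yaml: each step runs in its own builder
# image, so even "build the image and push it" takes several stanzas.
steps:
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'gcr.io/$PROJECT_ID/demo:$COMMIT_SHA', '.']
  - name: 'gcr.io/cloud-builders/docker'
    args: ['push', 'gcr.io/$PROJECT_ID/demo:$COMMIT_SHA']
images:
  - 'gcr.io/$PROJECT_ID/demo:$COMMIT_SHA'
```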
Once the application gets to the cluster there's a bunch of machinery that sets things up correctly for the user. Luckily all of this is transparent to them and remarkably easy for my teammates and me to maintain. We use GKE as our Kubernetes provider; Google's support is, simply put, best in class. We use Cert Manager to provision Let's Encrypt TLS certificates in minutes. And last, but surely not least, the Kubernetes Ingress NGINX Controller handles TLS termination and forwards requests to the right place.
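The way these pieces snap together can be sketched in a single manifest. In this hypothetical Ingress (names and hostname are illustrative), the cert-manager annotation requests a Let's Encrypt certificate while ingress-nginx terminates TLS and routes to the app's Service:

```yaml
# Illustrative Ingress tying the pieces together: cert-manager watches
# the annotation and provisions the TLS secret; ingress-nginx serves it.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo
  annotations:
    kubernetes.io/ingress.class: nginx
    cert-manager.io/cluster-issuer: letsencrypt
spec:
  tls:
    - hosts: [demo.example.org]
      secretName: demo-tls
  rules:
    - host: demo.example.org
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: demo
                port: { number: 80 }
```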
The setup described served us well for the first few months — in fact, after getting things rolling our team did nothing but transition people to the new stack. We grew quickly — as a small internal team we were having our “startup moment”. Suddenly teams we hadn’t even anticipated were asking if they could use our solution. It must’ve been that nautical metaphor which was all too easy to extend, or maybe they’d all been struck by an inevitable TLS certificate expiration outage and wanted to say goodbye to that problem forever.
As the number of workloads grew we realized we needed a few more things to provide an excellent user experience, so we wrote some additional software to help people effortlessly launch new applications.
The last piece of the puzzle is an application we call the Bilge Pump. This small, reliable piece of machinery is responsible for scanning the cluster for ephemeral environments and removing them. We let users create new application environments from any Github branch via the click of a button. These environments have a configurable expiration, after which the Bilge pumps them back out to sea. This has proven to be a vital mechanism for fast iteration — now code reviews can include a live demo and a chance for product managers and others throughout the org to review and suggest changes.
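At its core, the reaping decision is just a timestamp comparison. Here's a hedged Python sketch of that logic; the annotation keys and function are hypothetical illustrations, not the Bilge Pump's actual code:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical annotation keys; the real Bilge Pump's conventions may differ.
CREATED_KEY = "skiff.example/created-at"
TTL_KEY = "skiff.example/ttl-days"

def is_expired(annotations, now=None):
    """Decide whether an ephemeral environment should be reaped.

    `annotations` mimics Kubernetes object metadata: an ISO-8601
    creation timestamp plus the environment's configurable TTL in days.
    """
    now = now or datetime.now(timezone.utc)
    created = datetime.fromisoformat(annotations[CREATED_KEY])
    return now - created > timedelta(days=int(annotations[TTL_KEY]))
```

A reaper loop would then periodically list the cluster's ephemeral environments and delete any for which a check like this returns True.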
There are, of course, a few pieces I’m leaving out — as with any system it’s impossible to describe all the details. That said, what’s important is that this solution has allowed a team of 3 engineers to run a diverse collection of over 90 web applications that handle millions of requests per month over the last year with 99.95% availability. What’s even more important is that the pace of development has continued to increase throughout the year — signaling that we’re enabling exactly what we intended to.
You might be tempted to dismiss this as a hype-infused post from YAKF (yet another Kubernetes fan). Sure, I'll admit I'm a big proponent of the technology. But I'm also not shy about admitting that it's been a complex bit of software to fully understand and operate. We made just about every mistake in the book, and if it weren't for GKE's handling of some of the low-level details I'd probably be telling a very different story. That said, I can also say that the power, flexibility, and resiliency afforded by Kubernetes have been essential to the success of Skiff. Turns out if you automatically restart someone's application when it OOMs and run a bunch of replicas, you can greatly improve the end-user experience.
I'm really excited to see what else we bring to Skiff in the coming months and years, and the impactful, forward-thinking applications our researchers develop using it. I also can't wait to continue giving nautically themed presentations to the company; the good ol' Captain's Log never gets old.
⛵️ Smooth sailing out there friends. I think I’m going to go spin up an Apache server, as this post has me (again) feeling nostalgic. I might abstain from writing PHP though as that’s something I don’t miss too much.
Sam Skjonsberg is an engineer on the ReViz team at AI2, building tools and infrastructure that help teammates share their work in new and compelling ways. When he’s not spinning up Apache web servers, you’ll find him riding his bike or adventuring in the PNW with his wife, two dogs, and soon their son (they’re about to welcome a new little one!).