Why Skillshare runs on Kubernetes

Never stop learning — especially not because of downtime

Matias Forbord
Skillshare Writings

--

Stormy waves. Not what you want your deployments to look like. Via Unsplash by John Towner

Skillshare is scaling. Members, teachers, classes, class projects, employees, you name it. If Skillshare goes down, even for a little bit, all that learning stops, and that is a DevOps nightmare. On the DevOps team, it is our responsibility to keep the site up at all costs.

In 2016, we were deploying with Chef and AWS OpsWorks. It was a drawn-out, multi-step process with too many dependencies and too many opportunities for things to go wrong.

To take our infrastructure to the next level and support our growth, we took a hard look at containers. The Docker revolution changed the landscape of website infrastructure. Our existing setup was slowing us down, and maintenance costs for even simple tasks were growing more cumbersome as Skillshare and our deploy process grew.

At that time, there weren’t that many established container orchestration solutions. After investigating the contenders, Kubernetes emerged as the clear winner for our needs. The course was set, and we began containerizing our infrastructure.

Container orchestration usage percentages in production, 2015 to 2016. Via Cloudify's OpenStack survey

Today, our entire stack lives within Kubernetes. Each piece runs in its own Docker container. Each container lives in a pod managed by Kubernetes, all running on AWS EC2 instances.

Kubernetes is self-healing, and with the right configuration it is self-scaling too. If traffic and throughput spike, our Nginx and PHP pods scale up and response times stay flat. If background tasks surge, our Resque worker pods scale up and keep the queue short.
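
As a rough sketch of what that looks like (the names and thresholds here are illustrative, not our exact config), a Horizontal Pod Autoscaler watches a deployment's CPU usage and adjusts its replica count:

```yaml
# Hypothetical HPA for the PHP deployment; numbers are illustrative.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: php-fpm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-fpm
  minReplicas: 4
  maxReplicas: 40
  targetCPUUtilizationPercentage: 60  # add replicas when average CPU tops 60%
```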

A stable deployment means a blue/green rollout of new code. No timeouts, no dropped requests, no downtime. We want to learn, teach, and connect with each other, not wait for the page to load.
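
One common way to get blue/green behavior in Kubernetes (a sketch of the general pattern, not necessarily our exact implementation) is to run the old and new code as separate deployments and flip a Service selector between them:

```yaml
# Two identical Deployments carry the labels track: blue and track: green.
# The Service selector decides which set of pods receives live traffic;
# flipping "blue" to "green" cuts traffic over without dropping requests.
apiVersion: v1
kind: Service
metadata:
  name: web  # hypothetical name
spec:
  selector:
    app: web
    track: blue  # patch to "green" once the new pods pass their health checks
  ports:
    - port: 80
      targetPort: 8080
```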

Tranquil and uneventful. This is what you want your deployments to look like. Via Unsplash by Harli Marten

To achieve this, we optimized for stability at every handoff in our stack, from the start of the request all the way down to the PHP code and back. It is often at the seams between these technologies that something goes wrong during a deploy.

When new Nginx pods spin up, requests to the previous pods are still in flight. To keep those requests healthy throughout the deployment, we also use connection draining, which lets the AWS Elastic Load Balancers end each request in a controlled manner. `terminationGracePeriodSeconds` is set high enough to let in-flight requests finish before the old pods terminate.
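
On the Kubernetes side, that grace period lives in the pod spec. A minimal sketch (the preStop hook and the timings are assumptions, not lifted from our config):

```yaml
# Excerpt from an Nginx pod template. terminationGracePeriodSeconds caps how
# long a terminating pod may keep running; the preStop hook (an assumption
# here) pauses before a graceful Nginx shutdown so the load balancer can
# finish draining connections first.
spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: nginx
      image: nginx:1.13
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 15 && nginx -s quit"]
```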

Annotations on the Nginx service are what wire it to the Elastic Load Balancer. The service doesn't receive traffic until the Elastic Load Balancer health check passes, and new pods spinning up don't receive traffic until their health checks pass. Only then does the PHP service add a new pod to its pool and start sending it traffic.
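
Those annotations are standard Kubernetes-on-AWS fare. A sketch of the shape (the service name and the draining values are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx  # hypothetical name
  annotations:
    # In-tree AWS provider annotations; the values here are illustrative.
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "http"
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-timeout: "60"
spec:
  type: LoadBalancer
  selector:
    app: nginx
  ports:
    - port: 80
      targetPort: 80
```

On the pod side, a readiness probe is what keeps a new pod out of a service's endpoints until it answers health checks.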

Taking a look at the entire config, you can see that we use placeholder values wrapped in {{}}. Environment variables are injected into the spec before the deployment rolls out. This lets us reuse the same specs across different namespaces: only the environment variables change, and we get to treat our infrastructure as code.
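
A hypothetical example of what such a templated spec might look like (the placeholder names and the rendering command are illustrative, not our actual tooling):

```yaml
# deployment.tpl.yml: placeholders are replaced before `kubectl apply`, e.g.
#   sed -e 's/{{NAMESPACE}}/production/' -e 's/{{IMAGE_TAG}}/v1.2.3/' \
#       deployment.tpl.yml | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: php-fpm
  namespace: {{NAMESPACE}}
spec:
  replicas: 4
  selector:
    matchLabels:
      app: php-fpm
  template:
    metadata:
      labels:
        app: php-fpm
    spec:
      containers:
        - name: php-fpm
          image: skillshare/php-fpm:{{IMAGE_TAG}}  # hypothetical image name
```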

Horizontal pod autoscaling can only scale within the compute available on the nodes. Once the nodes start running out of room, the cluster autoscaler kicks in. It adds nodes when it sees a pod that can't fit anywhere: it increases the desired number of nodes in the Auto Scaling Group, and AWS takes it from there. Once the new node is available, the pending pods can be scheduled. Around this time the cluster autoscaler might also play a little pod-scheduling Tetris, moving pods around to free up enough room to scale down a node or make room for additional pods.
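
The cluster autoscaler itself is just another deployment in the cluster, pointed at the Auto Scaling Group it is allowed to resize. A sketch of the relevant container args (the ASG name and bounds are illustrative):

```yaml
# Excerpt from a cluster-autoscaler pod spec.
command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --nodes=3:20:k8s-worker-asg  # min:max:ASG-name; name is hypothetical
  - --scale-down-utilization-threshold=0.5  # scale a node down below 50% use
```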

All these pieces work together to keep our response times flat and our queue lengths short. With Kubernetes configured this way, we have gone from stormy seas to tranquil waters during our deploys.

Thanks for tuning in to this blog post about Kubernetes and blue/green deployments at Skillshare. Would you like to learn more about one of the areas mentioned here? Or how we’ve handled other pieces of our infrastructure? Let us know in the comments!

Join us!

Does this sound like something you would like to work on? Help us build the learning layer of the web! We are hiring for many teams, so check out our openings!
