Lessons from Netflix Software Development

Is Netflix a Good Example?

It is certainly a matter of personal opinion, but I believe Netflix is a good example of what we as software developers and software companies should be striving towards.  Not only do close to 100 million people use their software, but I have also not experienced an issue with their streaming service (that is not to say they do not have bugs, but they seem to be reliable from what I have seen).  Nor have I ever seen them take their service offline to ‘update their servers’, so they must have some deployment scheme that can update while users are using the service.  Sounds like they have things figured out.

How Does Netflix Build and Deploy Software?

I personally have not had contact with Netflix or any developers that work there.  However, they posted an article a little while back on how they develop software.  I strongly encourage you to read it before reading this post.  We can learn a lot from it.

https://medium.com/netflix-techblog/how-we-build-code-at-netflix-c5d9bd727f15

Have you read it yet?  You will get a lot more out of this post if you read it first.

Personally, I care less about which tools they use than about the processes they follow.

Let’s analyze the steps they take.

  1. Code is built/tested on the developer’s machine using a tool.
  2. Changes are checked in.
  3. A continuous integration server, Jenkins, is used to build, test, and package the code.  The same tool is used to build and test the code on the continuous integration server as is used on each developer’s machine (common environment).
  4. They deploy.
  5. Another test is executed after it is deployed.

Yes, there are other steps, but let’s focus on those steps.
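To make the shape of that process concrete, here is a minimal Python sketch of a fail-fast pipeline.  The stage names and the `run_pipeline` helper are my own hypothetical illustration, not anything taken from Netflix's tooling.  The only point is that each stage runs in order, and a failure stops everything after it.

```python
from typing import Callable, List, Tuple

# Each stage is a (name, action) pair; the action returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, action in stages:
        if not action():
            print(f"stage failed: {name}")
            break
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Placeholder actions; a real pipeline would call build and test tools here.
    stages = [
        ("local build and test", lambda: True),
        ("check in", lambda: True),
        ("CI build, test, package", lambda: True),
        ("deploy", lambda: True),
        ("post-deployment test", lambda: True),
    ]
    print(run_pipeline(stages))
```

Because a failed stage stops the run, nothing gets deployed unless every earlier stage succeeded.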

Building Code

Tools

Netflix uses a tool that helps manage dependencies.  I have seen workplaces that have issues with different compiler versions, different tools, different dependencies that are installed, and other environmental issues that wreak havoc on running code or reproducing issues.

This is such a waste of time!  Your time is not best spent on debugging these types of issues!  I have seen way too much time spent on what I call time-sinks.  Once you debug this type of issue, you do not have a process that makes things better.  You may understand your environment better, but it does not help others.  Perhaps if you document it, it could help.  But there is a better way.  Make it part of a repeatable process.

Documentation goes out of date.  Using a dependency/environment tool to enforce these dependencies IS part of a repeatable process.  Rather than sink time into debugging why the environment is not set up properly, that time is much better spent ensuring the environment is the same for everybody.  Not just documenting it, but enforcing it.  So if something changes, it is not something that people must read about and update manually.  It automatically changes for them.

My tool of choice for this is Docker (or a VM if Docker will not work).  An example of this is the unit test Docker container that I created.  The continuous integration server can use that container… as well as each and every developer, regardless of what the rest of their environment is.  That being said, use whatever tool works for you.
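Here is a minimal sketch of that idea, assuming Docker is installed and the project's tests can be started with a single command.  The image name `unit-test-env`, the `/workspace` mount point, and both helper functions are hypothetical; only the `docker run` flags (`--rm`, `-v`, `-w`) are real.  A developer and the continuous integration server would both call the same function, so the tests run in the same environment either way.

```python
import subprocess
from pathlib import Path
from typing import List


def docker_test_command(project_dir: str, image: str, test_cmd: List[str]) -> List[str]:
    """Build a `docker run` command that runs test_cmd inside the image.

    The project directory is mounted at /workspace (an arbitrary choice)
    and used as the working directory.
    """
    source = Path(project_dir).resolve()
    return [
        "docker", "run",
        "--rm",                        # remove the container when it exits
        "-v", f"{source}:/workspace",  # mount the source tree into the container
        "-w", "/workspace",            # run the tests from the mounted tree
        image,
        *test_cmd,
    ]


def run_tests_in_container(project_dir: str, image: str, test_cmd: List[str]) -> int:
    """Run the tests in the container and return the process exit code."""
    return subprocess.run(docker_test_command(project_dir, image, test_cmd)).returncode
```

For example, `run_tests_in_container(".", "unit-test-env", ["python", "-m", "unittest"])` would run the test suite inside the container and return a non-zero exit code if any test fails.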

Continuous Integration

Before code is production-ready, it needs to be verified automatically by a server that is not used for development.  This verification should not be manual, because that is another time-sink.  You do it once, and that effort does not carry over to the next time it needs to be run.  You are much better off spending that time developing something else.

Netflix uses Jenkins as a continuous integration server.  Not only does it automatically test their code, but it also packages up their code in preparation to be deployed.  There is no point in manually preparing your code for deployment.  Once again, this is a time-sink.  Doing this process by hand does not build up any automated solution that can be improved upon over time.  An automated process is also consistent, unlike a manual one.

Deployment

Generally a lot of planning, manual testing, and manual effort goes into deploying software.  Everybody is in a rush to get it done, and testing occurs at the end.  It hopefully does not find anything, but it usually does.  Then you have to go back to debugging/fixing/testing again.

Netflix has a ‘bake’ routine in order to deploy their packages automatically on the Amazon cloud.  The whole thing has been automated.

Can you imagine how much time that saves?  Rather than a lot of manual testing, planning for deployment, manually creating branches in source control, and other things, it is an automated process.  Something that can be improved upon.  And something that probably saves hundreds, if not thousands or tens of thousands, of man-hours.

Post-Deployment Testing

Once everything is deployed, there are ‘custom’ ways of testing the deployment.  Even when everything seems to have gone right and is well-tested before the software has been deployed, it is still worth checking that the deployed code is working.  Perhaps this means a REST interface can be exercised in certain ways.  Maybe it means Selenium can be used to drive some things in a web browser.  Whatever it means, it frees you from having to check the system by hand once it is deployed.  Once again, even more time is saved.
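As one example of exercising a REST interface after deployment, here is a minimal smoke check using only the Python standard library.  The `endpoint_is_healthy` name and the idea of a `/health` URL are my own assumptions; use whatever endpoint your deployed service actually exposes.

```python
import urllib.error
import urllib.request


def endpoint_is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False
```

A deployment script could call this against the freshly deployed service (for example, a hypothetical `https://example.com/health`) and fail the deployment if it returns False.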

In order to test the deployment, you must have the actual equipment the software is being deployed on.  This is not as big of a deal when it is deployed in the cloud or on some other easily accessible piece of hardware.  It is worth noting, however, that this type of testing cannot be emulated.  It is testing the actual deployment, so running against a fake or simulated target does not accomplish this.

You Must Have the Right Culture

I really appreciate that the Netflix post includes some content on creating the type of culture that promotes these types of processes.  They actually have an entirely different presentation on their site devoted to it, but that is perhaps a topic for another post.

The one thing I want to emphasize is that even though they have a nice system, they are trying to make it even better.  The ‘bake’ time they reference takes too long, and dependencies are not resolved well enough with Nebula, their build tooling.  They do not say, “Hey, this thing is pretty good.  Not perfect, but it’ll work.”  Instead, they make it even better.  It will take even less time to run.  It will cause even fewer headaches with dependency management.  Continuous improvement makes the best development teams even better.

In order for this to happen, you need buy-in from both management and other developers.  Other developers must be willing to help or, at a minimum, to use the solution that is developed.  Management must be willing to allow time for infrastructure to be worked on.  If the team is always ‘firefighting’, attempting to get the next feature or delivery out the door, and is never improving existing processes or workflows, it will never get better.  The culture must be changed.

It is worth the time and effort to implement these processes that make your team better.  Perhaps not in the short term (for a tight deadline), but if work is never done towards improvement, and the team is always trying to meet a deadline, things will never change.  That might work for those who enjoy that type of environment, but it is not good for customers (missed deadlines) or those developers who like the higher level of quality and predictability that these infrastructure changes can provide.

How Can You Use This Information?

  1. You must cultivate the right culture to support these improvements.  Hopefully I’ll cover this in a future post.
  2. If you don’t use continuous integration already, start using it.  You can start by writing a unit test against a very small piece of code.  Then find a server that can check out that code and run that test.  Preferably, run it in an environment/dependency tool that can be run on your local machine.
  3. If you have a special process for deploying your code, make it automatic.  Once again, start small and work your way up.
  4. Test after you deploy.
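For step 2, the first test really can be tiny.  Here is a sketch using Python's built-in `unittest` module; `apply_discount` is a made-up function standing in for whatever small piece of your own code you pick first.

```python
import unittest


def apply_discount(price: float, percent: float) -> float:
    """Hypothetical function under test: reduce price by percent."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


class ApplyDiscountTest(unittest.TestCase):
    def test_reduces_price(self):
        self.assertEqual(apply_discount(200.0, 25), 150.0)

    def test_rejects_invalid_percent(self):
        with self.assertRaises(ValueError):
            apply_discount(200.0, 150)
```

Saved as a file whose name starts with `test`, this runs with `python -m unittest` on your machine, and a continuous integration server can run exactly the same command after checking out the code.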

What can you apply from Netflix’s example to your company or culture?
