Reducing Human Error in Software Based Services

I removed the second half of this, because the ideas weren’t fully thought out yet.

As I’m writing this Texas and Louisiana continue to deal with the impact of Hurricane Harvey. Hurricane Irma is heading towards Florida. Los Angeles just experienced the biggest fire to burn in its history (La Tuna Fire). And in the last three months there have been two different collisions between US Navy vessels and civilian ships that resulted in 17 fatalities and multiple injuries.

The interaction and relationships between people, actions, and events in high risk endeavors is awe inspiring. Put aside the horrific loss of life and think of the amount of stress and chaos involved. Imagine knowing that your actions can have irreversible consequences. Though they can’t be changed I’m fascinated with efforts to prevent them from repeating.

Think of those interactions between people, actions and events as change. There are examples of software systems having [critical][1] or [fatal][2] [consequences][3] when failing to handle them. For most of us the impact might be setbacks delaying work or at most a financial consequence to our employer or ourselves. While the impact may differ there are benefits from learning from professions other than our own that deal with change on a daily basis.

Our job as systems or ops engineers should be on how to build, maintain, troubleshoot, and retire the systems we’re responsible for. But there’s been a shift building that has us focusing more on becoming experts at evaluating new technology.

Advances in our tooling has allowed us to rebuild, replace, or re-provision from failures. This starts introducing complacency, because the tools start to have more context of the issues than us. It shifts our focus away from reaching a better idea of what’s happening.

As the complexity and the number of systems involved increases our ability to understand what is happening and how they interact hasn’t kept up. If you have any third-party dependencies, what are the chances they’re going through a similar experience? How much of an impact does this have on your understanding of what’s happening in your own systems?

Atrophy of Basics Skills

The increased efficiency of our tooling creates a [Jevons paradox][4]. This is an economic idea that as the efficiency of something increases it will lead to more consumption instead of reducing it. Named after William Jevons who in the 19th century noticed that the consumption of coal increased after the release of a new steam engine design. The improvements with this new design increased the efficiency of the coal-fired steam engine over it’s predecessors. This fueled a wider adoption of the steam engine. It became cheaper for more people to use a new technology and this led to the increased consumption of coal.

For us the engineer’s time is the coal and the tools are the new coal-fired engine. As the efficiency of the tooling increases we tend to use more of the engineer’s time. The adoption of the tooling increases while the number of engineers tends to remain flat. Instead of bringing in more people we tend to try to do more with the people we have.

This contributes to an atrophying of the basic skills needed to do the job. Things like troubleshooting, situational awareness, and being able to hold a mental model of what’s happening. It’s a journeyman’s process to build them. Actual production experience is the best teacher and the best feedback is from your peers. Tools are starting to replace the opportunities for people to have those experiences to learn from.

Children of the Magenta and the Dangers of Automation

For most of us improving the efficiency of an engineer’s time will look like some sort of automation. And while there are obvious benefits there are some not so obvious negatives. First, automation can hide the context of what is, has, and will happen from us. How many times have you heard or asked yourself “What’s it doing now?”

“A lot of what’s happening is hidden from view from the pilots. It’s buried. When the airplane starts doing something that is unexpected and the pilot says ‘hey, what’s it doing now?’ — that’s a very very standard comment in cockpits today.'”
– William Langewiesche, journalist and former American Airlines captain.

In May 31, 2009 228 people died when Air France 447 lost altitude from 35,000 feet and pancaked into the Atlantic Ocean. A pressure probe had iced over preventing the aircraft from determining its speed. This caused the autopilot to disengage and the “fly-by-wire” system switched into a different mode.

{% include youtubePlayer.html id=page.AirFrance447 %}

“We appear to be locked into a cycle in which automation begets the erosion of skills or the lack of skills in the first place and this then begets more automation.” – William Langewiesche

Four years later Asiana Airlines flight 214 crashed on their final approach into SFO. It came in short of the runway striking the seawall. The NTSB report shows the flight crew mismanaged the initial approach and the aircraft was above the desired glide path. The captain responded by selecting the wrong autopilot mode, which caused the auto throttle to disengage. He had a faulty mental model of the aircraft’s automation logic. This over-reliance on automation and lack of understanding the systems was cited as major factors leading to the accident.

This has been described as “Children of the Magenta” due to the information presented in the cockpit from the autopilot being magenta in color. It was coined by Capt. Warren “Van” Vanderburgh at American Airlines Flight Academy. There are different levels of automation in an aircraft and he argues that by reducing the level of automation you can reduce the workload in some situations. The amount of automation should match the current conditions of the environment. It’s a 25 minute video that’s worth watching, but it boils down to this. Pilots have become too dependent on automation in general and are losing the skills needed to safely control their aircraft.

{% include youtubePlayer.html id=page.ChildrenOfMagenta %}

This led a Federal Aviation Administration task force on cockpit technology to urge airlines to have their pilots spend more time flying by hand. This focus of returning to the basic skills needed is similar to the [report released][5] from the Government Accountability Office (GAO) regarding the impact of maintenance and training on the readiness of the US Navy.

Based on updated data, GAO found that, as of June 2017, 37 percent of the warfare certifications for cruiser and destroyer crews based in Japan—including certifications for seamanship—had expired. This represents more than a fivefold increase in the percentage of expired warfare certifications for these ships since GAO’s May 2015 report.

Complexity and Chaos

Complexity has a tendency to bring about chaos. Due to the difficulty of people to understand the system(s) and an incomplete visibility into what is happening. If this is the environment that we find ourselves working in we can only control the things we bring into it. That includes making sure we have a grasp on the basics of our profession and maintain the best possible understanding of what’s happening with our systems. This should allow us to work around issues as they happen and decrease our dependencies on our tools. There’s usually more than one way to get information or make a change. Knowing the basics and staying aware of what is happening can get us through the chaos.

Valid Security Announcements

A checklist item for the next time you need to create a landing page for a security announcement.

Make sure the certificate and the whois on the domain being used actually references the name of your company.

My wife sends me a link to this.

I then find the page for the actual announcement from Equifax.

Go to the dedicated website www.equifaxsecurity2017.com and find it’s using a Cloudflare SSL certificate.

The certificate chain doesn’t mention Equifax other than the DNS names used in the cert (*.equifaxsecurity2017.com and equifaxsecurity2017.com).

What happens if I do a whois?

$ whois equifaxsecurity2017.com
   Domain Name: EQUIFAXSECURITY2017.COM
   Registry Domain ID: 2156034374_DOMAIN_COM-VRSN
   Registrar WHOIS Server: whois.markmonitor.com
   Registrar URL: http://www.markmonitor.com
   Updated Date: 2017-08-25T15:08:31Z
   Creation Date: 2017-08-22T22:07:28Z
   Registry Expiry Date: 2019-08-22T22:07:28Z
   Registrar: MarkMonitor Inc.
   Registrar IANA ID: 292
   Registrar Abuse Contact Email: abusecomplaints@markmonitor.com
   Registrar Abuse Contact Phone: +1.2083895740
   Domain Status: clientDeleteProhibited https://icann.org/epp#clientDeleteProhibited
   Domain Status: clientTransferProhibited https://icann.org/epp#clientTransferProhibited
   Domain Status: clientUpdateProhibited https://icann.org/epp#clientUpdateProhibited
   Name Server: BART.NS.CLOUDFLARE.COM
   Name Server: ETTA.NS.CLOUDFLARE.COM
   DNSSEC: unsigned

Now I want to see if I’m impacted. Click on the “Check Potential Impact” and I’m taken to a new site (trustedidpremier.com/eligibility/eligibility.html).

And we get another certificate and a whois lacking any reference back to Equifax.

$ whois trustedidpremier.com
   Domain Name: TRUSTEDIDPREMIER.COM
   Registry Domain ID: 2157515886_DOMAIN_COM-VRSN
   Registrar WHOIS Server: whois.registrar.amazon.com
   Registrar URL: http://registrar.amazon.com
   Updated Date: 2017-08-29T04:59:16Z
   Creation Date: 2017-08-28T17:25:35Z
   Registry Expiry Date: 2018-08-28T17:25:35Z
   Registrar: Amazon Registrar, Inc.
   Registrar IANA ID: 468
   Registrar Abuse Contact Email: registrar-abuse@amazon.com
   Registrar Abuse Contact Phone: +1.2062661000
   Domain Status: ok https://icann.org/epp#ok
   Name Server: NS-1426.AWSDNS-50.ORG
   Name Server: NS-1667.AWSDNS-16.CO.UK
   Name Server: NS-402.AWSDNS-50.COM
   Name Server: NS-934.AWSDNS-52.NET
   DNSSEC: unsigned

I’m not suggesting that the site equifaxsecurity2017 is malicious, but if you’re going to the trouble of setting up a page like this make sure your certificate and whois actually references back to the company making the announcement. If you look at the creation dates for the domain and the Not Valid Before dates on the certs they had plenty of time to get domains and certificates created that would reference themselves.

How I Minimize Distractions

Being clear on what’s important. Being honest with myself about my own habits. Protect how and where my attention is being pulled. Manage my own calendar.

I know I’m going to want to work on personal stuff, surf a bit, and do some reading. Knowing that I’m a morning person I start my day off before work on something for myself. Usually it falls under reading, writing, or coding.

After that I usually scan through my feed reader and read anything that catches my eye, surf a few regular sites and check my personal email.

When I do start work the first thing I do is about one hour of reactive work. All of those notifications and requests needing my attention get worked on. Right now this usually looks like catching up on email, Slack, Github, and Trello.

I then try to have a few chunks of time blocked off in my calendar throughout the week as “Focused Work”. This creates space to focus on what I need to work on while leaving time to be available for anyone else.

The key has been managing my own calendar to allow time for my attention to be directed on the things I want, the things I need, and the needs of others.

I do keep a running text file where I write out the important or urgent things that need to be done for the day. I’ll also add notes when something interrupts me. When I used to write this out on paper I used this sheet from Dave Seah called The Emergent Task Timer. I found it helped to write out what needs to be done each day and to track what things are pulling my attention away from more important things.

Because the type of work that I do can be interrupt driven I orientate my team in a similar fashion. By creating that same uninterrupted time for everyone it allows us to minimize the impact of distractions. During the week there are blocks of time where one person is the interruptible person while everyone else snoozes notifications in Slack, ignores email and gets some work done.

This also means focusing on minimizing the impact of alert / notification fatigue. Have a zero tolerance on things that page you repeatedly or adds unnecessary notifications to your day

The key really is just those four things I listed at the beginning.

You have to be clear in what is important both for yourself and for those you’re accountable to.

You have to be honest with yourself about your own habits. If you keep telling yourself you’re going to do something, but you never do… well there’s probably something else at play. Maybe you’re more in love with the idea rather than with the doing it.

You need to protect your attention from being pulled away for things that are not important or urgent.

You can work towards those three points by managing your own calendar. Besides tracking the passage of days a calendar schedules where your attention will be focused on. If you don’t manage it others will manage it for you.

I view these as ideals that I try to aim for. The circumstances I’m in might not allow me to always to so, but it does give me direction on how I work each day.

And yeah when I’m in the office I use headphones.

Why We Use Buzzwords

WebOps, SecOps, NetOps, DevOps, DevSecOps, ChatOps, NoOps, DataOps. They’re not just buzzwords. They’re a sign of a lack of resources.

I believe one reason why people keep coming up with newer versions of *Ops is there isn’t enough time or people to own the priorities that are being prepended to it. Forgot about the acronyms and the buzzwords and ask yourself why a new *Ops keep coming up.

It’s not always marketing hype. When people are struggling to find the resources to address one or multiple concerns they’ll latch onto anything that might help.

When systems administrators started building and supporting services that ran web sites we started calling it WebOps to differentiate that type of work from the more traditional SA role.

When DevOps came about, it was a reaction from SAs trying to build things at a speed that matched the needs of the companies that employed them. The success of using the label DevOps to frame that work encouraged others to orientate themselves around it as a way to achieve that same level of success.

Security is an early example and I can remember seeing talks and posts related to using DevOps as a framework to meet their needs.

Data related work seemed to be mostly absent. It seems instead we got a number of services, consultants, and companies that focused around providing a better database or some kind of big data product.

Reflecting back there are two surprises. The first being an early trend of including nontechnical departments in the DevOps framework. Adding your marketing department was one I saw a couple of different posts and talks on.

The second is even with the Software Defined Networking providing a programmatic path for traditional network engineering in the DevOps framework it really hasn’t been a standout. Most of the tools available are usually tied to big expensive gear and this seems to have kept network engineering outside of DevOps. The difference being if you are using a cloud platform that provides SDN then DevOps can cover the networking side.

ChatOps is the other interesting one, because it’s the one focused on the least technical thing. The difficulty people have communicating with other parts of their company and with easily finding information to basic questions that frequently come up.

This got me thinking of breaking the different types of engineering and/or development work needed into three common groups. And the few roles that have either been removed from some companies, outsourced, or they’re understaffed.

The three common groups include software engineer for front-end, back-end, mobile, and tooling. Systems engineering to provide the guidance and support for the entire product lifecycle; including traditional SA and operations roles, SRE, and tooling. The third is data engineering which covers traditional DBA roles, analytics, and big data.

Then you have QA, network engineering, and information security which seem to either have gone away unless you’re at a company that’s big enough to provide a dedicated team or has a specific requirement for them.

For QA it seems we’ve moved it from being a role into a responsibility everyone is accountable to.

Network engineering and security are two areas that I’m starting to wonder if they’ll be dispersed across the three common groups just as QA has been for the most part. Maybe SDN would move control of the network to software and system engineers? There is an obvious benefit to hold everyone accountable for security as we’ve done with QA.

This wouldn’t mean there’s no need for network and security professionals anymore, but that you would probably only find them in certain circumstances.

Which brings me back to my original point. Why the need for all of these buzzwords? With the one exception every single version of*Ops is focused on the non-product feature related work. We’re seeing reductions in the number of roles focused on that type of work while at the same time its importance has increased.

I think it’s important that we’re honest with ourselves and consider that maybe the way we do product management and project planning hasn’t caught up with the changes in the non-product areas of engineering. If it had do you think we would have see all of these new buzzwords?

It Was Never About Ops

For a while I’ve been thinking about Susan J. Fowler’s Ops Identitfy Crisis post. Bits I agreed with and some I did not.

My original reaction to the post was pretty pragmatic. I had concerns (and still do) about adding more responsibilities onto software engineers (SWE). It’s already fairly common to have them responsible for QA, database administration and security tasks but now ops issues are being considered as well.

I suspect there is an as of yet unmeasured negative impact to the productivity of teams that keep expanding the role of SWE. You end up deprioritizing the operations and systems related concerns, because new features and bug fixes will always win out when scheduling tasks.

Over time I’ve refined my reaction to this. The traditional operations role is the hole that you’ve been stuck in and engineering is how you get out of that hole. It’s not that you don’t need Ops teams or engineers anymore. It’s simply that you’re doing it wrong.

It was never solely about operations. There’s always been an implied hint of manual effort for most Ops teams. We’ve seen a quick return from having SWE handle traditional ops tasks, but that doesn’t mean that the role won’t be needed anymore. Previously we’ve been able to add people to ops to continue to scale with the growth of the company, but those days are gone for most. What needs to change is how we approach the work and the skills needed to do the job.

When you’re unable to scale to meet the demands you’ll end up feeling stuck in a reactive and constantly interruptible mode of working. This can then make operations feel more like a burden rather than a benefit. This way of thinking is part of the reason why I think many of the tools and services created by ops teams are not thought of as actual products.

Ever since we got an API to interact with the public cloud and then later the private cloud we’ve been trying to redefine the role of ops. As the ecosystem of tools has grown and changed over time we’ve continued to refine that role. While thinking on the impact Fowler’s post I know that I agree with her that the skills needed are different from they were eight years ago, but the need for the role hasn’t decreased. Instead it’s grown to match the growth of the products it has been supporting. This got me thinking about how I’ve been working during those eight years and looking back it’s easy to see what worked and what didn’t. These are the bits that worked for me.

First don’t call it ops anymore. Sometimes names do matter. By continuing to use “Ops” in the team name or position we continue to hold onto that reactive mindset.

Make scalability your main priority and start building the products and services that become the figurative ladder to get you out of the hole you’re in. I believe you can meet that by focusing on three things: Reliability, Availability, and Serviceability.

For anything to scale it first needs to be reliable and available if people are going to use it. To be able to meet the demands of the growth you’re seeing the products need to be serviceable. You must be able to safely make changes to them in a controlled and observable fashion.

Every product built out of these efforts should be considered a first class citizen. The public facing services and the internal ones should be considered equals. If your main priority is scaling to meet the demands of your growth, then there should be no difference in how you design, build, maintain, or consider anything no matter where it is in the stack.

Focus on scaling your engineers and and making them force multipliers for the company. There is a cognitive load placed on individuals, teams, and organizations for the work they do. Make sure to consider this in the same way we think of the load capacity of a system. At a time where we’re starting to see companies break their products into smaller more manageable chunks (microservices), we’re close to doing the exact opposite for our people and turning the skills needed to do the work into one big monolith.

If you’ve ever experienced yourself or have seen any of your peers go through burnout what do you think is going to happen as we continue to pile on additional responsibilities?

The growth we’re seeing is the result of businesses starting to run into web scale problems.

Web scale describes the tendency of modern sites – especially social ones – to grow at (far-)greater-than-linear rates. Tools that claim to be “web scale” are (I hope) claiming to handle rapid growth efficiently and not have bottlenecks that require rearchitecting at critical moments. The implication for “web scale” operations engineers is that we have to understand this non-linear network effect. Network effects have a profound effect on architecture, tool choice, system design, and especially capacity planning.

Jacob Kaplan-Moss

The real problem I think we’ve always been facing is making sure you have the people you need to do the job. Before we hit web scale issues we could usually keep up by having people pull all nighters, working through weekends or if you’re lucky hiring more. The ability for hard work to make up for any short comings in planning or preparations simply can no longer keep up. The problem has never been with the technology or the challenges you’re facing. It’s always been about having the right people.

In short you can …

  1. Expect services to grow at a non-linear rate.
  2. To be able to keep up with this growth you’ll need to scale both your people and your products.
  3. Scale your people by giving them the time and space to focus on scaling your products.
  4. Scale your products by focusing on Reliability, Availability, and Serviceability.

To think that new tools or services will be the only (or main) answer to the challenges you’re facing brought on from growth is a mistake. You will always need people to understand what is happening and then design, implement, and maintain your solutions. These new tools and services should increase the impact of each member of your team, but it is a mistake to think it will replace a role.