by James Smith, Bugsnag CEO
Over the last three months, the media, including Cyber Defense Magazine, have reported what seems to be one computer bug after another, including the Meltdown and Spectre Intel CPU bugs, the Apple ‘chaiOS’ bug that crashes the Messages app, cryptocurrency mining infections, and the latest uTorrent bug that lets websites control computers to steal information.
Let’s face it. The technology we rely on for business and pleasure today has bugs. While overall code quality has improved, with better programming practices and automated quality assurance testing, there is so much more software now than there was even five years ago that it is almost inevitable that some bugs will make it out of development into production.
Those with the biggest impact will inevitably grab headlines, but many of us are responsible for applications that are critical for our organizations and customers, and we are not immune to this trend. So it is important for every software professional to ensure that robust systems are in place to catch bugs in development as well as in production.
Over the last several decades, computer bugs have been responsible for major data breaches, privacy issues, and crashed rockets and trains. One class of bugs, security holes that malware can exploit, led to the explosion of “bug bounties.” Last year, just one company, Google, offered to pay $1 million to external researchers to find security bugs within its code.
But just as important are bugs that cause the software to behave in unexpected ways, or to crash altogether. These bugs can slip through QA tests, especially when they involve new devices, interactions between humans and the software, or unintended usage that has not been tested for.
As a reminder of just how far-reaching these threats are, let’s take a step back and remember how long computer bugs have been an issue and take note of some of the major havoc they’ve caused.
Many people don’t remember that the first recorded computer “bug” was an actual moth that Grace Hopper discovered in 1947, stuck between the relays of the Harvard Mark II computer. Her notes of the encounter involved the very first use of the term “debugging” in relation to computers.
Since then, software bugs have been a fact of life. Here are four of the most notorious and costly bugs of the last 40 years in terms of money and lives.
Therac-25 causes radiation overdoses
One of the most tragic software bug stories involves the Therac-25, a machine meant to deliver radiation therapy to cancer patients. In 1985, concurrent programming errors caused the machine to mistakenly deliver an overdose of radiation hundreds of times greater than normal, killing three patients and causing debilitating injuries to at least three others.
The previous model of this device had both hardware and software controls to ensure the correct dose of radiation, but the Therac-25 model removed the hardware controls under the false assumption that the software controls were sufficient. It turns out there was an error in the software controls that when combined with human interaction could lead to a deadly overdose. The new device was never fully tested with the software-only controls. The only indication that something was wrong was an ambiguous error code that did not state the severity of the error and did not block the operator from continuing to administer a fatal dose.
A 64-bit software bug squeezes into a 16-bit processor to crash the first Ariane 5 rocket
To the horror of onlookers and European Space Agency employees, a software bug caused the first Ariane 5 rocket to flip 90 degrees and explode shortly after liftoff on June 4, 1996. The ESA used the Ariane 5 to deliver payloads to low Earth orbit. The cost of the crash exceeded $370 million.
The fault was identified as a software bug in the rocket’s Inertial Reference System used to determine whether it was pointing up or down based on a 16-bit integer. Again, this software was a holdover from the previous generation rocket, which reported velocities using a 16-bit integer. The new Ariane 5 was faster and more sensitive and used a 64-bit floating point value for its velocity reports. This mismatch between the two systems proved to be disastrous.
For the first few seconds of flight, the rocket’s velocity was low, so the conversion between these two values was successful. However, as the rocket’s velocity increased, the 64-bit value exceeded the capacity of the 16-bit integer in the IRS software. Suddenly, the rocket thought it was pointed in the wrong direction, and the software overcorrected, essentially flipping the rocket in mid-flight and leading to failure.
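One way to guard against this kind of narrowing conversion is to check the range before converting. The Python sketch below is illustrative only and is not the Ariane flight software. The function names are hypothetical, and it assumes the caller decides whether an out-of-range value should raise an error or be clamped.

```python
# Illustrative sketch only: hypothetical helpers, not the Ariane 5 code.
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_checked(value: float) -> int:
    """Convert a float to a signed 16-bit integer, refusing out-of-range input."""
    truncated = int(value)  # int() truncates toward zero
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit in a signed 16-bit integer")
    return truncated


def to_int16_saturating(value: float) -> int:
    """Convert a float to a signed 16-bit integer, clamping to the valid range."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


if __name__ == "__main__":
    print(to_int16_checked(1234.9))       # small values convert without trouble
    print(to_int16_saturating(40000.0))   # large values are clamped, not wrapped
    try:
        to_int16_checked(40000.0)
    except OverflowError as err:
        print("rejected:", err)           # the failure is explicit and can be handled
```

Either choice makes the out-of-range case a visible, tested decision rather than an assumption inherited from an older, slower system.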
Mars Climate Orbiter burns up in space
On Sept. 23, 1999, NASA lost its $235 million Mars Climate Orbiter spacecraft because two software teams used two different units of measurement: imperial pound-seconds and metric newton-seconds. In an unfortunate fluke, at distances close to Earth, the units looked nearly identical numerically, and the differences were not caught in pre-flight testing.
Larger deviations as the mission progressed were attributed to solar wind or dust particles affecting the orbiter speed, which were expected and manually corrected. But by the time the orbiter approached Mars, the accumulated error was too much, and the time lag between Mars and Earth left no time for corrections when things started to go wrong. The incorrect calculation of the spacecraft’s trajectory led it to burn up in the Martian atmosphere.
This story is a good reminder to make sure software teams are working closely, to test the full range of anticipated conditions before releasing, and to not make assumptions about the source of errors before correcting them.
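A unit mismatch like this one is easier to catch when the unit travels with the number. The following Python sketch shows one possible approach, assuming a hypothetical `Impulse` type that stores every value in newton-seconds and converts at the boundary. It is not based on NASA's software.

```python
# Illustrative sketch only: the Impulse type is hypothetical.
from dataclasses import dataclass

# One pound-force second expressed in newton-seconds.
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """An impulse value that is always stored in newton-seconds."""

    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value: float) -> "Impulse":
        return cls(float(value))

    @classmethod
    def from_pound_seconds(cls, value: float) -> "Impulse":
        # Convert once, at the point where the imperial value enters the system.
        return cls(float(value) * LBF_S_TO_N_S)

    def __add__(self, other: object) -> "Impulse":
        # Only Impulse values can be combined; a bare number is rejected.
        if not isinstance(other, Impulse):
            return NotImplemented
        return Impulse(self.newton_seconds + other.newton_seconds)


if __name__ == "__main__":
    total = Impulse.from_newton_seconds(10.0) + Impulse.from_pound_seconds(1.0)
    print(total.newton_seconds)
```

With a type like this, adding a bare number to an `Impulse` raises a `TypeError`, so a value with an unknown unit cannot silently slip into a trajectory calculation.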
Knight Capital Loses $460 million in 45 minutes
On Aug. 1, 2012, Knight Capital deployed a new software update to its production server. What they did not notice was that someone had accidentally reactivated, in the production software, a defunct internal testing subroutine that was last used in 2003. This subroutine was designed to stress test the software for trade volume, generating a high number of trades without regard to whether they were good trades or not.
Just 45 minutes into trading that day, the misconfigured, outdated program generated more than 4 million faulty trades resulting in losses. This was compounded by other trading systems detecting these seemingly bizarre trades and amplifying them for their own gain. Ultimately, these 45 faulty minutes cost the financial firm more than $460 million on behalf of one of its retail investors and later resulted in an SEC fine of $12 million.
An SEC report later determined the problem was based on a lack of formal code reviews and quality assurance processes that might have identified and removed the dead code that produced the error.
History has shown that computer bugs can cause harm in a variety of ways: exploited from the outside by cyber criminals, or introduced from the inside through lack of oversight, inadequate programming processes, bad assumptions and human error.
Robust quality assurance systems and automated testing tools can help reduce errors released to production. Bug bounty programs, both external and internal, can be helpful in identifying security bugs and performance issues in production, as can the judicious use of ethical hackers. Tools such as APM and logging can help identify performance issues in production that are caused by infrastructure issues or by software inefficiencies. And production error monitoring tools can capture live errors from production for analysis, prioritization, and resolution long before trouble tickets and support calls identify such issues.
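To make the idea of capturing live errors concrete, here is a minimal Python sketch built on the standard `sys.excepthook` mechanism. It does not use any particular monitoring product's API. The `captured_errors` list and the function names are hypothetical stand-ins for whatever backend a real tool would report to.

```python
# Illustrative sketch only: a stand-in for a real error monitoring tool.
import sys
import traceback
from datetime import datetime, timezone

# Hypothetical in-memory store; a real tool would send reports to a service.
captured_errors = []


def build_report(exc_type, exc_value, exc_tb):
    """Collect the details an engineer needs to triage an error."""
    return {
        "time": datetime.now(timezone.utc).isoformat(),
        "type": exc_type.__name__,
        "message": str(exc_value),
        "stacktrace": "".join(
            traceback.format_exception(exc_type, exc_value, exc_tb)
        ),
    }


def report_uncaught(exc_type, exc_value, exc_tb):
    """Record an uncaught exception, then fall back to the default handler."""
    captured_errors.append(build_report(exc_type, exc_value, exc_tb))
    sys.__excepthook__(exc_type, exc_value, exc_tb)


def install():
    """Route every uncaught exception in this process through report_uncaught."""
    sys.excepthook = report_uncaught


install()
```

Once installed, any exception that reaches the top of the program is recorded with its type, message, and stack trace before the process exits, which is the raw material for the analysis and prioritization described above.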
With these systems in place, organizations can be proactive about security, performance, and stability of their applications and instill a culture of quality and end-to-end ownership of the application. At the end of the day, it all comes down to prevention and early detection.
About the Author
Running Bugsnag as CEO and co-founder.
Entrepreneur and software engineer with a broad base of experience. Passionate about building great products, growing teams and scaling infrastructure and data.
I’ve also created a number of popular open source projects (https://github.com/loopj) which are used by companies such as Twitter, Pinterest and Trello.