Disaster at Knight: Only One Thing Could Have Gone Wrong
They botched the Software Quality Assurance (QA)…stating the obvious…Res Ipsa Loquitur…(the thing speaks for itself). But the $400 million question is HOW they botched the application QA.
This is not the first time in my career I have seen this. It is the third.
The first time was at a UK company in the newly formed UK power market. A big vendor's developers had botched the (manual) bidding code such that it posted bids as asks and vice versa. $80 million went down the drain before they found the error.
The second was a hedge fund that had botched the automated booking process in its ETRM (Energy Trading and Risk Management) system, causing the portfolio to be mis-hedged.
First some background on QA steps.
COMPILE
After code is written, it is compiled. In a way, compiling code is a form of testing: if the code compiles, there are no obvious errors that prevent it from running. But running is different from working as intended. The dirty secret of software development is the testing manifesto, "if it compiles, it ships." Ever get a software update where the application starts and runs fine, but interacting with it throws errors? You've got a team adhering to "if it compiles, it ships."
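The gap between "compiles" and "works" is easy to demonstrate. A minimal sketch in Python (which compiles source to bytecode when the file is loaded), using a hypothetical order-posting function: the file loads without complaint, but the bug only surfaces once someone actually exercises the code.

```python
def post_order(side, price):
    """Post a bid or ask (hypothetical). It compiles fine -- but the
    sides are swapped, echoing the UK power-market story above."""
    if side == "bid":
        return ("ask", price)   # bug: bids go out as asks
    return ("bid", price)       # bug: asks go out as bids

# The file compiles and loads cleanly. Nothing breaks until a
# user interacts with it:
print(post_order("bid", 100.0))   # the bid goes out as an ask
```

Run it and the swapped side appears only at call time, never at compile time, which is exactly why "if it compiles, it ships" is not a testing strategy.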
There are many types of testing, but without getting into semantic games, here’s a simple breakdown.
UNIT TESTING
Unit testing is the lowest level of code testing. It is written by programmers and focuses on independently testing the smallest units of code. Once unit tests are fully in place, they can be run anytime code is added or modified to ensure that existing functionality has not been affected. This allows complex systems to be dynamic and still stable. What unit testing does not do is test how the various units function together.
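Here is a minimal sketch of what that looks like in Python's standard `unittest` framework. The function under test, `mid_price`, is a hypothetical helper; the point is that one small unit is exercised in isolation, with both the happy path and a failure case pinned down.

```python
import unittest

def mid_price(bid, ask):
    """Return the midpoint of a bid/ask pair (hypothetical helper)."""
    if bid > ask:
        raise ValueError("bid cannot exceed ask")
    return (bid + ask) / 2

class TestMidPrice(unittest.TestCase):
    def test_midpoint(self):
        # The smallest possible check: known inputs, known output.
        self.assertEqual(mid_price(99.0, 101.0), 100.0)

    def test_crossed_market_rejected(self):
        # A crossed market should be rejected, not silently averaged.
        with self.assertRaises(ValueError):
            mid_price(101.0, 99.0)

if __name__ == "__main__":
    unittest.main(exit=False)
```

Because tests like these are cheap to run, they can be re-run on every code change to confirm that existing behavior still holds.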
FUNCTIONAL TESTING
Once each individual unit of code works as intended, the full code base needs to be tested together. In other words, does the whole application function as intended? The scope of such testing is captured in 'Test Scripts', which describe a set of discrete system actions with data inputs that produce a set of expected results. In other words, I type my ID and password and click 'Login' with an expected result of logging in successfully. These can be run manually or even automated with software tools, since they are the same actions and results each time.
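The action/input/expected-result structure of a test script can be sketched directly in code. This is a hedged illustration: the `App` class is a toy stand-in for a real application driven through its UI or API (a real project would typically use an automation tool such as Selenium), but the shape of the script is the same.

```python
class App:
    """Toy application with a login screen (hypothetical stand-in)."""
    USERS = {"jdoe": "s3cret"}

    def login(self, user_id, password):
        return self.USERS.get(user_id) == password

def run_script(app):
    """A test script: discrete actions, fixed inputs, expected results."""
    steps = [
        # (action description,            inputs,             expected)
        ("login with valid credentials",  ("jdoe", "s3cret"), True),
        ("login with a wrong password",   ("jdoe", "nope"),   False),
        ("login with an unknown user",    ("nobody", "x"),    False),
    ]
    return [(action, app.login(*inputs) == expected)
            for action, inputs, expected in steps]

for action, passed in run_script(App()):
    print(f"{'PASS' if passed else 'FAIL'}: {action}")
```

Because the actions and expected results never change between runs, a script like this is a natural candidate for automation.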
This phase of testing is really only as good as the testers or the person who automated the scripts. I had a buddy running development and the testing consultants (who seemed to come out of a yellow bus with backpacks) said the testing had failed. The test script said “Enter a pipeline company” in XYZ field. They failed it because there was no company called “A Pipeline Company” in the drop down. I wish I was kidding.
UAT
Finally, the last step is User Acceptance Testing, or UAT. This is where the users, those who work with the application on a day-to-day basis, actually test the application for themselves. In addition to testing their day-to-day activities (which may be just running through test scripts), they should also conduct what I like to call 'monkey testing'. In other words, beat on the system like a monkey and try to break the darn thing. This is an important catch-all phase; however, it's always tough because getting time from the actual users is a challenge. But at the end of the day, they have to put their name on it.
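The 'monkey testing' idea can even be partly automated, in the spirit of fuzz testing. A hedged sketch, with a hypothetical order-entry function: hammer it with random junk and check that every input is either accepted or rejected cleanly, and that nothing makes it crash in an unexpected way.

```python
import random
import string

def submit_order(side, quantity):
    """Hypothetical order-entry function under test."""
    if side not in ("bid", "ask"):
        raise ValueError("unknown side")
    if not isinstance(quantity, int) or quantity <= 0:
        raise ValueError("bad quantity")
    return {"side": side, "quantity": quantity}

def monkey_test(runs=1000, seed=42):
    """Beat on the system with random inputs; count clean outcomes."""
    rng = random.Random(seed)
    clean = 0
    for _ in range(runs):
        side = "".join(rng.choice(string.ascii_lowercase) for _ in range(3))
        qty = rng.randint(-10, 10)
        try:
            submit_order(side, qty)
            clean += 1              # accepted -- fine
        except ValueError:
            clean += 1              # rejected cleanly -- also fine
        # any other exception escapes and fails the monkey test
    return clean

print(monkey_test())
```

A human monkey-tester is still better at finding the truly weird paths, but a loop like this catches the crashes nobody thought to script.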
Where’d Knight Go Wrong?
If you have to point a finger, UAT is where the blame gets assigned. For a publicly traded company, there are Sarbanes-Oxley (SOX) implications to not running through test scripts. On one project for a large enterprise application, the test scripts had to be put in paper form for SOX, and they literally filled a 10×12 room.
I don’t know if offshore resources were involved. In recent years there has been a large push to do testing offshore, but in my experience, coordinating testing in a remote location adds another level of complexity to the mix. When running test scripts in a far-away land, you are betting that a remote person you’ve never met knows the difference between a bid and an ask, a put and a call, a long call versus a short put. My experience is that this distance can foster a lack of accountability. But most of all, do you know what it is like to get a trader to talk to the testing team at 9PM? Brutal.