Wednesday, July 07, 2010

Cause of DBS's banking failure?

DBS, one of the largest banks in this part of the world, experienced a massive outage of its online banking and ATM facilities on Monday.

The timing of the failure is interesting, because 5 July 2010 does not seem to be a significant date. So what could've caused the problem?

I have no inside information, only the following observations:

1) The payment card industry (PCI) has data security standards (DSS). Version 1.2 of the standard called for the switchover from SSLv2 to SSLv3 on 1 July 2010.[1][2][3]

2) According to the CNA article, the bank discovered the problem at 3 am. As it turns out, 3 am is exactly 99 hours from midnight, 1 July 2010.
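The 99-hour figure is easy to check. A minimal sketch, assuming the switchover took effect at exactly midnight on 1 July and the discovery was at exactly 3 am on Monday, 5 July (the variable names are mine):

```python
from datetime import datetime

# Assumed timestamps: midnight on the switchover date, and the 3 am discovery.
switchover = datetime(2010, 7, 1, 0, 0)
discovered = datetime(2010, 7, 5, 3, 0)

# Four full days (96 hours) plus three hours.
elapsed_hours = (discovered - switchover).total_seconds() / 3600
print(elapsed_hours)  # 99.0
```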

My conjecture is that the terminals have a "heartbeat" and would attempt to connect to the back-end systems on an hourly basis, but would have been unable to do so since the protocol switchover on 1 July 2010. The data structure logging the re-connect attempts then overflowed on the 100th attempt, since 99+1 = 100 no longer fits in two digits.[4] If the first failed attempt was logged at midnight on 1 July, the 100th would fall at 3 am on 5 July, which matches the time the problem was discovered. If this exception is left unhandled then it would cause the system to halt.
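To make the conjecture concrete, here is a rough simulation. Everything in it is hypothetical: `TwoDigitCounter` and `hours_until_overflow` are names I made up, and I am assuming that the counter raises an error rather than silently wrapping around to zero. It is a sketch of the idea, not of DBS's actual systems.

```python
class TwoDigitCounter:
    """Hypothetical counter that can only hold the values 0 through 99."""

    MAX_VALUE = 99

    def __init__(self):
        self.value = 0

    def increment(self):
        # Assumption: exceeding two digits raises instead of wrapping to 00.
        if self.value + 1 > self.MAX_VALUE:
            raise OverflowError("counter cannot hold 100")
        self.value += 1


def hours_until_overflow(first_attempt_hour=0):
    """Return the hour offset (from the switchover) of the failing attempt.

    One failed re-connect attempt is logged per hour, starting at
    first_attempt_hour.
    """
    counter = TwoDigitCounter()
    hour = first_attempt_hour
    while True:
        try:
            counter.increment()
        except OverflowError:
            return hour
        hour += 1


print(hours_until_overflow(0))  # first attempt at midnight
print(hours_until_overflow(1))  # first attempt one hour after midnight
```

Under these assumptions the exact hour of the overflow depends on when the first failed attempt was logged: starting at midnight puts the 100th attempt 99 hours in (3 am on 5 July), while starting an hour later puts it 100 hours in (4 am).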

If this was indeed the root cause of failure then the band-aid fix would probably require a reconfiguration of the terminals to use SSLv3 or TLSv1.2. The real solution is to improve test coverage and perform a code review. At the operations level DBS would have to audit its processes to ensure compliance in future.

[1]http://www.pcicomplianceguide.org/pcifaqs.php#18
[2]http://blog.zenone.org/2009/03/pci-compliance-disable-sslv2-and-weak.html
[3]http://www.rapid7.com/vulndb/lookup/sslv2-enabled
[4]The offending data type is probably a two-digit unsigned numeric field, a.k.a. PIC 99 in COBOL, which can only hold values from 0 to 99.
