Australia 2016 – #Censusfail

There’s been no shortage of pain for both the ABS and Australians trying to fill out the census. This article will attempt to look at what may have caused this outage, and why.

Was the Census ‘hacked’?

No. The ABS has claimed the Census was the target of a Denial of Service (DoS) attack. This is an attempt to flood the available servers with fake requests, to the extent that legitimate users cannot connect to the server.

Was the Census attacked by foreigners?

It’s extremely unlikely. Currently, the census site is not accessible from outside Australia. This may be an attempt to fix the issues the site has been having. However, security experts have not seen the influx of traffic you would normally associate with a foreign-originating distributed denial of service (DDoS) attack.

Could the census site cope with the load?

There appear to be a few questionable decisions in the design of the Census site. First, it appears the census site isn’t using a content delivery network (CDN). Whether queried from Sydney or Melbourne, the census site appears to be served from Melbourne. It looks like IBM has rented some servers from “Nextgen Networks” and hosted the site there.

If a content delivery network were in place, then requests from Sydney would be served by a host of smaller cache servers closer to the user. Failing to use a CDN makes it harder and harder to scale a web site to serve many users at once. It also means the servers work harder the slower a user’s internet connection is, and the further that user is from the servers. This could potentially explain why a relatively modest number of connections from outside Australia might initially look like a DoS attack.

CDNs are cheap, effective, and already in use by government. They are a commoditised solution, with providers such as Akamai, Amazon Web Services and Section.io offering cheap, capable services. It’s hard to imagine why one wouldn’t be in use, especially when you consider the number of simultaneous connections the ABS should have been expecting, which brings me to the next question.
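
As a rough illustration, here is how one might check whether a hostname is fronted by a CDN or served straight from a single origin. The hostname below is a placeholder, not the real census address, and the headers a CDN exposes vary by provider, so treat this as a sketch rather than a diagnosis.

# Placeholder hostname; substitute the site you actually want to check.
HOST=www.example.gov.au

# A CDN-fronted site usually resolves via a CNAME into the provider's edge network.
dig +short "$HOST"

# Response headers often reveal a CDN or cache layer sitting in front of the origin.
curl -sI "https://$HOST/" | grep -iE 'server|via|x-cache|cf-ray|x-served-by'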

It passed ‘load testing’, so why would it fail if it wasn’t attacked?

The simple answer is that the ABS tested double the estimated average load, rather than double the estimated peak load. If your testing is predicated on census users gradually filling it out during the day in an orderly fashion, rather than directly after work or after their evening meal, then it’s obviously making the wrong assumptions. Most people work 9 to 5, and will be filling out the census between 18:00 and 22:00, at a guess. That means the average over that period could exceed 2 million hits per hour, with an actual peak more like 4 million.
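
As a back-of-envelope check (the household figure below is an assumption for illustration, not an ABS number, and a single form generates many more “hits” than one request):

# Roughly 10 million household forms (assumption), mostly submitted in a
# four-hour evening window, gives an average far above a gentle all-day spread:
echo $(( 10000000 / 4 ))   # ~2,500,000 forms per hour, before any peak factor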

So, it cost $9.6M and didn’t do the job?

No! It cost more like $27M, and didn’t do the job. The $9.6M is just one section of IBM’s billing.

This table was posted to Linkedin by Matt Barrie, CEO of Freelancer.com:

[Table: Summary of Costings for Census 2016 – Australia]

How would a CDN help, isn’t the content dynamic?

Yes and no. If you are running a finite number of servers (behind and in front of load balancers, as panic set in over the course of last night), then it doesn’t make sense to buy or rent servers in one geographic location to serve a whole country. When you consider the hundreds of requests required to render one page, and multiply that by millions of page views, suddenly the round-trip time, the time spent fulfilling static requests, and the overhead of load balancing and of initiating and terminating connections become massive, massive problems. As they exceed the ratings, tolerances, memory and bandwidth available, the problems start to cascade.

Why wouldn’t you leverage a content delivery network that already has the infrastructure in place to route huge percentages of the internet’s traffic seamlessly, rather than roll your own solution? Not only do you need to buy redundancy and overhead that you will never use again after the census (increasing cost), but you also have to hope that your staff in architecture, implementation and operations are up to the task. For hundreds of thousands of dollars, the above-mentioned CDNs, whose networks run the world’s news sites, sports sites and top retail sites, would have made sure the site stayed up, or that the failures were better managed.
EDIT 19:00:

The ABC has published an officially released timeline. It doesn’t detract from anything I’ve written so far. A CDN is built to be able to turn off geographic regions without failing; that’s just a fundamental requirement. IBM has likely spent a lot more money replicating a failed version of this feature.

The government has stepped away from calling this a DoS attack. Are they saying it was accidental? The correct term for a DoS that isn’t an attack is heavy load. The kind you get when you project only 500,000 hits an hour for the census…

Accessing Filemaker Pro Server 11 via ODBC/SQL

Anyone who has tried to access Filemaker Pro Server (and Server Advanced) data via ODBC has probably run into a myriad of weird issues. In this post, I’ll try to catalogue some of them, to help anyone else forced into this desperate method of communication.
Continue reading

10 ZFS on Ubuntu/Linux Learnings

I thought I’d go through a few learnings from running the ZFS on Linux (ZoL) package on Ubuntu. Some are general observations on ZFS, some are ZoL-specific, and some are just issues to avoid. For those who aren’t familiar with ZFS, it’s a filesystem capable of storing incredible quantities of data. It’s able to detect and fix bit rot, and it can do a whole swathe of cool tricks such as live snapshots, deduplication, and compression.

These are in no particular order.

1. If you can, and you have data you want to treat differently, split it into separate datasets as low in the pool’s structure as you can. It’s easy to configure one part of the pool to have deduplication, or another part to have compression with a particular algorithm. It’s a bad idea to turn either on for a whole pool of heterogeneous data. Deduplication in general is a performance killer. If you turn it on for a small set of highly duplicated data, it’s a valuable feature. If you turn it on for your whole pool, then everything from deleting snapshots to writing large, unduplicated datasets becomes a huge chore. The same goes for compression.
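
For example, something like this sets properties per dataset rather than pool-wide (‘tank’, ‘vmimages’ and ‘docs’ are placeholder names):

# Placeholders: 'tank' is the pool, 'vmimages' and 'docs' are example datasets.
zfs set dedup=on tank/vmimages        # dedup only where data really is duplicated
zfs set compression=lz4 tank/docs     # compression only where it pays off
zfs get dedup,compression tank/vmimages tank/docs   # confirm the settings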

2. Less is often more – choose the best feature for a dataset and use just that feature: either compression or deduplication. Don’t go overboard with snapshots either, or disable them for data storage that has high turnover. Otherwise your snapshots will bloat out to many times the size of the base data at any point in time.

3. Deleting more recent snapshots seems quicker than deleting older ones (with dedup turned on). Deleting snapshots with dedup on is a PITA.

4. If you are using SATA drives instead of SAS drives, then you have NCQ instead of TCQ. Too many acronyms? ZFS is configured by default to make great use of SAS drives, and to give SATA drives a headache. Set zfs_vdev_max_pending=1 and zfs_vdev_min_pending=1 using the following commands (as root):

echo "1" > /sys/module/zfs/parameters/zfs_vdev_max_pending
echo "1" > /sys/module/zfs/parameters/zfs_vdev_min_pending

5. raidz doesn’t help your read speed. It’s good for redundancy and write throughput, but reads often cap at the speed of the slowest drive. So stripe across mirrored pairs instead where read performance matters. With six drives, you could run a pool of three two-way mirror vdevs rather than a single raidz1.
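
A sketch of that six-drive layout (device names are placeholders; in practice use the stable /dev/disk/by-id paths):

# Three two-way mirrors striped together, RAID10-style.
zpool create tank mirror sda sdb mirror sdc sdd mirror sde sdf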

6. Don’t let it fill up!
This one should be in caps: DON’T LET YOUR RAIDZ FILL UP. Performance drops off a cliff; on many systems it’s effectively an outage at 95%+ of capacity used. Your disk I/O will fill up with sync processes.
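
A quick way to keep an eye on it (pool names are whatever you created):

# Prints each pool's used capacity as a percentage.
zpool list -H -o name,capacity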

7. L2ARC is great. Use it!
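
Adding an L2ARC device is a one-liner; a sketch, assuming you have a spare SSD (the device path is a placeholder):

# Adds an SSD as a read cache (L2ARC) for the pool 'tank'.
zpool add tank cache /dev/disk/by-id/ata-EXAMPLE_SSD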

8. Don’t add drives to a raidz after initial creation – the extra capacity isn’t covered by the redundancy! If you have a raidz array of 3TB and you add an extra 1TB drive, you gain another TB of capacity – but that capacity lives on one drive only, and the loss of that drive means the loss of the data stored on it!

9. zfs destroy -r tank will not only destroy tank, but everything under it, including snapshots. If you need to make a copy of your data and destroy the original, copy the filesystem first.
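
A sketch of copying a filesystem with send/receive before destroying anything (‘tank/data’ and ‘backup/data’ are placeholders, and the destination pool must already exist):

zfs snapshot tank/data@migrate
zfs send tank/data@migrate | zfs receive backup/data
# Verify the copy before even thinking about 'zfs destroy -r tank/data'.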

10. This one shouldn’t even need to be listed here, but monitor your zpool status. You want to know it’s degraded as soon as it’s degraded!
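
A minimal cron-style check, assuming outbound mail is already configured on the box (the script path and recipient are placeholders):

#!/bin/sh
# e.g. /etc/cron.hourly/zpool-check – only makes noise when something is wrong.
status=$(zpool status -x)
[ "$status" = "all pools are healthy" ] || echo "$status" | mail -s "zpool problem on $(hostname)" root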

Installing OpenSSH 6.6p1 on Ubuntu 13.10 “Saucy”

Ubuntu 14.04 includes a new version of OpenSSH. Version 6.2, which is present in Saucy, has a vulnerability which could potentially lead to it being exploited. If you are doing PCI compliance, then fixing this is a must!

Luckily, replacing OpenSSH is pretty easy. This install assumes you are using 13.10, that you already have OpenSSH installed and configured, and that you have already installed the Ubuntu 13.10 build dependencies. If you’ve ever compiled something from source, then you will have these. If you don’t, then use Continue reading
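
For anyone who can’t reach the full post, the broad strokes look something like this; the download URL, configure flags and paths are illustrative rather than taken from the original instructions, so check them against the full post and the OpenSSH documentation:

# Illustrative sketch only – verify the mirror, flags and paths for your setup.
wget https://cdn.openbsd.org/pub/OpenBSD/OpenSSH/portable/openssh-6.6p1.tar.gz
tar xzf openssh-6.6p1.tar.gz && cd openssh-6.6p1
./configure --prefix=/usr --sysconfdir=/etc/ssh --with-pam
make
sudo make install
ssh -V   # should now report OpenSSH 6.6p1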

Installing a virtualenv when python 2 and 3 are both installed

When running Python 2 AND Python 3 on the same system, at least on Mac OS X, you can run into some interesting problems. Mac OS X comes with an outdated version of Python installed by default. Using tools like Homebrew, you can install a more up-to-date version, such as 2.7 or Python 3+. In fact, you can install both, and use both quite successfully. Naturally virtualenv is pretty damn useful. If you aren’t aware, virtualenv allows you to set up an environment to run an application in, using its own set of installed Python modules, and even its own CPython binary. You can use this to test updating a Python module, or to check that your install scripts work without all the prerequisites already installed on your computer. It’s a very versatile tool!
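
A sketch of pointing virtualenv at a specific interpreter (the Homebrew-style paths are assumptions; adjust them to wherever your interpreters actually live):

# Create one environment per interpreter; -p selects the python binary to use.
virtualenv -p /usr/local/bin/python2.7 env27
virtualenv -p /usr/local/bin/python3 env3
source env3/bin/activate
python --version    # should report the Python 3 interpreter
deactivate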

Sometimes though, you run into issues. And while they are sometimes very easy to work around, there aren’t necessarily great fixes popping up all over the internet. Continue reading