Australia 2016 – #Censusfail

 

 

There’s been no shortage of pain for both the ABS and Australians trying to fill out the census. This article attempts to look at what happened, and why this outage may have occurred.

Was the Census ‘hacked’?

No. The ABS has claimed the Census was the target of a Denial of Service (DOS) attack. That is an attempt to flood the available servers with fake requests, to the extent that legitimate users cannot connect to the server.

Was the Census attacked by foreigners?

It’s extremely unlikely. Currently, the census site is not accessible from outside Australia. This may be an attempt to fix the issues the site has been having. However, security experts have not seen the influx of traffic you would normally associate with a foreign-originating distributed denial of service (DDOS) attack.

Could the census site cope with the load?

There appear to be a few questionable decisions in the design of the Census site. First, it appears as if the census site isn’t using a content delivery network (CDN). Whether queried from Sydney or Melbourne, the census site appears to be served from Melbourne. It looks like IBM has rented some servers from “Nextgen Networks”, and hosted the site there.

If a content delivery network were in place, then requests from Sydney would be served by a host of smaller cache servers closer to the user. Failure to use a CDN means that it becomes harder and harder to scale a web site to serve many users at once. Further, the servers actually work harder the slower a user’s internet connection is, and the further that user is from the servers, because each connection stays open for longer. This could potentially explain why a smaller number of connections from outside Australia might initially look like a DOS attack.
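
The point about slow, distant users can be sketched with Little's Law: the average number of connections open at once equals the arrival rate multiplied by how long each connection stays open. This is only an illustration; the request rate and durations below are made-up numbers, not figures from the ABS or IBM.

```python
def concurrent_connections(requests_per_second: float, seconds_per_request: float) -> float:
    """Little's Law: average open connections = arrival rate x time each stays open."""
    return requests_per_second * seconds_per_request


# Hypothetical figures, for illustration only.
RATE = 500.0  # requests arriving per second

nearby_fast = concurrent_connections(RATE, 0.25)  # fast link, close to the server
distant_slow = concurrent_connections(RATE, 2.0)  # slow link, far from the server

print(f"fast/nearby clients:  {nearby_fast:.0f} connections open at once")
print(f"slow/distant clients: {distant_slow:.0f} connections open at once")
```

The same number of requests per second ties up eight times as many connections in the second case, which is the kind of pressure a nearby cache server is meant to take off the origin.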

CDNs are cheap, effective, and in use by government already. They are a commoditised solution with many providers such as Akamai, Amazon Web Services and Section.io providing cheap, capable services. It’s hard to imagine why one wouldn’t be in use, especially when you consider the number of simultaneous connections they should be expecting, which brings me to the next question.

It passed ‘load testing’, so why would it fail if it wasn’t attacked?

The simple answer is that the ABS tested double the estimated average load, rather than double the estimated peak load. If your testing is predicated on census users gradually filling it out during the day in an orderly fashion, rather than directly after work or after their evening meal, then it is built on the wrong assumptions. Most people work 9 – 5, and will be filling out the census between 18:00 and 22:00, at a guess. That means that the average over that period could exceed 2 million hits per hour, with an actual peak more like 4 million.
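
A rough sketch of why testing against the daily average misleads. The total number of hits and the share arriving in the evening are assumptions chosen purely for illustration; only the 18:00 to 22:00 window comes from the guess above.

```python
def hourly_rate(total_hits: int, share: float, hours: float) -> float:
    """Average hits per hour when `share` of all hits arrive within `hours`."""
    return total_hits * share / hours


TOTAL_HITS = 10_000_000  # hypothetical total for census day
EVENING_SHARE = 0.8      # assumed: 80% of people fill it in after work
EVENING_HOURS = 4        # 18:00 to 22:00

flat_average = hourly_rate(TOTAL_HITS, 1.0, 24)
evening_average = hourly_rate(TOTAL_HITS, EVENING_SHARE, EVENING_HOURS)
tested_capacity = 2 * flat_average  # "double the estimated average load"
survives_evening = tested_capacity >= evening_average

print(f"flat daily average: {flat_average:,.0f} hits/hour")
print(f"tested capacity:    {tested_capacity:,.0f} hits/hour")
print(f"evening average:    {evening_average:,.0f} hits/hour")
print(f"survives evening:   {survives_evening}")
```

Under these assumptions, doubling the flat average still leaves the tested capacity well short of the evening average, before any short-lived peak is considered.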

So, it cost $9.6M and didn’t do the job?

No! It cost more like $27M, and didn’t do the job. The $9.6M is just one of the sections of IBM’s billing.

This table was posted to Linkedin by Matt Barrie, CEO of Freelancer.com:

[Table image: Summary of Costings for Census 2016 - Australia]

How would a CDN help, isn’t the content dynamic?

Yes and no. If you are running a finite number of servers (behind and in front of load balancers, as panic set in over the course of last night) then it doesn’t make sense to buy or rent servers in one geographic location to serve a whole country. When you consider the hundreds of requests required to render one page, and multiply that by millions of page views, suddenly the round-trip time, the time spent fulfilling static requests, and the overhead of load balancing and of initiating and terminating connections become massive, massive problems. As they exceed the ratings, tolerances, memory and bandwidth available, the problems start to cascade.

Why wouldn’t you leverage a content delivery network that already has infrastructure in place to route huge percentages of the internet’s traffic seamlessly, rather than roll your own solution? Not only do you need to buy redundancy and overhead that you will never use again after the census (increasing cost), but you also have to hope that your staff in architecture, implementation, and operations are up to the task. For hundreds of thousands of dollars, companies like the above-mentioned CDNs, whose networks run the world’s news sites, sports sites, and top retail sites, would have made sure the site stayed up, or that the failures were better managed.
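
To put rough numbers on the static-request point, here is a minimal sketch of how many requests still reach the origin servers when an edge cache absorbs the static ones. Every figure (page views, requests per page, static fraction, cache hit ratio) is a hypothetical value for illustration, not a measurement of the census site.

```python
def origin_requests(page_views: int, requests_per_page: int,
                    static_fraction: float, cache_hit_ratio: float) -> int:
    """Requests that still hit the origin after an edge cache absorbs static hits."""
    total = page_views * requests_per_page
    static = total * static_fraction       # images, CSS, JavaScript and so on
    absorbed = static * cache_hit_ratio    # served by the edge, never reaches origin
    return round(total - absorbed)


# Hypothetical figures, for illustration only.
PAGE_VIEWS = 1_000_000
REQUESTS_PER_PAGE = 200
STATIC_FRACTION = 0.9
CACHE_HIT_RATIO = 0.95

no_cdn = origin_requests(PAGE_VIEWS, REQUESTS_PER_PAGE, STATIC_FRACTION, 0.0)
with_cdn = origin_requests(PAGE_VIEWS, REQUESTS_PER_PAGE, STATIC_FRACTION, CACHE_HIT_RATIO)

print(f"origin requests without a CDN: {no_cdn:,}")
print(f"origin requests with a CDN:    {with_cdn:,}")
```

Even with the dynamic form pages left uncached, the origin in this sketch sees a small fraction of the requests it would otherwise have to answer.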

EDIT 19:00:

The ABC has published an officially released timeline. It doesn’t detract from anything I’ve written so far. A CDN is built to be able to turn off geographic regions without failing. That’s just a fundamental requirement. IBM has likely spent a lot more money replicating a failed version of this feature.

The government has stepped away from saying this is a DOS attack. Are they saying it’s accidental? The correct term for a DOS that isn’t an attack is heavy load. The kind you get when you project 500,000 hits an hour for the census…

2 thoughts on “Australia 2016 – #Censusfail”

    • A CDN is a property of a bit of your virtual network. If you have a Facebook like button, then at least that part of your site is using a CDN.

      But saying the census site is on a CDN, is like claiming you’re using a carbon fibre bike, because you have a carbon fibre drink holder. Sure, you are using a CDN, but you aren’t using a CDN to do all the things it can do.

      At the moment, the wonderful engineers have set a TTL of 0 on the DNS records for the census site. So you’ll tend to see the census site come up and go down, as their DNS server chokes. This is probably because the value of 4 hours they previously had was far FAR too high. Problem is, a value of 0 means they are getting thrashed.

      This is evidence they aren’t using a proper, scalable DNS setup. Such a setup is very, very cheap.

      The reason it matters? Because they are using this DNS “trickery” to bounce users between servers. Great work, I’m so glad I’m connected to “https://stream10.census.abs.gov.au/eCensusWeb/welcome.jsp”. They should have a load balancer in front (I think theirs failed), or be using a CDN configuration to load balance. What would that look like? Well, even if you fail to make your pages cacheable, and just plonk a CDN in front, everything will look like it comes from census.abs.gov.au, but the identity of census.abs.gov.au will look different to different geographic regions. You get a natural, geographic (down to a suburb level, according to some CDNs) load balancing effect. Plus, if your site goes down, it goes down for one region, which can prompt you to spin up another server, and split traffic up there.

      If you are using a CDN, then even when your servers go down, your site can still return a static, useful page, telling people what’s going on. If properly done, people can wait there, and click a button to attempt to resubmit data. The fact they haven’t done this (and aren’t properly using a CDN) indicates they were completely convinced that everything would be fine. Their mitigations so far have been quite poor.

      So, maybe they are using a CDN. They are either using a terrible one (I have no experience with IBM’s SoftLayer CDN) or they are using it terribly. I think it’s the latter.

      The fact that the ABC’s official, published timeline includes “[geolocation server fell over, now all the traffic was piling in]” tells us that their architecture is nothing like what you’d expect to mitigate this kind of traffic. These CDN products are mature and, compared to the prices IBM’s been paying for VMware licenses, etc, very cheap. It looks like they’ve opted to mis-use their own CDN.
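
The DNS TTL point in the comment above can be sketched with a simplified model: each recursive resolver asks the authoritative server again only when its cached record has expired, so the TTL caps how often it has to ask. The resolver count and lookup rate are hypothetical, and real resolvers may behave differently (some enforce their own minimum TTL).

```python
import math


def authoritative_queries_per_hour(resolvers: int,
                                   lookups_per_resolver_per_hour: int,
                                   ttl_seconds: int) -> int:
    """Upper-bound estimate of queries reaching the authoritative DNS in one hour."""
    if ttl_seconds <= 0:
        # Nothing may be cached, so every client lookup is passed through.
        return resolvers * lookups_per_resolver_per_hour
    # Number of times a cached record can expire within the hour.
    refreshes = math.ceil(3600 / ttl_seconds)
    return resolvers * min(lookups_per_resolver_per_hour, refreshes)


# Hypothetical figures, for illustration only.
RESOLVERS = 10_000
LOOKUPS = 5_000  # client lookups arriving at each resolver per hour

ttl_zero = authoritative_queries_per_hour(RESOLVERS, LOOKUPS, 0)
ttl_minute = authoritative_queries_per_hour(RESOLVERS, LOOKUPS, 60)
ttl_four_hours = authoritative_queries_per_hour(RESOLVERS, LOOKUPS, 4 * 3600)

print(f"TTL 0:       {ttl_zero:,} queries/hour")
print(f"TTL 60s:     {ttl_minute:,} queries/hour")
print(f"TTL 4 hours: {ttl_four_hours:,} queries/hour")
```

In this model a short but non-zero TTL keeps the ability to move users between servers quickly while cutting the authoritative load by orders of magnitude compared with a TTL of 0.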
