Australia 2016 – #Censusfail


There’s been no shortage of pain for both the ABS and Australians trying to fill out the census. This article attempts to look at what happened, and why this outage may have occurred.

Was the Census ‘hacked’?

No. The ABS has claimed the Census was the target of a Denial of Service (DOS) attack. This is an attempt to flood the available servers with fake requests, to the extent that legitimate users cannot connect to the server.

Was the Census attacked by foreigners?

It’s extremely unlikely. Currently, the census site is not accessible from outside Australia. This may be an attempt to fix the issues the site has been having. However, security experts have not seen the influx of traffic you would normally associate with a foreign-originating distributed denial of service (DDOS) attack.

Could the census site cope with the load?

There appear to be a few questionable decisions in the design of the Census site. First, it appears the census site isn’t using a content delivery network (CDN). Whether queried from Sydney or Melbourne, the census site appears to be served from Melbourne. It looks like IBM have rented some servers from “Nextgen Networks” and hosted the site there.

If a content delivery network were in place, requests from Sydney would be served by a host of smaller cache servers closer to the user. Failing to use a CDN makes it harder and harder to scale a web site to serve many users at once. It also means the servers work harder the slower a user’s internet connection is, and the further that user is from the servers. This could potentially explain why even a modest number of connections from outside Australia might initially look like a DOS attack.

CDNs are cheap, effective, and already in use by government. They are a commoditised solution, with providers such as Akamai, Amazon Web Services and Section.io offering capable services at low cost. It’s hard to imagine why one wouldn’t be in use, especially when you consider the number of simultaneous connections they should have been expecting, which brings me to the next question.

It passed ‘load testing’, so why would it fail if it wasn’t attacked?

The simple answer is that the ABS tested double the estimated average load, rather than double the estimated peak load. If your testing is predicated on census users gradually filling it out during the day in an orderly fashion, rather than directly after work or after their evening meal, then it’s obviously making the wrong assumptions. Most people work 9–5 and will, at a guess, be filling out the census between 18:00 and 22:00. If, say, 8 million households all try to submit in that four-hour window, the average over that period could exceed 2 million hits per hour, with an actual peak more like 4 million.

So, it cost $9.6M and didn’t do the job?

No! It cost more like $27M, and didn’t do the job. The $9.6M is just one line item of IBM’s billing.

This table was posted to Linkedin by Matt Barrie, CEO of Freelancer.com:

Summary of costings for Census 2016 – Australia

How would a CDN help? Isn’t the content dynamic?

Yes and no. If you are running a fixed number of servers (behind and in front of load balancers, as panic set in over the course of last night), then it doesn’t make sense to buy or rent servers in one geographic location to serve a whole country. When you consider the hundreds of requests required to render one page, and multiply that by millions of page views, suddenly the round-trip time, the time spent fulfilling static requests, and the overhead of load balancing and of initiating and terminating connections become massive problems. As they exceed the ratings, tolerances, memory and bandwidth available, the problems start to cascade.

Why wouldn’t you leverage a content delivery network that already has infrastructure in place to route huge percentages of the internet’s traffic seamlessly, rather than roll your own solution? Not only do you need to buy redundancy and overhead that you will never use again after the census (increasing cost), but you also have to hope that your staff in architecture, implementation, and operations are up to the task. For hundreds of thousands of dollars, companies like the above-mentioned CDNs, which run the world’s news sites, sports sites, and top retail sites, would have made sure the site stayed up, or that the failures were better managed.

EDIT 19:00:

The ABC has published the officially released timeline. It doesn’t change anything I’ve written so far. A CDN is built to be able to turn off geographic regions without failing; that’s just a fundamental requirement. IBM has likely spent a lot more money replicating a failed version of this feature.

The government has stepped away from saying this was a DOS attack. Are they saying it was accidental? The correct term for a DOS that isn’t an attack is heavy load – the kind you get when you project 500,000 hits an hour for the census…

Accessing Filemaker Pro Server 11 via ODBC/SQL

Anyone who has tried to access Filemaker Pro Server (and Server Advanced) data via ODBC has probably run into a myriad of weird issues. In this post, I’ll try to catalogue some of them, to help anyone else forced into this desperate method of communication.
Continue reading

10 ZFS on Ubuntu/Linux Learnings

I thought I’d go through a few learnings from running the ZFS on Linux (ZOL) package on Ubuntu. Some are general observations on ZFS, some are ZOL specific, and some are just issues to avoid. For those who aren’t familiar with ZFS, it’s a filesystem capable of storing incredible quantities of data. It’s able to detect and fix bit rot, and it can do a whole swathe of cool tricks such as live snapshots, deduplication, and compression.

These are in no particular order.

1. If you can, and you have data you want to treat differently, split it into separate datasets as low in the pool’s structure as you can. It’s easy to configure one part of the pool to have deduplication, or one part to have compression with a particular algorithm; it’s a bad idea to turn either on for a whole pool of heterogeneous data (see the example commands after this list). Deduplication in general is a performance killer. If you turn it on for a small set of highly duplicated data, it’s a valuable feature. If you turn it on for your whole pool, then everything from deleting snapshots to writing large, unduplicated datasets becomes a huge chore. The same goes for compression.

2. Less is often more – choose the best feature, and use just that feature. Either compression or deduplication. Don’t go overboard with snapshots either, or disable them for data storage that has high turnover. Otherwise your snapshots will bloat out to many times the size of the base data at any point in time.

3. Deleting snapshots that are more recent seems quicker than deleting older ones (with dedup turned on). Deleting snapshots with dedup on is a PITA.

4. If you are using SATA drives instead of SAS drives, then you have NCQ instead of TCQ. Too many acronyms? ZFS is configured by default to make great use of SAS drives, and to give SATA drives a headache. Set zfs_vdev_max_pending=1 and zfs_vdev_min_pending=1 using the following commands:

echo "1" > /sys/module/zfs/parameters/zfs_vdev_max_pending
echo "1" > /sys/module/zfs/parameters/zfs_vdev_min_pending

5. raidz doesn’t help your read speed. It’s good for writing and redundancy, but not for reading – read speed often caps at the speed of the slowest drive. If read performance matters, use mirrored vdevs instead: with six drives, you could run a pool of three 2-way mirror sets striped together, rather than a single raidz vdev (see the pool layout example after this list).

6. Don’t let it fill up!
This one should be in caps: DON’T LET YOUR RAIDZ FILL UP. Performance will drop off a cliff; on many systems it’s effectively an outage once 95%+ of drive capacity is used, as your disk I/O fills up with sync processes.

7. L2ARC is great. Use it! (See the one-liner after this list.)

8. Don’t add drives to a raidz pool after initial creation – the extra capacity isn’t covered by the redundancy! If you have a raidz array of 3TB and you add an extra 1TB drive, you gain another TB of capacity – but that capacity lives on a single, unredundant drive, and the loss of that drive means the loss of the data stored on it, and in practice can take the whole pool down with it!

9. zfs destroy -r tank will not only destroy tank, but every dataset beneath it, including snapshots. If you need to make a copy of your data and destroy the original, copy the filesystem first (see the send/receive example after this list).

10. This one shouldn’t even need to be listed here, but monitor your zpool status. You want to know it’s degraded as soon as it’s degraded (a simple check is shown after this list)!
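
For point 1 above, setting these properties per dataset rather than pool-wide looks something like this (the pool and dataset names are placeholders, not from any particular system):

# compression only where it pays off, e.g. text-heavy data
zfs set compression=lz4 tank/logs
# dedup only on a small, highly duplicated dataset
zfs set dedup=on tank/vm-golden-images
# confirm what is set where
zfs get compression,dedup tank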
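
For point 5, a striped-mirror layout with six drives would be created roughly like this (device names are placeholders; /dev/disk/by-id paths are a better idea in practice):

# three 2-way mirrors striped together – better read IOPS, 50% space efficiency
zpool create tank mirror sda sdb mirror sdc sdd mirror sde sdf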
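
For point 7, adding an SSD as L2ARC is a one-liner (again, the device name is a placeholder):

# add a fast SSD as a read cache (L2ARC) to the pool
zpool add tank cache sdg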
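
For point 9, one way to copy a filesystem before destroying the original is a snapshot plus send/receive (dataset names are placeholders):

# take a recursive snapshot, then replicate it to a new dataset
zfs snapshot -r tank/data@migrate
zfs send -R tank/data@migrate | zfs recv tank/data-copy
# only once you have verified the copy:
zfs destroy -r tank/data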
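
For point 10, even something as simple as a daily check is better than nothing (the cron entry is just an illustration):

# prints only "all pools are healthy" unless something is wrong
zpool status -x
# e.g. in /etc/cron.d/zfs-health, mail anything that isn't healthy:
# 0 8 * * * root /sbin/zpool status -x | grep -v 'all pools are healthy'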

Installing OpenSSH 6.6p1 on Ubuntu 13.10 “Saucy”

Ubuntu 14.04 includes a new version of OpenSSH. Version 6.2, which is present in Saucy, has a vulnerability that could potentially be exploited. If you are doing PCI compliance, then fixing this is a must!

Luckily, replacing OpenSSH is pretty easy. This install assumes you are using 13.10, that you already have OpenSSH installed and configured, and that you have already installed the Ubuntu 13.10 build dependencies. If you’ve ever compiled something from source, then you will have this. If you don’t, then use…
Continue reading
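
The full steps are behind the link above, but for orientation, a from-source build usually looks something like the following – the download URL and configure flags here are assumptions, not the post’s exact commands:

# grab and unpack the portable OpenSSH release
wget https://ftp.openbsd.org/pub/OpenBSD/OpenSSH/portable/openssh-6.6p1.tar.gz
tar xzf openssh-6.6p1.tar.gz && cd openssh-6.6p1
# configure it to replace the distro install (flags are assumptions)
./configure --prefix=/usr --sysconfdir=/etc/ssh --with-pam
make
sudo make install
sudo service ssh restart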

Installing a virtualenv when python 2 and 3 are both installed

When running Python 2 AND Python 3 on the same system, at least on Mac OS X, you can run into some interesting problems. Mac OS X comes with an outdated version of Python installed by default. Using tools like Homebrew, you can install a more up-to-date version, such as 2.7 or Python 3+. In fact, you can install both, and use both quite successfully. Naturally, virtualenv is pretty damn useful. If you aren’t aware, virtualenv allows you to set up an environment to run an application in, using its own set of installed Python modules, and even its own CPython binary. You can use this to test updating a Python module; you can also use it to check that your install scripts work without all the prerequisites already installed on your computer. It’s a very versatile tool!
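
For reference, creating a virtualenv against a specific interpreter looks roughly like this (the Homebrew install and the paths are assumptions about a typical setup):

# install a second interpreter alongside the system Python
brew install python3
pip install virtualenv
# tell virtualenv exactly which interpreter the environment should use
virtualenv -p $(which python3) ~/envs/myproject
source ~/envs/myproject/bin/activate
python --version    # should now report the interpreter you selected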

Sometimes though, you run into issues. And while they are sometimes very easy to work around, there aren’t necessarily great fixes popping up all over the internet.
Continue reading

Reading data from a 1W340A

PCsensor.com manufacture a range of temperature and humidity measuring equipment. Their offerings are cheap and cheerful, but ultimately fail in one key area: the software. You are (nearly) always limited to a closed-source Windows option. The 1W340A allows you to log data over the network, which is a big step up from the generic USB HID type options of the past.

It still has the software problem though.

So what can be done? Well, the best option is to watch the 1W340A talk to the software, capture the data packets, and attempt to implement polling in another program. My program of choice is Python.
Continue reading
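
The capture step itself is straightforward from any machine that can see the traffic; something like the following will do, assuming the logger sits at 192.168.1.50 (the address and interface are placeholders):

# record everything to and from the logger for later inspection in Wireshark
sudo tcpdump -i eth0 host 192.168.1.50 -w 1w340a.pcap
# or watch the payloads live while the vendor software polls the device
sudo tcpdump -i eth0 host 192.168.1.50 -A -s 0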

EXIF & Orientation: Applying rotation and stripping EXIF data

When working on a project that involves users uploading images, there are a lot of things to look out for. One unexpected one is EXIF data. Firstly, if you are hosting images uploaded to you, it’s best to strip EXIF data to prevent the internet at large from being able to see where GPS-tagged images were taken (see celebrity Twitter stalking). There is another issue, though – orientation. A number of cameras, including the iPhone, store orientation data in the EXIF and leave the original image in the “correct” orientation with regard to the camera sensor. That approach makes a lot of sense; however, some browsers on some devices rotate images in accordance with their EXIF data while others do not. Obviously this can and will result in browsers displaying images in different orientations, which can be a big issue, especially if it’s an image going on a product.

Get up to speed on image orientations here. Yes, it’s in another language, but it’s the original source for some of the best explanatory images, and at the end of the day that’s what you need to know: what the images mean and how the data is stored. Ultimately, there’s an EXIF orientation tag with a value of 1–8: four values for the quarter rotations, and four for the same rotations reflected along a vertical axis.

So what’s the best thing to do? I’m focused on Python, so that’s the perspective I’ll take. Firstly, I think the best option is to always output an unequivocal “production” version of the image, so the first step is to apply the rotation to the image. For completeness, my solution includes the mirrored options that you don’t normally get from cameras (I mean, who knows, right?). But this makes the problem worse! PIL, Wand and ImageMagick all retain the EXIF data, and as yet I haven’t been able to write to it in any of these packages, only read.

So, my code does the following:
1. Load the image, or inherit the image object from a loading function.
2. Check the EXIF parameters using the built-in functions of your image handler (for Wand it's imageobj.metadata, for instance).
3. Get the orientation value, which will be either an int or a length-one string of an int.
4. If this isn't 1, apply the rotations.
5. Assuming any changes are applied, it's advisable to edit or strip the EXIF.
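
As an aside – and this is an alternative to the pure-Python route above, not the code described in this post – if shelling out to a tool is acceptable, ImageMagick’s command line can do both the rotation and the stripping in one pass:

# bake the EXIF orientation into the pixels, then drop the metadata
convert upload.jpg -auto-orient -strip production.jpg
# or rewrite a batch of files in place
mogrify -auto-orient -strip *.jpg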

What’s the best way to strip the EXIF?
First, RTFM.
I borrowed some concepts from here, but the code was from 2004 and non-functional. So I rewrote the code to strip out APP1 and set a generic APP0, replacing the mmap function (which behaves differently on Windows and Unix) with StringIO. The main differences were changing resize to truncate, and building a move function which operates in the same manner as mmap’s move:

def stringio_move(stringio, destination, source, count):
    # Mimic mmap.move(): copy `count` bytes starting at `source` over the
    # bytes starting at `destination`, then return the buffer.
    stringio.seek(source)
    x = stringio.read(count)
    stringio.seek(destination)
    stringio.write(x)
    return stringio

I parse the image into the module as a blob (Wand: imageobj.make_blob()), load that into a StringIO, and get to work. At the end of the day, you are just stepping through the string looking for the SOI header and then the APP0 and APP1 markers. At the end of the function you write the StringIO out over the initial blob and go on your merry way.

The following packages have been kept back

If you’ve ever seen the words “The following packages have been kept back”, you’ll know it can be pretty frustrating. You’ve told it to update, so why isn’t it updating?

This occurs because the package has had its dependencies changed: upgrading it would either install additional packages or uninstall software the new version doesn’t need. A lot of replies will tell you to do a dist-upgrade.

This is a very bad idea unless you know what you are doing. It will cause a LOT of changes to your system, and it’s not massively unusual to see it stop a system from running until you sort out a raft of post dist-upgrade issues. Now, some people will argue that you should always dist-upgrade and deal with issues as they crop up, and while there’s merit to this, you can’t do it on a production system, especially just because you need to upgrade one package.

What’s the solution?


apt-get update
apt-get dselect-upgrade

This will then follow up with the usual explanation of which packages will be added or removed. Type in “y” like you normally do, and it will install/uninstall/upgrade your packages. Done.
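
As an aside, if only one or two packages are being held back and you’d rather not touch anything else, another approach (not covered above) is to upgrade just those packages by name, which lets apt resolve their new dependencies without a full dist-upgrade. The package name below is just a placeholder:

apt-get update
apt-get install the-package-that-was-kept-back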