I saw a past student’s blogpost on Friday about using GF to create chloropleth maps, and we didn’t do a data session on Friday, so I thought I would give it a go and try and learn something by myself over the weekend.
It started out well. Having decided to throw caution to the wind and abandon tutorials I’d seen online that used US boundaries and US data (all far easier to find, better-documented, etc), I found a dataset I was interested in – England & Wales prison population numbers in the last month, ie September 2013. And I figured out what I wanted to do. To my total lack of surprise, it turns out a fair few prisons are currently overcrowded.
Wrong format, stupid
Now, helpfully (not), it was in a word format and it took me just a few minutes to learn something that I genuinely have been irritated about for some time and never knew was possible. How do you change a word document into an excel document?
Here you go:
- Open Word document
- File -> Save as… HTML
- Open Excel
- File -> Open the file you just saved
So I had the data in the right format, but I wanted to show it visually as ‘by county’ not by prison (to be honest, this was a terrible piece of decision-making because it serves zero purpose, but I wanted to see if I could do it. Now, after spending hours on it, I’ve decided to go back to where I was at the start – more on that in a minute.)
I scraped the table about where the prisons are located in England and Wales from Wikipedia, using a Google Spreadsheet and using the importHTML function. Then I merged the two tables together in Excel in a really complicated and time-consuming way, which I probably could have done automatically. [Copied & pasted in location data as extra column, deleted ones that didn’t match up…it took me a while]. I then added up all the figures for each county, then calculated the % of ‘operational capacity’ taken up.
But what does it all mean?
The column headings in the prison dataset aren’t immediately obvious. I had to do some searching to figure out what everything meant, and out of five or six columns, decide which columns I was going to use… Or, which columns are the most accurate to use, to find out the extent of overcrowding in prisons. There’s no point in using ‘operational capacity’ against CNA because that doesn’t take into account cells that can’t currently be used, eg. So I guess my point here, is make sure you know what figures you need to use.
Finding the borders
I must have spent about three to four hours just trying to find any way of getting what I wanted – county border data in a format that could be geocoded by Google Fusion. I cannot believe that there is not a file out there containing the data I need. I refuse to believe it. But I did try everything I could think of, and I’m still not there. I tried downloading shapefiles, converting those to KMLs, I tried importing KMLs directly, I tried merging other tables with geocoded data in… ARGH. Nothing works. I have found the files I thought I could use but I…couldn’t. I don’t know why.
When I eventually figure this all out, I’m going to make sure I keep a copy for future use. No way I’m letting it go!
Having failed in my quest to find borders that work with the county names I have, I have fully given up with the idea of doing it by border. And in any case, it’s not really a true picture of the prison situation… Some areas have several prisons, some have one…plus, as they are all added together it doesn’t make sense.
But this is as far as I’ve got with my map:
Remember that this is *all prisons in counties*, NOT individual prisons. The map tells you where there’s overcrowding in prison/s. The markers are as follows:
- Green: 0-50% full
- Yellow: 50%-100%
- Red dots: 100-150%
- Red markers: 150%+
So you can see that the total capacity for Swansea & Bedfordshire have been exceeded, by 50% or over. As it happens, both counties only have one prison so those two are genuinely quite bad. Although the numbers are fairly small (we’re talking around 400 prisoners)- but they are still over capacity.
- Make sure you figure out the right way to visualise something before you embark on something that will take you hours to do
- Use something easy
- I need to figure out how to get borders to work
I’m not happy that the info is *markers* by county (so they’re super vague) instead of boundary-lines with colour, but I think the best thing for me to do is to take the data back to how it was, scrape the actual prison addresses from somewhere, and plot them exactly where they are. I think that will give a much clearer picture of what’s going on. When I’ve done that I’ll publish and link it properly so you can have a play around.