Posts Tagged data journalism

Behind the story, part three: creating a map with Fusion

In the first post I explained the purpose of this three-part tutorial. I was working on a couple of things for the Hackney Post, and I made this interactive map of entertainment venues in Hackney. I wanted to get the addresses, verify they were in Hackney, and then turn them into a map. You could, of course, do all of this by hand, or you could semi-automate the process using scrapers and other tools.

The tutorial covers the following:

Part one: scraping with Outwit Hub
Part two: cleaning and verifying data
Part three: creating a map with Google Fusion

Using Google Fusion

Go to your Google Drive (if you don’t have one, you can get one very easily if you use Gmail – just go to and it should explain everything there). Click on create, and then click on Fusion Table. If you don’t have that option, just click “connect more apps” and you should be able to find it easily.

[Screenshot: Screen Shot 2014-04-01 at 09.46.52]

When prompted, choose the file you saved in the last tutorial (if it’s a CSV file, ensure that “comma” is checked for separator character) and click next. It should display a preview of the information you have in the table, so click next again, and then finish – unless you want to change some of these values, which may be useful later but we don’t need to tinker with them now. It should look a little something like this, with an extra map tab to the right:

[Screenshot: Screen Shot 2014-04-01 at 09.52.46]

The highlighted cells – the “Name” values – are what Google is suggesting it should use to find the locations if you want to make a map. This is wrong, of course, so we can change it by clicking on the arrow next to the Postcode column, then Change…, and changing the type from Text to Location. All of the Postcode column should now be highlighted, and some of the Name values will be highlighted too. Change the Name column type from Location to Text, and then Google Fusion should start geocoding.

[Screenshot: Screen Shot 2014-04-01 at 09.57.01]

Geocoding is the process through which Google figures out what the location values mean – turning the contents of the postcode column into points on a map, since location values can come in many different forms. When that’s done, click on the Map tab, and Google should have figured out that all of the pins are in Hackney, London.

Rename your map “Art galleries in Hackney” now, so we don’t forget to do it later. Then click “Change feature styles…” – this is where you can change the icon if you want to. You can use this to specify gradients depending on values, or different icons for different values. In the original map I made, you’ll see that theatres, art galleries and museums are all different colours. Before I imported into Fusion, I added an extra column with values like large_green, large_blue and large_red, one for each type – and then I specified that column in the feature styles option.
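If you’d rather script this step than build the extra column by hand, a rough Python sketch of the idea might look like this – the icon names are from the post, but which colour goes with which venue type, and the sample rows, are my own illustrative guesses:

```python
# Sketch: building the extra icon-name column before importing into Fusion.
# Which colour maps to which venue type is an assumption, not the original.

ICONS = {
    "art gallery": "large_green",
    "theatre": "large_red",
    "museum": "large_blue",
}

def icon_for(venue_type):
    """Map a venue type to a Fusion Tables marker icon name."""
    return ICONS.get(venue_type, "large_yellow")

# Hypothetical (name, postcode, type) rows as they might come out of a scrape.
rows = [
    ("Fortescue Avenue", "E8 3QB", "art gallery"),
    ("Some Theatre", "E8 1EJ", "theatre"),
]
styled = [(name, pc, icon_for(kind)) for name, pc, kind in rows]
```

The resulting third column is what you’d then point the feature styles option at.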

For now, we just want it to be a large green icon, so click that and then save. If you click on one of the green pins now, it should bring up a little window, like this. We want to change that.

[Screenshot: Screen Shot 2014-04-01 at 10.03.33]

Click on “Change info window…” and you can see exactly what just came up. If you click on Custom, you can customise it to say whatever you want. I’d like to bold the name and take out the “Postcode:” label. Move {Name} in between the <b> tags where the Name: label is, and get rid of the Name: and Postcode: text. It should look like this:

[Screenshot: Screen Shot 2014-04-01 at 10.06.42]

Now, when you click on a pin, it should show the name in bold, with the postcode written underneath it. And we’re done. Zoom in so you can see all of the pins as closely as possible (this is how people will see the map when they first click on it, so it’s important!). Now, click the arrow next to the map tab, and Publish. You’ll need to change the permissions to “anyone with a link” to be able to embed it and send links to people. Check that the link works, and you’re done!

Making more elaborate maps

In the original piece I did for the Hackney Post, I had the entire address, phone numbers where possible, and hyperlinks to each venue’s website, all contained in the info window on the map. I had retained this information from the various scrapes I did of the Independent Directory and other places, and I also scraped the first ten Google results for each venue, using Outwit Hub. This tutorial produced a really simple map that can be put together quite quickly once you have the data, but there is so much more you can do with Google Fusion.



Behind the story, part two: cleaning and verifying data

In the first post I explained the purpose of this three-part tutorial. I was working on a couple of things for the Hackney Post, and I made this interactive map of entertainment venues in Hackney. I wanted to get the addresses, verify they were in Hackney, and then turn them into a map. You could, of course, do all of this by hand, or you could semi-automate the process using scrapers and other tools.

The tutorial covers the following:

Part one: scraping with Outwit Hub
Part two: cleaning and verifying data
Part three: creating a map with Google Fusion

Cleaning messy data

Open up the file you saved all of your data to. The first thing you might notice is that Outwit has scraped some extra rows that don’t have addresses in them. We need to get rid of those, so click Data, then Filter – it should put arrows on both of the columns. Click on the arrow on the address column, and sort ‘Ascending’. Go down to the bottom, and you’ll see there are venue names without addresses. Highlight those rows and delete them, then clear the filter.

We now have a clean dataset we can work with.

Note: any dataset you download will be much messier than this – and even with a great scraper sometimes you need to spend a while cleaning the data.

Separating text in columns

What we want to do now is split up the address in each row, because all we need is the postcode in order to verify that the venue is actually in Hackney. Looking at the address column, you’ll notice each address has three chunks separated by commas.

Highlight the address column, and press CTRL-F (or CMD-F). Search for “, ” – that is, comma space. The space is necessary as it means there won’t be any stray spaces at the beginning or end of the cells. Click Replace…, type in “@”, and then click Replace all. It’ll look like this:

[Screenshot: Screen Shot 2014-03-31 at 18.09.03]

Then, click on Data and then Text to columns. Make sure ‘delimited’ is highlighted. Click next, and then where it says delimiters, put @ in the ‘other’ box. This tells it to split by the @ we just inserted into the column. Then click finish.

Most of the postcodes should be in column D, but some of them are in E or even C. You now need to go through and ensure that all of the postcodes are in the same column. Delete the other columns with the street or city in them, so you have two columns again: one with the name of the venue, and one with the postcode. Save it, and we’re ready for the next stage.
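For anyone who prefers to script the split instead of using find-and-replace and Text to columns, here’s a rough Python equivalent – the sample addresses are made up:

```python
# Sketch of the text-to-columns step: split each address on ", " and take
# the last chunk as the postcode. The example addresses are invented.

def postcode_from(address):
    """Return the final comma-separated chunk of an address string."""
    return address.split(", ")[-1].strip()

addresses = [
    "12 Fortescue Avenue, London, E8 3QB",
    "291 Example Street, Hackney, London, N16 0AB",  # four chunks works too
]
postcodes = [postcode_from(a) for a in addresses]
```

Taking the last chunk sidesteps the problem of postcodes drifting between columns C, D and E.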

More cleaning with Open Refine

Now you need to open Refine [download it here] and create a new project. Open the file you just saved – find it in the browser and click next. If it’s a CSV file, make sure ‘columns are separated by’ is set to commas (CSV). The preview should show two columns exactly as they were in the spreadsheet. Click next and then ‘create project’. Now the project should be ready to edit.

First, copy the column to another column so we keep the original postcode data in case anything goes wrong.

Click the arrow next to ‘Address’, then ‘Edit column’, and click ‘Rename column’. Rename it ‘Original’. Then click the arrow again, ‘Edit column’, then ‘Add column based on this column’. Leave the expression as value so the copy is identical. Call it ‘Postcode’ and click OK.

Click the arrow on the new Postcode column, then ‘Edit cells’, then ‘Transform’. Like this:

Type in: replace(value," ","")

This will take every ” ” (space) and replace it with nothing, effectively removing the spaces from the postcodes. replace is a good command to know.

Using Open Refine to scrape

Now, the aim of this part of the work is to validate the postcodes by using a website which looks up postcodes and brings back information about them. There are a few, but I quite like this particular one called Planning Finder – it’s really useful for getting information about postcodes and planning projects in the area. It automatically comes up with the nearest planning project to the postcode, so we know that the first value it returns is either the exact postcode or very near.

We will validate our postcodes in two ways. Firstly, if a postcode isn’t recognised at all, we’ll ignore it. Secondly, we’ll find out what area the postcode is in, and if it isn’t in Hackney, then we know we don’t want it. Click the arrow on the Postcode column again, ‘Edit column’, and then ‘Add column by fetching URLs’.

The expression is: ""+value

You’re telling it to go to the website string, plus the value of whatever is in that cell – the postcodes – and to then download everything on that page into the cell. You will end up with the HTML of the entire page, so it’ll take a long time – time to grab a cup of tea!

When it’s done fetching everything, you need to transform the new column: arrow > Edit cells > Transform. This is the expression you need: value.parseHtml().select("table.search-results")[0].select("td.info")[3].toString()

It’ll parse the HTML, selecting the first table with the class “search-results”, and then the fourth cell with the class “info” (indexes start at 0 in code, so [3] is the fourth). This cell contains the borough information – you should see that in the preview. The toString() then converts the result into a string – if you don’t do this, it will simply fail.
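If you wanted to reproduce this parsing step outside Refine, here’s an offline Python sketch using only the standard library. The lookup site’s URL isn’t reproduced in the post, so rather than fetching anything this parses a canned HTML snippet with the same structure – a “search-results” table whose “info” cells include the borough:

```python
# Offline sketch of the parse step: pull the text of every <td class="info">
# cell out of the fetched page; the fourth one holds the borough. The HTML
# below is an invented stand-in for the real response.

from html.parser import HTMLParser

class InfoCellParser(HTMLParser):
    """Collect the text of each <td class="info"> cell."""
    def __init__(self):
        super().__init__()
        self.in_info = False
        self.cells = []
    def handle_starttag(self, tag, attrs):
        if tag == "td" and ("class", "info") in attrs:
            self.in_info = True
            self.cells.append("")
    def handle_endtag(self, tag):
        if tag == "td":
            self.in_info = False
    def handle_data(self, data):
        if self.in_info:
            self.cells[-1] += data.strip()

html = """
<table class="search-results"><tr>
  <td class="info">Ref 123</td><td class="info">E8 3QB</td>
  <td class="info">Extension</td><td class="info">Hackney</td>
</tr></table>
"""
parser = InfoCellParser()
parser.feed(html)
borough = parser.cells[3]   # the fourth "info" cell holds the borough
```

The [3] index mirrors the [3] in the Refine expression above.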

You should now have something that looks like this – you can see immediately that some of them aren’t in Hackney.

[Screenshot: Screen Shot 2014-03-31 at 21.55.51]

So, let’s export that and work on that in Excel. Export it as a CSV or Excel document (top right) and open it up.

Do the same as we did before with find and replace – highlight the whole of the PF column, find <td class="info"> and replace it with nothing, then find </td> and replace it with nothing. Delete the Postcode column, because that’s the one without any spaces in it. Then you want to get rid of all of the non-Hackney postcodes.

So click on the top row, go to Data, then Filter. Click on the arrow next to the PF column, and untick the two Hackney entries – you should now have all of the non-Hackney rows displayed. Select them all (click on the top row and drag down – do not use CTRL/CMD-A, as this won’t work) and then delete the rows. Clear the filter, and you should have 61 rows (60 venues plus the title row) with exactly what you want in them. Delete the PF column, as we don’t need it anymore, save it, and then you’re done!
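The same filter-and-delete step can be scripted; here’s a small Python sketch over an invented three-row CSV, with column names following the tutorial:

```python
# Sketch of the filter step: keep only rows whose borough column (PF) says
# Hackney. The venue rows below are made up for illustration.

import csv, io

raw = io.StringIO(
    "Name,Postcode,PF\n"
    "Venue A,E8 3QB,Hackney\n"
    "Venue B,N16 0AB,Hackney\n"
    "Venue C,E2 8AA,Tower Hamlets\n"
)
reader = csv.DictReader(raw)
hackney_rows = [row for row in reader if row["PF"] == "Hackney"]
```

Scripting it also avoids the select-and-drag step, which is fiddly on long sheets.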

In the next (and last) tutorial, we’ll look at making a map with Google Fusion.


Behind the story, part one: scraping with Outwit

Recently I was working on a couple of things for the Hackney Post, and I made this interactive map of entertainment venues in Hackney. I wanted to get the addresses, verify they were in Hackney, and then turn them into a map. You could, of course, do all of this by hand, or you could semi-automate the process using scrapers and other tools.

The tutorial covers the following:

Part one: scraping with Outwit Hub
Part two: cleaning and verifying data
Part three: creating a map with Google Fusion

Scraping a single page

I used several different sources, but for the sake of brevity, the page we’re going to scrape from is

Open up Outwit Hub [download or add the Firefox extension here], paste the URL into the address bar (like a normal browser) and make sure you’re on the “Page” tab at the very top. Find the first item you want to capture. In this case, we want the venue name and the address, and the first venue is “Fortescue Avenue”.

We’re ready to scrape. Click on the ‘scrapers’ tab under ‘automators’ and click ‘New’. Call your new scraper IndyDirectory – it should look like so:

Then CTRL-F (or CMD-F if you’re on a Mac) and start typing in Fortescue Avenue, because that’s what we need to find. It’ll scroll down to where it’s found – this is the first instance of it.

You’ll see there are actually three places where Fortescue Avenue is written. One is in the HTML telling browsers about the link (in the title attribute after <a href), one is the link text for the URL (in between the opening and closing <a> tags), and one is part of the address. We’ll get the address in a minute. The most important thing is to try to find a pattern before and after the text you need – one that is repeated for all of the venue names, but which won’t pick up extra information you don’t want.

The first instance is probably the one we want. The text before it starts title=" and the text after it is - art galleries in Hackney. This is perfect. Now put those into your ‘marker before’ and ‘marker after’ boxes, and name the column ‘Name’. Each row in the scraper becomes a column in your eventual spreadsheet. See below – that’s what you should now have in the bottom window.

[Screenshot: Screen Shot 2014-03-31 at 17.26.12]

Now onto the address. We only need the postcode, but we can sort that out later in Excel. This bit is even easier: each address (scroll down to double-check) has a <p class="adr"> before it and a </p> after it. Put those values in the ‘marker before/after’ boxes, and you’ve finished the scraper.
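The marker-before/marker-after idea is easy to reproduce in code, which can help when debugging a scraper. A minimal Python sketch, using the markers from this tutorial on an invented snippet of page HTML:

```python
# Sketch of Outwit's marker-before/marker-after scraping: capture every
# substring that sits between the two marker strings.

def scrape(html, before, after):
    """Return every substring found between the before/after markers."""
    results, start = [], 0
    while True:
        i = html.find(before, start)
        if i == -1:
            return results
        i += len(before)
        j = html.find(after, i)
        if j == -1:
            return results
        results.append(html[i:j])
        start = j + len(after)

# Invented fragment with the same shape as the directory page.
page = ('<a href="/x" title="Fortescue Avenue - art galleries in Hackney">'
        'Fortescue Avenue</a><p class="adr">12 Somewhere St, London, E8 3QB</p>')
names = scrape(page, 'title="', ' - art galleries in Hackney')
addresses = scrape(page, '<p class="adr">', '</p>')
```

Each call to scrape corresponds to one row of the Outwit scraper – one eventual spreadsheet column.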

Now, check it works. Click “Execute” at the top of the bottom window.

It should have come up with 20 rows of neat venue names and addresses, like this:

[Screenshot: Screen Shot 2014-03-31 at 17.32.15]

Now, this isn’t the end of the directory – there are over 100 items in the list on the Independent directory, spread over a few pages. You could go into each page and scrape it as we just scraped the first one, but why do that when you can automate this process too?

Scraping from multiple pages

If you go back to your normal browser and click on a different page at the bottom of the directory list, you’ll notice in the address bar that the difference between the pages is a number appended to the original URL, and it goes up to 7. Click on page 2 and copy the URL in the address bar.

Open Excel and open a new workbook. Paste the URL for page 2 into cell A2. Copy it into cell A1 and change the 2 at the end to a 1, because the first page must be that URL. Then highlight both cells and drag down until Excel automatically fills in cell A7 with the URL plus the 7. It should look like this:

[Screenshot: Screen Shot 2014-03-31 at 17.37.52]

Make sure that the column is wider than the text in it, otherwise the next part won’t work!
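The drag-fill is just building a numbered list of URLs, so you could equally generate it with a couple of lines of Python – the base URL below is a placeholder, not the directory’s real address:

```python
# Sketch of the Excel drag-fill step: build page URLs 1 to 7 by appending
# the page number to a base URL (placeholder URL, for illustration only).

BASE = "http://www.example.com/directory/hackney/art-galleries/"

urls = [BASE + str(page) for page in range(1, 8)]
```

You’d then save that list out for Outwit in the same way as the spreadsheet version.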

Next, save it as a web page. It depends which version of Excel you have, but it is either File > Save as Web Page, or File > Save as… and then you can choose web page from the drop-down. Call it something like “INDYURLs”.

Go to Outwit again, click File > Open… and choose the file you just saved. It will come up with an error, but that’s OK. To access it, click on Tables under Data on the right. It should look like this:

[Screenshot: Screen Shot 2014-03-31 at 17.42.33]

Select all of the rows (click on one and then CTRL/CMD-A), then right-click, go to Auto-explore, then Fast Scrape, then pick IndyDirectory. Within a few seconds it should have finished scraping, telling you that you have 139 scraped rows (more than we need, but never mind – we’ll sort that out in the next tutorial).

Select all, then right-click and Export as… I usually export as either CSV or Excel – CSV is the safest one to go for, as it’s more basic and doesn’t carry any formatting.

There’s your first scraper done! In the next tutorial, we’ll clean up the data and verify the postcodes.


2012 mayoral & assembly election turnout by borough

While I was looking for some datasets a few days ago, I stumbled upon a goldmine of data for London, which also included boundaries in shape files. This time, I wanted to map some very simple data. (Last time I was trying to do something so complicated that I couldn’t finish it).

It took me a little bit of tinkering (where, for example, names like Kensington & Chelsea weren’t matched to Kensington and Chelsea) and some thinking to figure out what to do with the colours. The lowest election turnout in 2012 was Barking and Dagenham with 27.7% coming out to vote, whereas the highest was only 48.5% in the City of London.

Naturally, this means there isn’t all that much difference between some of the boroughs – so instead of colouring by bucket (as is sometimes useful), I decided to pick the colours from a gradient running from red, through orange and yellow, to green.
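A gradient like this can be computed by mapping turnout linearly onto a hue running from red (0°) through orange and yellow to green (120°). A rough Python sketch of the idea – the HSV mapping is just one possible implementation, not necessarily the one used here:

```python
# Sketch of a red-to-green gradient: scale turnout onto hue 0..120 degrees.
# The turnout range comes from the post; the HSV approach is an assumption.

import colorsys

LOW, HIGH = 27.7, 48.5   # lowest and highest 2012 turnout (%)

def turnout_colour(pct):
    """Return an #rrggbb hex colour for a turnout percentage."""
    t = max(0.0, min(1.0, (pct - LOW) / (HIGH - LOW)))
    r, g, b = colorsys.hsv_to_rgb(t / 3.0, 1.0, 1.0)  # hue 0..1/3 = red..green
    return "#%02x%02x%02x" % (round(r * 255), round(g * 255), round(b * 255))
```

Barking and Dagenham’s 27.7% would come out pure red, and the City of London’s 48.5% pure green, with everything else shaded in between.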

It’s not all that interesting, really, but here it is: Mayoral & Assembly elections 2012. Questions, pointers, etc gratefully received @SophieWarnes.


Playing with Google Fusion – prison overcrowding

I saw a past student’s blogpost on Friday about using GF to create choropleth maps, and we didn’t do a data session on Friday, so I thought I would give it a go and try to learn something by myself over the weekend.

It started out well. Having decided to throw caution to the wind and abandon tutorials I’d seen online that used US boundaries and US data (all far easier to find, better documented, etc), I found a dataset I was interested in – England & Wales prison population numbers for the last month, i.e. September 2013. And I figured out what I wanted to do. To my total lack of surprise, it turns out a fair few prisons are currently overcrowded.

Wrong format, stupid

Now, helpfully (not), it was in Word format, and it took me just a few minutes to learn something that I have genuinely been irritated about for some time and never knew was possible. How do you change a Word document into an Excel document?

Here you go:

  1. Open Word document
  2. File -> Save as… HTML
  3. Open Excel
  4. File -> Open the file you just saved

Ridiculously simple.

Gathering data

So I had the data in the right format, but I wanted to show it visually as ‘by county’ not by prison (to be honest, this was a terrible piece of decision-making because it serves zero purpose, but I wanted to see if I could do it. Now, after spending hours on it, I’ve decided to go back to where I was at the start – more on that in a minute.)

I scraped the table of where the prisons are located in England and Wales from Wikipedia, using a Google Spreadsheet and the importHTML function. Then I merged the two tables together in Excel in a really complicated and time-consuming way, which I probably could have done automatically [copied and pasted in location data as an extra column, deleted ones that didn’t match up… it took me a while]. I then added up all the figures for each county, and calculated the percentage of ‘operational capacity’ taken up.
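The final calculation is simple enough to sketch in Python: sum the populations and operational capacities per county, then take the percentage. The prison figures below are invented:

```python
# Sketch of the per-county capacity calculation. The prison names and
# numbers are made up for illustration.

def occupancy_pct(prisons):
    """Percentage of total operational capacity in use across a county."""
    pop = sum(p["population"] for p in prisons)
    cap = sum(p["capacity"] for p in prisons)
    return round(100.0 * pop / cap, 1)

county = [
    {"name": "HMP Example A", "population": 420, "capacity": 300},
    {"name": "HMP Example B", "population": 180, "capacity": 200},
]
pct = occupancy_pct(county)   # 600 / 500 -> 120.0% full
```

Anything over 100 means the county’s prisons, taken together, hold more people than their stated operational capacity.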

But what does it all mean?

The column headings in the prison dataset aren’t immediately obvious. I had to do some searching to figure out what everything meant and, out of five or six columns, decide which ones I was going to use – or rather, which columns are the most accurate for finding out the extent of overcrowding in prisons. There’s no point in comparing ‘operational capacity’ against CNA, because that doesn’t take into account cells that can’t currently be used, for example. So I guess my point here is: make sure you know which figures you need to use.

Finding the borders

I must have spent about three to four hours just trying to find any way of getting what I wanted – county border data in a format that could be geocoded by Google Fusion. I cannot believe that there is not a file out there containing the data I need. I refuse to believe it. But I did try everything I could think of, and I’m still not there. I tried downloading shapefiles, converting those to KMLs, I tried importing KMLs directly, I tried merging other tables with geocoded data in… ARGH. Nothing works. I have found the files I thought I could use but I…couldn’t. I don’t know why.

When I eventually figure this all out, I’m going to make sure I keep a copy for future use. No way I’m letting it go!


Having failed in my quest to find borders that work with the county names I have, I have fully given up on the idea of doing it by border. In any case, it’s not really a true picture of the prison situation… some areas have several prisons, some have one – plus, as they are all added together, it doesn’t make much sense.

But this is as far as I’ve got with my map:

Prison Population by County Map

Remember that this is *all prisons in counties*, NOT individual prisons. The map tells you where there’s overcrowding in prison/s. The markers are as follows:

  • Green: 0-50% full
  • Yellow: 50-100% full
  • Red dots: 100-150% full
  • Red markers: 150%+ full

So you can see that the total capacity for Swansea and Bedfordshire has been exceeded by 50% or more. As it happens, both counties only have one prison, so those two are genuinely quite bad. The numbers are fairly small (we’re talking around 400 prisoners), but they are still over capacity.

Lessons learned…

  • Make sure you figure out the right way to visualise something before you embark on something that will take you hours to do
  • Use something easy
  • I need to figure out how to get borders to work

What next?

I’m not happy that the info is *markers* by county (so they’re super vague) instead of boundary-lines with colour, but I think the best thing for me to do is to take the data back to how it was, scrape the actual prison addresses from somewhere, and plot them exactly where they are. I think that will give a much clearer picture of what’s going on. When I’ve done that I’ll publish and link it properly so you can have a play around.
