Posts Tagged cleaning data

Behind the story, part two: cleaning and verifying data

In the first post I explained the purpose of this three-part tutorial. I was working on a couple of things for the Hackney Post, and I did this interactive map of entertainment venues in Hackney. I wanted to get the addresses, verify they were in Hackney, and then turn them into a map. You could of course, do this all by hand, or you could semi-automate the process by using scrapers and different tools.

The tutorial covers the following:

Part one: scraping with Outwit Hub
Part two: cleaning and verifying data
Part three: creating a map with Google Fusion

Cleaning messy data

Open up the file you just saved all of your data as. The first thing you might notice is that Outwit has scraped some extra rows that don’t have addresses in them. We need to get rid of those, so click Data, then Filter – it should have put arrows on both of the columns. Click on the arrow on the address column, and click ‘Ascending’. Go down to the bottom, and you’ll see there are venue names without addresses. Highlight the rows and delete them, then clear the filter.

We now have a clean dataset we can work with.

Note: any dataset you download will be much messier than this – and even with a great scraper sometimes you need to spend a while cleaning the data.

Separating text in columns

What we want to do now is separate the address of each row, because all we need is the postcode in order to verify that it’s actually in Hackney. Looking at the address column, you’ll notice there are three chunks of the address which are separated by a comma.

Highlight the address column, and CTRL (or CMD) F. Search for “, “. That is, comma space. The space is necessary as it means there won’t be any erroneous spaces at the beginning or end of the cells. Click Replace… and type in “@”, and then click Replace all. It’ll look like this:

Screen Shot 2014-03-31 at 18.09.03

Then, click on Data and then Text to columns. Make sure ‘delimited’ is highlighted. Click next, and then where it says delimiters, put @ in the ‘other’ box. This tells it to split by the @ we just inserted into the column. Then click finish.

Most of the postcodes should be in column D but some of them are in E or even C. You now need to go through and just ensure that all of the postcodes are in the same column. Delete the other columns with the street or city in them, so you have 2 columns again; one with the name of the venue, and one with the postcode. Save it, and we’re ready to

More cleaning with Open Refine

Now you need to open Refine [download it here], and create a new project. Open the file you just saved – find it in the browser and click next. If it’s a csv file, make sure that ‘columns are separated by’ commas (CSV). The preview should show two columns exactly as they were in the spreadsheet. Click next and then ‘create project’. Now the project should be ready to edit.

First, copy the column to another column so we have our original postcode data if anything goes wrong.

Click the arrow next to ‘Address’ and then ‘Edit Column’ and click ‘Rename column’. Rename it as original. Then click the arrow again, edit column, then add column based on this column. It should be identical. Call it postcode and click ok.

‘Edit Cells’ then ‘Transform’. Like this:

Type in: replace(value," ","")

This will take the value ” ” (space) and replace it with nothing, thereby effectively removing the space. Replace is a good command to know.

Using Open Refine to scrape

Now, the aim of this part of the work is to validate the postcodes by using a website which looks up postcodes and brings back information about them. There are a few, but I quite like this particular one called Planning Finder – it’s really useful for getting information about postcodes and planning projects in the area. It automatically comes up with the nearest planning project to the postcode, so we know that the first value it returns is either the exact postcode or very near.

We will validate our postcodes two-fold. Firstly, if the postcode isn’t recognised at all, then we’ll ignore it. And secondly, we’ll find out what area the postcode is in, and if it isn’t in Hackney, then we know we don’t want it. Click the arrow on the postcode column again, edit column, and then ‘Add column by fetching URLs’.

The expression is: "http://www.planningfinder.co.uk/search/near?postcode="+value

You’re telling it to go to the website string, plus the value of whatever is in that cell – the postcodes – and to then download everything on that page into the cell. You will end up with the HTML of the entire page, so it’ll take a long time – time to grab a cup of tea!

When it’s done fetching everything, you need to transpose the column. Arrow > Edit cells > Transpose. This is the expression you need: value.parseHtml().select("table.search-results")[0].select("td.info")[3].toString()

It’ll parse the HTML, selecting the first table with the class “search-results”, and the fourth cell in the row (0 in coding is always the first), with the class “info”. This cell contains the borough information – you should see that in the preview. Then the toString() will convert the response into a string – if you don’t do this, it will simply fail.

You should now have something that looks like this – you can see immediately that some of them aren’t in Hackney.

Screen Shot 2014-03-31 at 21.55.51

So, let’s export that and work on that in Excel. Export it as a CSV or Excel document (top right) and open it up.

Do the same as we did before with find and replace – highlight the whole of column PF, find <td class="info"> and replace with nothing, and find </td> and replace it with nothing. Delete the column Postcode, because that doesn’t have any spaces in it. Then you want to get rid of all of the non-Hackney postcodes.

So click on the top row, go to Data, then Filter. Click on the arrow next to the PF column, and untick the two Hackney boroughs – you should now have all of the non-Hackney rows displayed. Select them all (click on the top row and drag down – do not CTRL/CMD-A as this won’t work) and then delete the rows. Clear the filter, and you should have 61 rows (60 venues, and the title row) with exactly what you want in them. Delete the column PF, as we don’t need that anymore, save it, and then you’re done!

In the next (and last) tutorial, we’ll look at making a map with Google Fusion.

Advertisements

, , , ,

Leave a comment