Behind the story, part three: creating a map with Fusion

In the first post I explained the purpose of this three-part tutorial. I was working on a couple of things for the Hackney Post, and I did this interactive map of entertainment venues in Hackney. I wanted to get the addresses, verify they were in Hackney, and then turn them into a map. You could, of course, do this all by hand, or you could semi-automate the process by using scrapers and different tools.

The tutorial covers the following:

Part one: scraping with Outwit Hub
Part two: cleaning and verifying data
Part three: creating a map with Google Fusion

Using Google Fusion

Go to your Google Drive (if you don’t have one, you can get one very easily if you use Gmail – just go to and it should explain everything there). Click on create, and then click on Fusion Table. If you don’t have that option, just click “connect more apps” and you should be able to find it easily.

Screen Shot 2014-04-01 at 09.46.52

When prompted, choose the file you saved in the last tutorial (if it’s a CSV file, ensure that “comma” is checked for separator character) and click next. It should display a preview of the information you have in the table, so click next again, and then finish – unless you want to change some of these values, which may be useful later but we don’t need to tinker with them now. It should look a little something like this, with an extra map tab to the right:

Screen Shot 2014-04-01 at 09.52.46

The highlighted cells – the “Name” values – are what Google is suggesting it should use to find the locations, if you want to make a map. Of course, this is wrong, so we can change it by clicking on the arrow next to the Postcode column, then Change…, and changing the type from Text to Location. All of the Postcode column should now be highlighted, and some of the Name values will be highlighted too. Change the Name column type from Location to Text, and then Google Fusion should start geocoding.

Screen Shot 2014-04-01 at 09.57.01

This is basically the process through which Google is figuring out what the location values mean – it’s thinking about how to process the information in the postcode column, because location values can be in many different forms. When that’s done, you want to click on the Map tab, and Google should have figured out that all of the pins are in Hackney, London.

Rename your map “Art galleries in Hackney”, so we don’t forget to do it later. Then click “Change feature styles…” – this is where you can change the icon if you want to. You can use this to specify gradients depending on values, or different icons for different values. In the original map I made, you’ll see that theatres, art galleries, and museums are all different colours. Before I imported into Fusion, I added an extra column with values like large_green, large_blue and large_red, one for each venue type – and then I specified that column in the feature styles option.
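If you’re preparing that icon column in advance, it’s just a lookup from venue type to icon name. A minimal Python sketch – the venue rows are illustrative, and while large_green and friends are Fusion’s built-in icon names, the type-to-colour mapping here is my own choice:

```python
# Add a Fusion Tables icon name as an extra column, one per venue type.
# The venue rows below are made up for illustration.
ICONS = {
    "art gallery": "large_green",
    "theatre": "large_blue",
    "museum": "large_red",
}

venues = [
    ("Fortescue Avenue", "E8 3QB", "art gallery"),
    ("Hackney Empire", "E8 1EJ", "theatre"),
]

# Each output row keeps name and postcode, and appends the icon name.
rows_with_icons = [
    (name, postcode, ICONS[venue_type])
    for name, postcode, venue_type in venues
]

print(rows_with_icons[0])  # -> ('Fortescue Avenue', 'E8 3QB', 'large_green')
```

You would then point “Change feature styles…” at that third column.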

For now, we just want it to be a large green icon. So click that and then save. If you click on one of the green pins now, it should come up with a little window, like this. We want to change that.

Screen Shot 2014-04-01 at 10.03.33

Click on “Change info window…” and you can see exactly what just came up. If you click on Custom, you can customise it to say whatever you want. I’d like to bold the name and take out the postcode label. Move {Name} in between the <b> tags where Name: is, and get rid of the Name: and Postcode: text. It should look like this:

Screen Shot 2014-04-01 at 10.06.42

Now, when you click on a pin, it should have the name in bold, and the postcode written underneath it. And we’re done. Zoom in so you can see all of the pins as closely as possible (this is how people will see the map when they first click on it, so it’s important!). Now, click the arrow next to the map tab, and Publish. You’ll need to change permissions to “anyone with a link” to be able to embed it and send links to people. Check that the link works, and you’re done!

Making more elaborate maps

In the original piece I did for the Hackney Post, I had the entire address, phone numbers where possible, and hyperlinks to the venue websites, all contained in the info window on the map. I had retained this information from the various scrapes I did of the Independent Directory and other places, and I also scraped the first ten results for each venue from Google, using Outwit Hub. This tutorial made a really simple map that can be put together quite quickly once you have the data, but there is so much more you can do with Google Fusion.


Behind the story, part two: cleaning and verifying data

In the first post I explained the purpose of this three-part tutorial. I was working on a couple of things for the Hackney Post, and I did this interactive map of entertainment venues in Hackney. I wanted to get the addresses, verify they were in Hackney, and then turn them into a map. You could, of course, do this all by hand, or you could semi-automate the process by using scrapers and different tools.

The tutorial covers the following:

Part one: scraping with Outwit Hub
Part two: cleaning and verifying data
Part three: creating a map with Google Fusion

Cleaning messy data

Open up the file you saved all of your data to in the last tutorial. The first thing you might notice is that Outwit has scraped some extra rows that don’t have addresses in them. We need to get rid of those, so click Data, then Filter – it should put arrows on both of the columns. Click on the arrow on the address column, and click ‘Ascending’. Go down to the bottom, and you’ll see there are venue names without addresses. Highlight those rows and delete them, then clear the filter.

We now have a clean dataset we can work with.

Note: any dataset you download will be much messier than this – and even with a great scraper sometimes you need to spend a while cleaning the data.

Separating text in columns

What we want to do now is separate the address of each row, because all we need is the postcode in order to verify that it’s actually in Hackney. Looking at the address column, you’ll notice there are three chunks of the address which are separated by a comma.

Highlight the address column, and press CTRL-F (or CMD-F). Search for “, ” – that is, comma space. The space is necessary: it means there won’t be any stray spaces at the beginning or end of the cells. Click Replace…, type in “@”, and then click Replace all. It’ll look like this:

Screen Shot 2014-03-31 at 18.09.03

Then, click on Data and then Text to columns. Make sure ‘delimited’ is highlighted. Click next, and then where it says delimiters, put @ in the ‘other’ box. This tells it to split by the @ we just inserted into the column. Then click finish.

Most of the postcodes should be in column D, but some of them are in E or even C. You now need to go through and ensure that all of the postcodes are in the same column. Delete the other columns with the street or city in them, so you have two columns again: one with the name of the venue, and one with the postcode. Save it, and we’re ready to move on to Open Refine.
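Incidentally, the whole find-and-replace, text-to-columns dance is just splitting on “, ”. A minimal Python sketch, assuming the postcode is always the last comma-separated chunk (the address here is made up):

```python
# Split an address on ", " and take the LAST chunk as the postcode.
# Taking the last chunk copes with addresses that have a variable
# number of parts - the same problem that scatters postcodes across
# columns C, D and E in the Excel approach.
def postcode_from_address(address: str) -> str:
    return address.split(", ")[-1].strip()

print(postcode_from_address("328a Fortescue Avenue, London, E8 3QB"))  # -> E8 3QB
```

Run over every row, this gives you the two-column name/postcode layout in one pass.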

More cleaning with Open Refine

Now you need to open Refine [download it here], and create a new project. Open the file you just saved – find it in the browser and click next. If it’s a CSV file, make sure that ‘columns are separated by’ is set to commas (CSV). The preview should show two columns exactly as they were in the spreadsheet. Click next and then ‘create project’. Now the project should be ready to edit.

First, copy the column to another column so we have our original postcode data if anything goes wrong.

Click the arrow next to ‘Address’ and then ‘Edit Column’ and click ‘Rename column’. Rename it as original. Then click the arrow again, edit column, then add column based on this column. It should be identical. Call it postcode and click ok.

Then click the arrow on the new postcode column, choose ‘Edit cells’, then ‘Transform…’. Like this:

Type in: replace(value," ","")

This takes every “ ” (space) in the value and replaces it with nothing, effectively removing the spaces. replace() is a good command to know.

Using Open Refine to scrape

Now, the aim of this part of the work is to validate the postcodes by using a website which looks up postcodes and brings back information about them. There are a few, but I quite like this particular one called Planning Finder – it’s really useful for getting information about postcodes and planning projects in the area. It automatically comes up with the nearest planning project to the postcode, so we know that the first value it returns is either the exact postcode or very near.

We will validate our postcodes in two ways. Firstly, if the postcode isn’t recognised at all, then we’ll ignore it. And secondly, we’ll find out what area the postcode is in, and if it isn’t in Hackney, then we know we don’t want it. Click the arrow on the postcode column again, edit column, and then ‘Add column by fetching URLs’.

The expression is: ""+value

You’re telling it to go to the website string, plus the value of whatever is in that cell – the postcodes – and to then download everything on that page into the cell. You will end up with the HTML of the entire page, so it’ll take a long time – time to grab a cup of tea!

When it’s done fetching everything, you need to transform the new column. Arrow > Edit cells > Transform…. This is the expression you need: value.parseHtml().select("table.search-results")[0].select("td.info")[3].toString()

It’ll parse the HTML, selecting the first table with the class “search-results”, and then the fourth cell with the class “info” (indexes start at 0, so 3 is the fourth). This cell contains the borough information – you should see that in the preview. Then toString() converts the result into a string – if you don’t do this, it will simply fail.
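For comparison, here’s the extraction step sketched in Python. The HTML snippet just mimics the page structure described above (a table of class search-results with cells of class info); the real lookup page will differ, and I’ve simplified to grabbing the first info cell rather than the fourth:

```python
import re

# A made-up snippet shaped like the lookup page described in the text.
sample_html = (
    '<table class="search-results"><tr>'
    '<td>1</td><td class="info">Hackney</td>'
    '</tr></table>'
)

def borough_from_page(html: str) -> str:
    # Grab the contents of the first <td class="info"> cell, which is
    # where the borough name sits on the lookup page.
    match = re.search(r'<td class="info">(.*?)</td>', html, re.DOTALL)
    return match.group(1) if match else ""

# In Refine this runs on the fetched page; here we use the sample.
# (Fetching would be e.g. urllib.request.urlopen(lookup_url + postcode).)
print(borough_from_page(sample_html))  # -> Hackney
```

GREL’s parseHtml()/select() is the more robust route; a regex like this is just the quickest way to show what the expression is doing.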

You should now have something that looks like this – you can see immediately that some of them aren’t in Hackney.

Screen Shot 2014-03-31 at 21.55.51

So, let’s export that and work on that in Excel. Export it as a CSV or Excel document (top right) and open it up.

Do the same as we did before with find and replace – highlight the whole of column PF, find <td class="info"> and replace with nothing, and find </td> and replace it with nothing. Delete the column postcode, because that’s the version without any spaces in it – we’ll keep the original. Then you want to get rid of all of the non-Hackney postcodes.

So click on the top row, go to Data, then Filter. Click on the arrow next to the PF column, and untick the two Hackney values – you should now have all of the non-Hackney rows displayed. Select them all (click on the top row and drag down – do not CTRL/CMD-A, as this won’t work) and then delete the rows. Clear the filter, and you should have 61 rows (60 venues, and the title row) with exactly what you want in them. Delete the column PF, as we don’t need it anymore, save it, and then you’re done!
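The filter-and-delete step is also a one-liner in code. A sketch with made-up rows; matching on the substring “Hackney” is an assumption meant to cover however the borough appears in the PF column:

```python
# Keep only rows whose looked-up borough mentions Hackney.
# The sample rows are illustrative, not the real scraped data.
rows = [
    ("Venue A", "E8 3QB", "Hackney"),
    ("Venue B", "E2 7RG", "Tower Hamlets"),
    ("Venue C", "E9 6DA", "London Borough of Hackney"),
]

hackney_rows = [row for row in rows if "Hackney" in row[2]]

print(len(hackney_rows))  # -> 2
```

The kept rows can then be written back out to CSV with the PF column dropped.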

In the next (and last) tutorial, we’ll look at making a map with Google Fusion.


Behind the story, part one: scraping with Outwit

Recently I was working on a couple of things for the Hackney Post, and I did this interactive map of entertainment venues in Hackney. I wanted to get the addresses, verify they were in Hackney, and then turn them into a map. You could, of course, do this all by hand, or you could semi-automate the process by using scrapers and different tools.

The tutorial covers the following:

Part one: scraping with Outwit Hub
Part two: cleaning and verifying data
Part three: creating a map with Google Fusion

Scraping a single page

I used several different sources, but for the sake of brevity, the page we’re going to scrape from is

Open up Outwit Hub [download or add the Firefox extension here], paste the URL in the address bar (like a normal browser) and make sure you’re on the “Page” tab at the very top. Find the first instance you want to get. In this case, we want to find the venue name, and the address. The first one is “Fortescue Avenue”.

We’re ready to scrape. Click on the ‘scrapers’ tab under ‘automators’ and click ‘New’. Call your new scraper IndyDirectory – it should look like so:

Then CTRL-F (or CMD-F if you’re on a Mac) and start typing in Fortescue Avenue, because that’s what we need to find. It’ll scroll down to where it’s found – this is the first instance of it.

You’ll see there are actually three places where Fortescue Avenue is written. One is in the HTML telling browsers it’s a link (after the <a href tag), one is the text link for the URL (in between the opening and closing <a> tags), and one is part of the address. We’ll get the address in a minute. The most important thing is to try to find a pattern before and after the text you need – one which is repeated down all of the venue names, but which won’t pick up extra information you don’t want.

You can see that the text before the first instance is probably the one we want. It starts title=" and ends - art galleries in Hackney. This is perfect. Now put those into your ‘marker before’ and ‘marker after’ boxes, and name the column ‘Name’. Each row in the scraper is a new column in your eventual spreadsheet. See below – that’s what you should now have in the bottom window.

Screen Shot 2014-03-31 at 17.26.12

Now onto the address. We only need the postcode but we can sort that out later in Excel. This bit is even easier. Each address (scroll down to double check) has a <p class="adr"> before it, and a </p> after. Put those values in the ‘marker before/after’ boxes, and you’ve finished the scraper.
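Outwit’s ‘marker before/after’ boxes behave like a non-greedy match between two delimiters, which you can reproduce in a few lines of Python. The page fragment below is invented, but shaped like the directory page described above:

```python
import re

# A made-up fragment shaped like the directory page: a link whose
# title attribute carries the venue name, followed by the address.
page = (
    '<a href="/venues/1" title="Fortescue Avenue - art galleries in '
    'Hackney">Fortescue Avenue</a>'
    '<p class="adr">328a Fortescue Avenue, London, E8 3QB</p>'
)

def scrape_between(html, before, after):
    # Everything between the two markers, non-greedy - the same idea
    # as Outwit's 'marker before' / 'marker after' boxes.
    pattern = re.escape(before) + r"(.*?)" + re.escape(after)
    return re.findall(pattern, html, re.DOTALL)

names = scrape_between(page, 'title="', " - art galleries in Hackney")
addresses = scrape_between(page, '<p class="adr">', "</p>")

print(names)      # -> ['Fortescue Avenue']
print(addresses)  # -> ['328a Fortescue Avenue, London, E8 3QB']
```

Because findall walks the whole page, one pair of markers pulls out every venue on the page, just like the Outwit scraper does.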

Now, check it works. Click “Execute” at the top of the bottom window.

It should have come up with 20 rows of neat venue names and addresses, like this:

Screen Shot 2014-03-31 at 17.32.15

Now, this isn’t the end of the directory – there are over 100 items in the list on the Independent directory, spread out over a few pages. You could go into each page and scrape it as we just scraped the first one, but why do that when you can automate this process too?

Scraping from multiple pages

If you go back to your normal browser and click on a different page at the bottom of the directory list, you’ll notice in the address bar that the difference between the pages is a number appended to the original URL, and it goes up to 7. Click on page 2 and copy the URL in the address bar.

Open Excel and open a new workbook. Paste the URL for page 2 into cell A2. Copy it into cell A1 and change the 2 at the end to a 1 – that’s the URL for the first page. Then highlight both cells and drag down until Excel automatically fills in cell A7 with the URL ending in 7. It should look like this:
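Excel’s drag-to-fill is just string concatenation, so you can also generate the URL list in code. A sketch with a placeholder base URL (the real directory URL isn’t reproduced in this post):

```python
# Build the seven page URLs by appending the page number to a base
# URL. BASE_URL is a placeholder - substitute the real directory URL.
BASE_URL = "http://example.com/directory/hackney/page/"

urls = [BASE_URL + str(page) for page in range(1, 8)]

print(len(urls))  # -> 7
print(urls[0])    # -> http://example.com/directory/hackney/page/1
```

Write those out one per line and you have the same list the spreadsheet gives you, ready to feed to the scraper.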

Screen Shot 2014-03-31 at 17.37.52

Make sure that the column is wider than the text in the column, otherwise the next part won’t work!

Next, save it as a web page. It depends which version of Excel you have, but it is either File > Save as Web Page, or File > Save as… and then you can choose web page from the drop-down. Call it something like “INDYURLs”.

Go to Outwit again, click File > Open… and choose the file you just saved. It will come up with an error, but that’s OK. To access the data, click on Tables under Data on the right. It should look like this:

Screen Shot 2014-03-31 at 17.42.33

Select all of the rows (click on one and then CTRL/CMD-A), then right-click, go to Auto-explore, then Fast Scrape, then pick IndyDirectory. Within a few seconds, it should have finished scraping, and it will tell you that you have 139 scraped rows (more than we need, but never mind – we’ll sort that out in the next tutorial).

Select all, and then right-click, export as… I usually export in either CSV or Excel – CSV is the safest one to go for, as it’s more basic and doesn’t have any formatting in it.

There’s your first scraper done! In the next tutorial, we’ll clean up the data and verify the postcodes.


The soldier suicides: Good story, done badly

The Metro had a story on its site last month about soldiers committing suicide, which I think is a really good and important story, but it could have been presented and written in a much better way. It’s here, if you want to have a look.

Firstly, we’re bombarded with figures but not really given enough space to actually make sense of them. It starts:

Every 65 minutes, a US military veteran takes his own life, the equivalent of 22 former soldiers committing suicide every day.

This would be much better if it was just written in a straight way, omitting the 65 minutes. While the “every 65 minutes” sounds more shocking, it’s a weird amount of time to get your head around. If it was 60 minutes (a unit of time we are all familiar with in our daily lives), it would have a lot more impact, but obviously you can’t just make up the numbers to suit your piece.
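For what it’s worth, the two figures are the same statistic expressed two ways, and the arithmetic checks out:

```python
# 22 suicides per day, spread evenly over 24 hours, is one roughly
# every 65 minutes.
minutes_per_day = 24 * 60            # 1440
minutes_between = minutes_per_day / 22

print(round(minutes_between, 1))     # -> 65.5
```

So “every 65 minutes” is a rounding of one every 65 and a half minutes, which only reinforces the point that the daily figure is the easier one to grasp.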

…more active-duty US soldiers killed themselves (177) in 2012 than died in war zones (176).

Across all services, 349 military personnel committed suicide last year, compared to the 311 American troops killed in combat.

Yet more numbers that are difficult to make sense of easily. It may have been better to write it up without the numbers but show the figures in a chart, like this:

Soldier suicides

This example chart I just whipped up isn’t that great, because there are only two pieces of information in it, but the original data could have been interrogated to compare suicides with those killed in combat over a number of years. Has the number of suicides risen? Yes – the piece already explains that, but there is no way for the reader to explore that data themselves. The data has been cherry-picked, and as someone interested in figures and numbers, I’m left wondering what’s been left out.

There are some other interesting figures, too, relating to the number of suicides between 2001 and 2011 after tours of Afghanistan and Iraq. This is, again, just written out – it would be much better as a visualisation, easing the numbers into the piece rather than confronting the reader with so many figures that it’s hard to get a good understanding of what exactly is going on. Here’s one way they could have done it:

Soldier suicides 2

Talking of infographics and charts, the infographic included in the article was made for the paper. It is longer than the screen (well, MY screen – on my MacBook Pro) and effectively blocks the article in a way which isn’t that user-friendly. The two most useful points on this graphic are the ‘4,552 suicides by men in Britain’ (bottom right corner), and the donut chart at the top right corner, which explains that 74% of military cases assessed in 2011/12 had a mental disorder. That’s something worth charting in a more pleasing way, though, I think. Leading with a raw number (5,404 – the total number of military cases) doesn’t make a whole lot of sense here, whereas “nearly three quarters” or 74% is more immediately compelling.

The three soldier heads in the infographic are a bit strange – they don’t universally indicate ‘three times as likely’ (rather, I assumed it was something to do with three PEOPLE, eg three soldiers committing suicide every x hours), and the clock also doesn’t make much sense in this context, because (as I explained above) it’s a strange number that doesn’t translate well into a simple diagram using units we are already conversant with.

Overall I think the piece could have done with more efficient and lively infographics which pick out key figures in a way which can be easily identified on the first read, rather than forcing the reader to go back through and spend time trying to fully get to grips with what’s going on. I also think that the numbers could have been sprinkled throughout the piece in a more subtle way – again, to save the reader the work of having to sit there and figure out which numbers mean what.


How accurate are online reputation tools?

Last week, we were introduced to PeerIndex, Klout and Kred. I decided to check out what scores I had on each, and see if the information they hold about me is accurate or not.


Klout score

Klout has scored me 39 out of 100 for online influence.

It’s highlighted my topics as Journalism, Twitter, Photography, Feminism, and Activism. So far, so good. When I go to manage my topics it then suggests London, Dubai, Art, Police, and Social Media. I understand London (live there), Dubai (have friends there, have tweeted about it a lot), and social media (it’s an obvious interest of mine). Art I can also understand – I love street art – but Police? Why?

Then it throws up even weirder suggestions. Greece? The NUS? Final Cut Pro? I have perhaps tweeted about Greece a few times but am no way near expert or especially interested in it beyond the Eurozone crisis, and since starting my account I’ve not tweeted about or been involved in student politics. I haven’t used Final Cut Pro for years (though admittedly I do know how to use it) and I’ve certainly never tweeted about it. That said, it does have my LinkedIn credentials and FCP is listed as a piece of software I can use.

Do I recognise myself in there? Yes, definitely. But I do feel that it’s not wholly representative of how I present myself online or how I actually am, and I think that out of context (ie without knowing why on earth Dubai is in there or my feelings about Dubai), those interests could be misinterpreted somewhat.


Kred score

Kred has given me 778 influence and an outreach level of 8. Sounds pretty impressive, but I’m not sure how it came to that conclusion.

The communities that I am apparently involved in are quite interesting. Travel, reporters and publishing are all fairly obvious – I’ve tweeted photos etc from my trips abroad, I’m connected to a lot of reporters on Twitter, and I have ‘journalist’ in my Twitter name.

I’m not a parent, but I do follow a great many people who might have children (let’s face it, it’s unusual to find people without children when they are over a certain age, and I tend to follow people who are older than me anyway). Automotive was the one that really made me laugh, though. Aside from having a car, which I barely drive as is, I don’t have any interest in cars or the automotive industry.


PeerIndex

PeerIndex seems to have me down as a bit of a bore. You can see my Top Topics to the right. I wouldn’t say they are particularly wrong, it’s just that the combination of all five paints me as a very different person than the impression you’d get from a) seeing me; b) a full list of my interests.

I’m not sure where Museums came from – when I was in New York, I gushed about how great the Met was, but I don’t actually go to museums that much, nor do I tweet about them. Spirits and cocktails also makes me sound like a connoisseur – I’m flattered, but it’s not in the least true!

PeerIndex has put my ‘benchmark topics’ as arts and entertainment (50), news and society (49) and science and environment (48). It also has sport in there (18) – would that be a remnant from when I sarcastically live-tweeted Euro 2012? I have zero interest in sport!

Lastly, influencers and influenced… According to PeerIndex I’m influenced by people who, although I like them, have very little impact on my clicking/retweeting behaviour. And, looking at the list of people I supposedly influence, there are some faces I’ve never even seen before. I’m pretty good at remembering Twitter names/pictures, so it’s strange that there are some people I’ve just never heard of. And I’m influencing them? Hm. OK.

So what now?

Well, I think these tools can be helpful if they aren’t used as the be-all and end-all of your social media strategy, but more as a guideline as to what makes a good Twitter account. The factors that are taken into account are telling as to your ‘social media personality’ – not in a definitive, these-are-your-interests way, but personality in the way that you socialise online. For instance, Kred values ‘outreach’ – that is, how generous you are to other people: replying, mentioning, retweeting or following others. This makes perfect sense and places emphasis on the frankly undervalued and misunderstood ‘social’ part of social media; after all, you wouldn’t turn up to a party empty-handed and without saying hello to anyone!

As for which of these is ‘the best’… At a push, I would probably go with Klout, but they are all hit and miss and I suspect they lack intrinsic value in themselves, only being of genuine use in certain contexts.


Excess Winter Deaths in England, visualised

It’s been all over the news today, so I whipped up a quick graphic about Excess Winter Deaths. Click on the picture to see the visualisation and get more information on what each bar represents. Source: Office for National Statistics.

Excess Winter Deaths in England


Google searches as a prediction tool

I’m still reading through Nate Silver’s book The Signal and the Noise, which is about using data to predict certain outcomes. He’s best known for accurately predicting the outcome of the US presidential election, but his book discusses baseball, the weather, and earthquake prediction, amongst other things.

I’m up to the point where he discusses the economy and how horribly difficult it is for economists to accurately predict how it will behave – from those who are simply overly optimistic, to those who are just wrong because they see the economy as a set of variables, rather than a story. Or those who need to predict a certain way because it suits their political agenda.

He never claims it’s easy, but he mentions that trends in Google searches could probably ‘predict’ economic failures before anybody else really notices. I think that one reason for this is that financial problems or worries tend to be kept private. People hear rumours at work that there will be a round of redundancies – and immediately they’re on Google looking for another job, not necessarily reaching out to other people directly.

What do Google searches show?

‘Find a job’

It’s perhaps hard to see when the picture is small, but a short glance at the trend for ‘find a job’ is quite interesting. See the short spike upwards every so often? This seems to indicate that in January, many more people in the UK are looking for a new job. It seems to be a common New Year’s Resolution! However, in the years that we know in hindsight were problematic, there doesn’t seem to be much of a difference compared to previous years (in fact, 2004-2006 seems to be particularly rough, with far more interest in ‘find a job’ than in recent years).

‘Unemployed’ and ‘unemployment’

Here, the red line indicates searches for ‘unemployment’, and the blue for ‘unemployed’. You can see a dramatic increase in the number of searches in late 2008. In fact, the exact month that interest in ‘unemployment’ starts to increase is September 2008, when Lehman Brothers was declared bankrupt, suggesting that it sent shockwaves throughout the UK as well as the US.



Now, this chart is quite interesting. It shows the trend for JSA – Job Seeker’s Allowance. Obviously it’s hard to fully understand the motivations behind searching this but I note two things. Firstly, there’s a sharp incline at the very beginning of 2011. I believe that at the end of 2010 and beginning of 2011, the government started initiating changes in the way that benefits work, leading more people to look at JSA online.

Secondly, I think that as a result of this, there is likely a general increase in interest in JSA as a whole. I wouldn’t say this is particularly indicative of much other than genuine interest in JSA itself, or the changes to JSA, rather than people searching specifically to ‘get’ JSA. Of course, if people are out of work, then they are more likely to be interested in how to get JSA, but I would be wary of directly and explicitly linking the two.

‘Apply for JSA’

From October 2010 to November 2010, the searches for ‘Apply for JSA’ start coming in. Two years after the Lehman collapse and the financial crisis began, but also a few months into a new coalition government. A cursory search online suggests that there were a lot of policy suggestions/ideas about changing benefits around October and November 2010 – yet, again, that doesn’t fully explain the graph we see above. Renewed interest in applying for JSA as a result of loss of jobs logically follows a financial crisis, but it doesn’t account for the complete absence of searches for ‘Apply for JSA’ prior to that. There would have been people made redundant from 2005-2010 – or at least, there is no reason I can think of to believe this is not the case.

Google as a ‘state of the economy’ indicator

These are extremely brief forays into what historical search data can reveal about how policies and economic situations impact people directly, and influence the things they search for. Could it be used to predict or see in ‘real time’ what on earth is going on with the economy, before ‘official’ indicators and experts do? I think it is possible, if you are finely tuned to how the graph should look ‘normally’, and you can recognise the warning signs and key differences. But it needs to be used very wisely indeed, with as much background/context to the statistics as possible, with an open mind as to what indicators (search terms) you should look for, and with an awareness that not everything can be fully explained in statistics.

Because as we learned earlier, ‘the economy’ is not simply a mass of variables and inputs to be analysed, but a whole range of intricate and complex stories to be examined.


2012 mayoral & assembly election turnout by borough

While I was looking for some datasets a few days ago, I stumbled upon a goldmine of data for London, which also included boundaries in shape files. This time, I wanted to map some very simple data. (Last time I was trying to do something so complicated that I couldn’t finish it).

It took me a little bit of tinkering (where, for example, names like Kensington & Chelsea weren’t matched to Kensington and Chelsea) and some thinking to figure out what to do with the colours. The lowest election turnout in 2012 was Barking and Dagenham with 27.7% coming out to vote, whereas the highest was only 48.5% in the City of London.

Naturally, this means there isn’t all that much difference in some of the boroughs – so instead of colouring by bucket (as is sometimes useful), I decided to pick the colours based on a gradient going from red, through orange, through yellow, to green.
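One way to build that kind of gradient programmatically is to interpolate over the observed turnout range, passing through orange on the way from red to yellow. A sketch – the interpolation scheme here is my own, not what the mapping tool actually does:

```python
# Linearly interpolate red -> yellow -> green across the observed
# turnout range (27.7% lowest, 48.5% highest, per the post).
LOW, HIGH = 27.7, 48.5

def turnout_colour(turnout: float) -> tuple:
    # Normalise turnout to t in [0, 1].
    t = (turnout - LOW) / (HIGH - LOW)
    t = min(max(t, 0.0), 1.0)
    if t <= 0.5:
        # Red (255,0,0) towards yellow (255,255,0); orange falls
        # naturally in between.
        return (255, int(round(510 * t)), 0)
    # Yellow towards green (0,255,0).
    return (int(round(510 * (1 - t))), 255, 0)

print(turnout_colour(27.7))  # -> (255, 0, 0)
print(turnout_colour(48.5))  # -> (0, 255, 0)
```

Interpolating over the actual min–max range, rather than 0–100%, is what keeps visibly different colours despite the boroughs all sitting in a narrow band of turnout.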

It’s not all that interesting, really, but here it is: Mayoral & Assembly elections 2012. Questions, pointers, etc gratefully received @SophieWarnes.


Playing with Google Fusion – prison overcrowding

I saw a past student’s blogpost on Friday about using Google Fusion to create choropleth maps, and we didn’t do a data session on Friday, so I thought I would give it a go and try to learn something by myself over the weekend.

It started out well. Having decided to throw caution to the wind and abandon tutorials I’d seen online that used US boundaries and US data (all far easier to find, better-documented, etc), I found a dataset I was interested in – England & Wales prison population numbers in the last month, ie September 2013. And I figured out what I wanted to do. To my total lack of surprise, it turns out a fair few prisons are currently overcrowded.

Wrong format, stupid

Now, helpfully (not), it was in Word format, and it took me just a few minutes to learn something that I have genuinely been irritated about for some time and never knew was possible. How do you change a Word document into an Excel document?

Here you go:

  1. Open Word document
  2. File -> Save as… HTML
  3. Open Excel
  4. File -> Open the file you just saved

Ridiculously simple.

Gathering data

So I had the data in the right format, but I wanted to show it visually as ‘by county’ not by prison (to be honest, this was a terrible piece of decision-making because it serves zero purpose, but I wanted to see if I could do it. Now, after spending hours on it, I’ve decided to go back to where I was at the start – more on that in a minute.)

I scraped the table of prison locations in England and Wales from Wikipedia, using a Google Spreadsheet and its importHTML function. Then I merged the two tables together in Excel in a really complicated and time-consuming way, which I probably could have done automatically. [I copied and pasted in the location data as an extra column and deleted the rows that didn’t match up… it took me a while.] I then added up the figures for each county and calculated the percentage of ‘operational capacity’ taken up.
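The merge-and-aggregate step I did by hand is exactly what pandas is built for. A sketch under stated assumptions – the column names and all the figures below are made-up placeholders, not the real September 2013 numbers:

```python
import pandas as pd

# Hypothetical extract of the prison population table (placeholder figures).
prisons = pd.DataFrame({
    "Prison": ["Alpha", "Bravo", "Charlie"],
    "Population": [436, 500, 1200],
    "Operational capacity": [280, 320, 1250],
})

# Hypothetical extract of the Wikipedia locations table.
locations = pd.DataFrame({
    "Prison": ["Alpha", "Bravo", "Charlie"],
    "County": ["Swansea", "Bedfordshire", "Greater London"],
})

# Join on the prison name, then total up each county and compute % full.
merged = prisons.merge(locations, on="Prison", how="left")
by_county = merged.groupby("County")[["Population", "Operational capacity"]].sum()
by_county["Pct full"] = 100 * by_county["Population"] / by_county["Operational capacity"]
```

A left merge keeps every prison even when a location is missing, which makes the rows that “didn’t match up” easy to spot and fix instead of hunting for them by eye.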

But what does it all mean?

The column headings in the prison dataset aren’t immediately obvious. I had to do some searching to figure out what everything meant and, out of five or six columns, decide which ones I was going to use… or rather, which columns are the most accurate for finding out the extent of overcrowding in prisons. There’s no point in using ‘operational capacity’ against CNA (Certified Normal Accommodation), because that doesn’t take into account cells that can’t currently be used, for example. So I guess my point here is: make sure you know which figures you need to use.

Finding the borders

I must have spent three to four hours just trying to find any way of getting what I wanted – county border data in a format that could be geocoded by Google Fusion. I cannot believe that there is not a file out there containing the data I need. I refuse to believe it. But I did try everything I could think of, and I’m still not there. I tried downloading shapefiles and converting those to KMLs, I tried importing KMLs directly, I tried merging other tables with geocoded data in… ARGH. Nothing works. I found files I thought I could use, but… couldn’t get them to work. I don’t know why.

When I eventually figure this all out, I’m going to make sure I keep a copy for future use. No way I’m letting it go!


Having failed in my quest to find borders that work with the county names I have, I have fully given up on the idea of doing it by border. And in any case, it’s not really a true picture of the prison situation… some areas have several prisons, some have one – and since all their figures are added together, it doesn’t make much sense.

But this is as far as I’ve got with my map:

Prison Population by County Map

Remember that this is *all prisons in each county*, NOT individual prisons. The map tells you where a county’s prisons are overcrowded. The markers are as follows:

  • Green: 0–50% full
  • Yellow: 50–100% full
  • Red dots: 100–150% full
  • Red markers: 150%+ full
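Those buckets are just thresholds on the percentage-full figure, so picking a marker style is a one-liner per band. A tiny sketch (the `marker_for` helper and the boundary handling are my own – the original buckets overlap at 50% and 100%, so I’ve treated the lower bound as inclusive):

```python
def marker_for(pct_full):
    """Map a county's percentage-full figure to the map's marker buckets."""
    if pct_full < 50:
        return "green"
    if pct_full < 100:
        return "yellow"
    if pct_full < 150:
        return "red dot"
    return "red marker"
```

In Google Fusion itself this corresponds to setting the marker icon per bucket under the map tab’s style options.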

So you can see that the total capacity for Swansea and Bedfordshire has been exceeded by 50% or more. As it happens, both counties only have one prison, so those two are genuinely quite bad. The numbers are fairly small (we’re talking around 400 prisoners), but they are still over capacity.

Lessons learned…

  • Make sure you figure out the right way to visualise something before you embark on something that will take you hours to do
  • Use something easy
  • I need to figure out how to get borders to work

What next?

I’m not happy that the info is *markers* by county (so they’re super vague) instead of boundary-lines with colour, but I think the best thing for me to do is to take the data back to how it was, scrape the actual prison addresses from somewhere, and plot them exactly where they are. I think that will give a much clearer picture of what’s going on. When I’ve done that I’ll publish and link it properly so you can have a play around.


Do we have a better way of measuring engagement online?

There are loads of tools that help you measure social media/community engagement online right now. Clearly this is something that people are becoming more and more concerned about; who is looking at, reading about, writing about our product? Who is sharing our news stories? Why?

I just can’t help but notice that so much of it is about numbers.

Klout scores

Klout assesses your engagement (supposedly) and gives you a number, ‘ranking’ how much you influence people in your circle, i.e. how much you can get them to engage with things you write about or point out online. I’ll be honest; I just don’t understand the reasoning behind pinning a number to your ‘value’ as an engaging user. What does the number 46 tell me about you? Bugger-all. Are Klout scores merely another vanity tool, a way to tell you you’re great… or a way to tell other people that you’re great?*


I use Hootsuite a lot (though I am going to move to for various reasons) to schedule, since Tweetdeck was bought by Twitter and fell off a very high cliff. I like Hootsuite – I’m a free user, and I get detailed analytics about which articles do well, where people click from, etc – but again my issue is that these are all numerical values. They don’t tell me much about the intention behind the action.

Did someone click my link by accident? Did they click it because they saw someone else retweet it? How many degrees is this person from me? Are they glad they clicked it, or annoyed because it wasn’t what they expected or wanted? The answers to those questions would perhaps be more interesting. It would be useful to know whether people click directly from my page or from other people’s retweets, because if people click through retweets, that makes a retweet – through people you already know – arguably more valuable than the number of followers you have. Similarly, it’d be much easier to ‘do social media right’ if I knew how successful I was at pointing people to what they wanted to read. What’s the point in having followers who don’t click links or engage with you?

Why use quantitative data to measure actions?

It’s easy. I get it. But really, what does having 50,000 followers tell me about someone? That they’ve been on the site for long enough, that they are a nice person, that they are useful? Perhaps. But that doesn’t mean that someone with 500 followers isn’t also all of those things.

The key thing is: engagement has intention and meaning beyond numbers. I wonder how many people click on Daily Mail links because they don’t like it and they want to hate-read. How many people tweet along with the #XFactor hashtag but love to hate it? Numbers just don’t cut it in these instances.

What about Facebook? Facebook Page Insights gives us stats about users ‘talking about’ a page and users ‘engaging with’ a page (both in terms of who has seen it, and who has liked, commented or shared). I have no idea if people are talking about my page in a positive way. If they are talking negatively, there’s not a lot I can do to resolve that person’s negative perception of my page. (In fact, I have yet to find a way of actually seeing where people are talking about my page, but that’s a different issue.)

My answer to all of these questions would be that analysing engagement in a way that takes into account intention or meaning would be brilliantly useful for social media/communities editors.

Of course, this analysis is already done – by people employed by brands or PR companies to monitor social media for mentions of the brand. Is there a brand with so much action on social media that it cannot cope? I don’t know. But it would be a lot easier to automate it.

I’m not arguing that current analytical tools are not useful or should not be used. Not by a long shot. Some of them are really sophisticated (and I’m aware as a free/non-corporate user, I don’t have full access to them) and great for measuring your success on social media. I would just be wary of drawing too many absolute conclusions about what those figures really mean when we have so little information about what’s behind them.

Automating qualitative analysis of engagement

People are currently trying to fix this very problem with comments on articles. In fact, recently, there was a Hack Day on ‘re-imagining comments’ which a team from The Times won. The ideas to come out of that are definitely interesting. These take into account either an extra layer of ‘moderation’ whereby communities or staff label people as useful/experts, or an analysis of the words used in the comment.

So when you comment “this is great, I agree”, it’s flagged up as a positive comment; “terrible article” is clearly a negative one. It’s definitely a move closer to the kind of thing I’m talking about, but there are some obvious issues especially when it comes to exact words that could mean something very different out of context. Instead of searching for words or phrases in isolation, the whole sentence/paragraph needs to be contextually analysed. Now it gets difficult… right?
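The keyword approach described above – and its failure mode – is easy to demonstrate. A deliberately naive sketch (the word lists and `naive_sentiment` helper are my own invention, not any real tool’s method):

```python
# Toy keyword lists; a real system would need far more than this.
POSITIVE = {"great", "agree", "love", "useful"}
NEGATIVE = {"terrible", "awful", "hate"}

def naive_sentiment(comment):
    """Classify a comment by counting keyword hits, ignoring all context."""
    words = set(comment.lower().replace(",", " ").split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

It gets the easy cases right – “this is great, I agree” comes out positive and “terrible article” negative – but “not great at all” is also scored positive, because the keyword match sees “great” and is blind to the negation. That is exactly the out-of-context problem: the whole sentence has to be analysed, not the words in isolation.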

I like coding now and again, but I’ve never touched social media APIs and I wouldn’t have a clue about where to start. Maybe the next Hack Day could be about re-imagining analytical tools?

If I’ve missed an important analytical tool which does take into account semantics, feel free to tweet or leave a comment. I’m very interested in trying it out!

*If you’ve found a use for Klout, again… tweet me, because I cannot for the life of me understand how it’s useful.
