
Behind the story, part one: scraping with OutWit

Recently I was working on a couple of things for the Hackney Post, including this interactive map of entertainment venues in Hackney. I wanted to get the addresses, verify they were in Hackney, and then turn them into a map. You could, of course, do all of this by hand, or you could semi-automate the process using scrapers and a few different tools.

The tutorial covers the following:

Part one: scraping with OutWit Hub
Part two: cleaning and verifying data
Part three: creating a map with Google Fusion

Scraping a single page

I used several different sources, but for the sake of brevity, the page we’re going to scrape from is the Independent’s directory of art galleries in Hackney.

Open up OutWit Hub (download it, or add the Firefox extension), paste the URL into the address bar (just like a normal browser) and make sure you’re on the “Page” tab at the very top. Find the first instance of what you want to extract. In this case, we want the venue name and the address, and the first venue is “Fortescue Avenue”.

We’re ready to scrape. Click on the ‘scrapers’ tab under ‘automators’ and click ‘New’, then call your new scraper IndyDirectory.

Then press Ctrl-F (or Cmd-F if you’re on a Mac) and start typing “Fortescue Avenue”, because that’s what we need to find. OutWit will scroll down to the first instance of it.

You’ll see there are actually three places where “Fortescue Avenue” appears. One is in the HTML of the link itself (in the title attribute after the <a href tag), one is the link text (between the opening and closing <a> tags), and one is part of the address. We’ll get the address in a minute. The most important thing is to find a pattern before and after the text you need: one that repeats for every venue name, but which won’t pick up extra information you don’t want.

You can see that the text around the first instance is the one we want. It starts title=" and ends - art galleries in Hackney. This is perfect. Now put those into your ‘marker before’ and ‘marker after’ boxes, and name the column ‘Name’. Each row in the scraper becomes a column in your eventual spreadsheet. See below; that’s what you should now have in the bottom window.

[Screenshot: the IndyDirectory scraper with the ‘Name’ markers filled in]

Now onto the address. We only need the postcode, but we can sort that out later in Excel. This bit is even easier: each address (scroll down to double-check) has a <p class="adr"> before it and a </p> after it. Put those values in a new row’s ‘marker before/after’ boxes, and you’ve finished the scraper.
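If you’d rather see the marker idea as code, here’s a rough Python sketch of what the scraper is doing. The HTML below is invented to match the structure described above (the real page’s markup and postcodes aren’t reproduced in this post), but the ‘marker before/after’ pairs are the ones from the scraper:

```python
import re

# Stand-in HTML shaped like the directory page described above;
# the venue details here are invented for illustration.
html = '''
<a href="/venue/1" title="Fortescue Avenue - art galleries in Hackney">Fortescue Avenue</a>
<p class="adr">Fortescue Avenue, London E8</p>
<a href="/venue/2" title="Example Gallery - art galleries in Hackney">Example Gallery</a>
<p class="adr">Regent Studios, London E8</p>
'''

# The same "marker before" / "marker after" pairs as in the OutWit scraper:
# text between title=" and - art galleries in Hackney, and between the adr tags.
names = re.findall(r'title="(.*?) - art galleries in Hackney', html)
addresses = re.findall(r'<p class="adr">(.*?)</p>', html)

print(list(zip(names, addresses)))
```

Each `re.findall` plays the role of one row in the OutWit scraper, with the non-greedy `(.*?)` capturing whatever sits between the two markers.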

Now, check it works. Click “Execute” at the top of the bottom window.

It should have come up with 20 rows of neat venue names and addresses, like this:

[Screenshot: 20 scraped rows of venue names and addresses]

Now, this isn’t the end of the directory: there are over 100 items in the Independent’s list, spread out over a few pages. You could go into each page and scrape it as we just did the first one, but why do that when you can automate this process too?

Scraping from multiple pages

If you go back to your normal browser and click on a different page at the bottom of the directory list, you’ll notice in the address bar that the difference between the pages is a number appended to the original URL, and it goes up to 7. Click on page 2 and copy the URL in the address bar.

Open Excel and start a new workbook. Paste the URL for page 2 into cell A2, then copy it into cell A1 and change the 2 at the end to a 1, so that A1 holds the first page’s URL. Then highlight both cells and drag the fill handle down until Excel automatically fills cell A7 with the URL ending in 7. It should look like this:
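The drag-fill trick is just generating seven URLs that differ only in the final number. For the code-minded, the same list comes out of one line of Python; the base URL here is a placeholder, since the real directory address isn’t shown in this post:

```python
# Placeholder base URL -- the real directory address isn't shown in the post.
base = "http://www.example.com/directory/art-galleries-hackney/"

# Pages 1 to 7, matching the drag-filled column in Excel.
urls = [base + str(page) for page in range(1, 8)]
for url in urls:
    print(url)
```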

[Screenshot: column A filled with the seven page URLs]

Make sure that the column is wider than the text in the column, otherwise the next part won’t work!

Next, save it as a web page. Depending on which version of Excel you have, that’s either File > Save as Web Page, or File > Save As… and then choosing Web Page from the drop-down. Call it something like “INDYURLs”.

Go back to OutWit, click File > Open… and choose the file you just saved. It will come up with an error, but that’s OK. To access the URLs, click on Tables under Data on the right. It should look like this:

[Screenshot: the list of URLs in OutWit’s Tables view]

Select all of the rows (click on one, then Ctrl/Cmd-A), right-click, go to Auto-explore, then Fast Scrape, then pick IndyDirectory. Within a few seconds it should have finished scraping and tell you that you have 139 scraped rows (more than we need, but never mind; we’ll sort that out in the next tutorial).
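Fast Scrape is simply applying the same markers to every URL in the table and pooling the results. A hedged Python sketch of that loop, using invented stand-in pages (in practice each page’s HTML would be fetched from its URL with something like urllib or requests):

```python
import re

def scrape_page(html):
    """Apply the IndyDirectory markers (name and address) to one page of HTML."""
    names = re.findall(r'title="(.*?) - art galleries in Hackney', html)
    addresses = re.findall(r'<p class="adr">(.*?)</p>', html)
    return list(zip(names, addresses))

# Invented stand-in pages; in practice each string would be fetched
# from the directory URL plus its page number (1 to 7).
pages = [
    '<a title="Fortescue Avenue - art galleries in Hackney">x</a>'
    '<p class="adr">London E8</p>',
    '<a title="Example Gallery - art galleries in Hackney">x</a>'
    '<p class="adr">London N16</p>',
]

# Pool the rows from every page into one list, like Fast Scrape does.
all_rows = []
for html in pages:
    all_rows.extend(scrape_page(html))

print(all_rows)
```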

Select all, then right-click and choose Export As… I usually export as either CSV or Excel; CSV is the safest option, as it’s more basic and doesn’t carry any formatting.
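The export step is equally simple in code: Python’s csv module writes exactly that kind of plain, formatting-free file. The rows below are invented examples in the (name, address) shape the scraper produces, and the filename is just a suggestion:

```python
import csv

# Invented example rows in the (name, address) shape the scraper produces.
rows = [
    ("Fortescue Avenue", "Fortescue Avenue, London E8"),
    ("Example Gallery", "Regent Studios, London E8"),
]

# newline="" stops the csv module writing extra blank lines on Windows.
with open("indy_directory.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Address"])  # header row
    writer.writerows(rows)
```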

There’s your first scraper done! In the next tutorial, we’ll clean up the data and verify the postcodes.


