Internet at paulcarvill.com, the home of Paul Carvill on the web

link: paulcarvill at flickr

paulcarvill.com

Hi, I'm Paul Carvill and I'm a web developer. I am Head of Interface Development at LBi, Europe's largest digital agency.

I also like walking, cooking, Bollywood and rock 'n' roll.

Archive for the ‘Internet’ Category

Using Greasemonkey to Pimp the Telegraph’s MPs’ Expenses Data

Monday, July 13th, 2009

You may have noticed that the Telegraph have some useful MPs’ expenses data on their site. To go with it they’ve also got some nice graphs, which show an MP’s historical expense claim data. The trend of those claims is, as you might expect, mostly upwards. No surprises so far. But, until recently, inflation has always been a positive number, so you would be probably be right to expect expense claims to rise in tandem with the prices of everyday goods and services.

Unpimped Telegraph MPs' expense data charts
Above: the unpimped image of Michael Ancram’s expenses data chart

But I don’t find the graphs useful enough. I can easily see the trend of the claims just by looking at the accompanying data table. What I really want to know is how closely the rises in MPs’ expenses matched the standard inflation rate. My hunch was that the rise in value of the MPs’ claims would be ahead of the inflation rate, so how best to check this? And how to check it for a wide number of claimants? In addition, I wanted to visualise a breakdown of the claim types – for travel, stationary, 2nd home etc.

Pimping the Telegraph's expense data charts
Above: an image of Michael Ancram’s expenses data chart following the addition of my Greasemonkey script.

I was interested enough to ask the Telegraph, via Twitter, if they had any plans to add some economic context to the data. They actually responded to let me know that the data was available in a spreadsheet, so why not go ahead and mash it up? Great! But on checking the spreadsheet I found that the data only went back to 2004, while the data on the Telegraph’s website went back to 2001. I’m aware that expenses data going back further are available from the Houses of Parliament website, but I didn’t really have to time to go searching for them, converting the format to what I needed etc. So I decided to write a quick Greasemonkey script to operate on the data already published by the Telegraph, on their website, which happened to be good enough markup to extract data from and use.

Greasemonkey is a Firefox plugin which enables you to write JavaScript scripts which can operate on the contents of a web page solely within your browser. Anything you do to the data does not alter the original in any way, and so it’s a great way of changing the way websites work, customising them to look how you like and generally mucking about with them.

Here’s some technical info on what I did:

First I checked the Office of National Statistics for the annual inflation data from the Retail Price Index, the broadest measure of inflation, which includes mortgage interest payments. I grabbed data for the years covered by the Telegraph data — 2001/02 to 2007/08 and added it to my script. Then I loaded in a couple of extra scripts which help to grab the data from the page, draw a new graph add it to the page in place of the original.

I extract the data from the MP’s expenses table, which is actually the same way the Telegraph are drawing the graph in the first place. The difference is that I extract each row of data rather than just the totals. I draw all the data to the graph, then add an extra line which extrapolates the MP’s first year total claim across the whole term of the graph, based on the inflation data for the subsequent years. Finally, I added a series of checkboxes to isolate different data sets within the graph, so you can compare, say, just an MP’s claims for food against her claims for stationery.

The script was fairly quickly put together, so if you’re handy with a bit of JavaScript you might be able to think of something I haven’t. For example I’d like to see a comparison of MPs’ claims against nursing or teaching salaries. Even if you just want to hack around you can download it and amend it as much as you like without any danger. But here’s the really great thing — I’ve actually maintained most of the structure of the Telegraph’s original graph code, so if they want to take my script back and use it on their page they’re free to do so, and it should work perfectly!

Was this hack useful? Yes. For example, it allowed me to see that Ian Austin, Labour MP for Dudley North, made below-inflation claims rises for 2006/07 and 2007/08. And that Labour MP for Leeds North East Fabian Hamilton’s 2007/08 total claim (£170,794.00) was almost double what it would have been if his first recorded claim (£75,437.00) from 2001/02 had risen in line with inflation.

You can download my Greasemonkey script here. If you don’t want to do that (or your IT department won’t let you) you can see from the images above how the script operates on the original website.

Some notes about using this script in Greasemonkey:

I wrote this script using Greasemonkey 0.8.2. As of version 0.8 you can use the @require metadata key to load JavaScript libraries into your user-script. Greasemonkey will download those scripts to your local directory, meaning your browser isn’t increasing network traffic every time you run the script. If for any reason you need to use an earlier version of the plugin you’ll find some useful information here on loading in scripts without the @require key.

We don’t want the script to execute except when we’re looking at a Telegraph MPs’ expenses page, so for best performance you should set this script to run on the following URLs:

http://parliament.telegraph.co.uk/mpsexpenses/expenses/*
http://parliament.telegraph.co.uk/mpsexpenses/expense-microsite/expenses/*

Some good links to try with the script installed:

My India travel diary for 2007 and 2008, now online

Sunday, June 14th, 2009

indadiarythumb1I finally got around to typing up my handwritten diaries from my India trip back in 2008. It was a hugely enjoyable exercise reading through the diaries again 12 months later — probably the reason it took me so long to plough through them.

Why put them online? A couple of reasons. I wanted to be able to access them quickly wherever I was, as I often find myself talking to someone about a particular place or event in India that I want to be able to show them a more detailed description of, or sometimes just to remember the name of a hotel to recommend to someone. I also thought they might be useful for other travellers considering a trip to India. Before we went, other than IndiaMike I couldn’t really find any useful ‘on the ground’ reports of day to day travelling around India. In some ways this was a good thing, as we tend to travel extremely independently and this allowed us to travel without any preconceptions. But some people might feel they want to get a flavour of a place before they get there. Also, and probably the most pressing reason, I wanted another nice, simple idea to practice my Django development on.

So I put the new site here: http://www.indiadiary.co.uk. Please check it out and let me know what you think. I’ve included photos of the trip from Flickr and a recommended reading list of the books we went through as we travelled around.

Geocoding location data in a Google spreadsheet

Wednesday, June 3rd, 2009

The problem: I have a spreadsheet full of locations, addresses and place names that I want to publish, along with a map, for at least tens of thousands of people to view.

A solution: Easy — I can put it in a Google spreadsheet, publish it, add a Google map to a page, download the data, geocode the locations and display them on the map.

Another problem: While this is ok in most cases, with a large spreadsheet the geocoding can take a very long time, making my page appear unresponsive and slow. In addition, I have no way of checking that the location data is good enough to map with.

Another solution: Download the data, geocode it using Yahoo!’s Placemaker service, generate a new spreadsheet containing accurate latitude/longitde data and use that in place of the original. The client then does no geocoding their side, it’s all supplied along with the data. Everybody’s happy!

— Go straight to the spreadsheet geocoder! —

I’ve done just that with this PHP script. It takes a Google spreadsheet key, and you must tell it what columns your location data is in. It will download the spreadsheet data, concatenate those location columns, make a request to Placemaker to geocode each location, and return a new CSV file with the geodata columns appended on the end.

I’ve detailed here the various bits that make up the script. The workflow is as follows:

Capture spreadsheet data from user > Load in spreadsheet from Google > For each line in spreadsheet make a Placemaker request > Append geolocation data columns to spreadsheet > Output all results into a CSV file

The script is set to not autodisambiguate, meaning that if it’s not sure what location you’ve supplied, it will return all likely candidates, in order of likelihood. I should mention that Yahoo!’s Placemaker is utterly awesome in find out the ‘whereness‘ of things.

To build your own version of my script will need a Placemaker API key. Other than that, please feel free to copy and paste the code, fix it, amend it and let me know if it’s useful, or if it needs more commenting, or how I could improve it. I wrote this code to fix a particular problem I was encountering, but I’m sure it could work in a few more cases too.

Something to note before I start: the script doesn’t much like having commas in the location data in your spreadsheet. Because Google only output CSV with a comma delimiter, this upsets my CSV parsing. Any suggestions welcome.

This function gets some CSV data from a published Google spreadsheet using a supplied key:


<?php

function getCsvDataFromGoogle($spreadsheetKey) {
	$key = $spreadsheetKey;
	$output = 'csv';
	$apiendpoint = 'http://spreadsheets.google.com/pub?key='.$key.'&output='.$output;
	$ch = curl_init();
	$options = array(CURLOPT_URL => $apiendpoint,
	                 CURLOPT_HEADER => false,
	                 CURLOPT_RETURNTRANSFER => true
	                );
	curl_setopt_array($ch, $options);
	$r = curl_exec($ch);
       curl_close($ch);
	return $r;
}

This function makes a Placemaker geocode request:


function getPlacemakerGeodata ($location) {
	$key = 'MY_PLACEMAKER_API_KEY';
	$apiendpoint = 'http://wherein.yahooapis.com/v1/document';
	$inputType = 'text/plain';
	$outputType = 'xml';
	$focus = '28298150'; // sets focus to Great Britain, not sure how effective this is yet
	$autoDisambiguate = 'false'; // returns the 1 most-likely place, else returns many likely places
	$post = 'appid='.$key.'&documentContent='.$location.'&documentType='.$inputType.'&outputType='.$outputType.'&focusWoeid='.$focus.'&autoDisambiguate='.$autoDisambiguate;
	$ch = curl_init($apiendpoint);
	curl_setopt($ch, CURLOPT_POST, 1);
	curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
	$results = curl_exec($ch);
	return $results;
}

This function does the bulk of the work, and makes calls to all the other functions:


function parseCsvData($googleSpreadsheetKey) {
	$lines=split( "\n", getCsvDataFromGoogle($googleSpreadsheetKey) );
	if($_POST['format'] == 'csv') {
		if($_POST['locationColumns'] == '' || $_POST['key'] == '') {
			echo "please go back and specify both your google spreadsheet key and which columns contain your location data (in comma separated format, zero-indexed e.g. 0,1,9)";
			exit();
		}
		else {
			// get location columns from url
			$locations = $_POST['locationColumns'];
			$splitLocations = split(',', $locations);
			// set headers to 'csv'
			header("Content-type: application/csv;");
			header("Content-Disposition: attachment; filename=yourgeodata.csv");
			$out = fopen('php://output', 'w');
			for($i=1;$i

This function parses the XML which gets returned from Yahoo! Placemaker:


function parsePlacemakerXML($results, $delineator) {
	if($delineator == 'comma') { $delStart = ''; $delEnd = ','; }
	else { $delStart = '<td>'; $delEnd = '</td>'; }

	$places = simplexml_load_string($results, 'SimpleXMLElement', LIBXML_NOCDATA);
	$locarr = array();
	if($places->document->placeDetails) {
		foreach($places->document->placeDetails as $p) {
			if($delineator == 'comma') {
				$locarr[] = $p->place->name;
				$locarr[] = $p->place->centroid->latitude;
				$locarr[] = $p->place->centroid->longitude;
				return $locarr;
			}
			else {
				echo $delStart.$p->place->name.$delEnd;
				echo $delStart.$p->place->centroid->latitude.$delEnd;
				echo $delStart.$p->place->centroid->longitude.$delEnd;
			}
		}
	}
}

This bit runs when you load the page and works out if you're submitting some data or just viewing the page. If you've submitted data, it runs the main function:

if(ISSET($_POST['submit'])) {
	parseCsvData($_POST['key']);
}

Or if you're viewing the page for 1st time, you get a form to fill out:

else {
	echo "<html><head><title></title></head>";
	echo "<body>";
	echo "<p>Please enter your spreadsheet key and specify which columns contain your location data (use comma separated list e.g. 9,10,11):</p>";
	echo "<form method=POST><p><label>Key:<input type='text' name='key' /></label></p><p><label>Location columns: <input type='text' name='locationColumns' /></label></p><p><label>Format: <select name='format'><option value='csv'>csv</option><option value='table'>table</option></select></label></p><p><input type='submit' name='submit' /></p></form>";
	echo "</body></html>";
}
?>

HTML for humans

Saturday, April 18th, 2009

The most recent Road to HTML5 blogpost is on the subject of link relations. To sum up in one sentence, link relations ‘explain why you’re pointing to another page’. But that is to miss all the detail and nuance of the article, so go there and read it if you’re at all interested in web development.

The whatwg blog continues to be a friendly, highly readable and discursive source of information on upcoming HTML5 developments. It also clarifies many misconceptions you may have had about previous HTML specs and the reasoning behind any changes, often in a satisfyingly wry style. On the subject of the ‘rev=made’ attribute it has this to say,

“The decision to drop the rev attribute [from HTML5] seems especially controversial. The same question flares up again and again on the working group’s mailing list: “what happened to the rev attribute?” But in the face of almost-universal misunderstanding (among people who try to use it) and apathy (among everyone else), no one has ever made a convincing case for keeping it that didn’t boil down to “I wish the world were different.” Hey, so do I, man. So do I.”

http://blog.whatwg.org/

Why people should build their own URL shorteners

Sunday, April 5th, 2009

Lots of blogposts regarding the evil that URL shortening services do appeared this weekend, from Jason Kottke (“URL shorteners suck”), Joshua Schachter, creator of Delicious (“on url shorteners”) and Dave Winer (“Josh is right, URL shorteners are risky”).

None of them are happy, with concerns, variously, about spam, speed, efficiency, transparency, longevity and what Joshua calls “the great linkrot apocalypse”.

I think the one idea that we should take from all of these arguments is that if you are using a URL shortening service you have no control over your links, now or in the future. One such service may get bought up by a nasty commercial entity who redirects all your existing short URLs to its own ends. Your URLs pointing to all your lovingly curated content will, effectively, become spam.

One solution to this quandry, which no one has mentioned, and one which large media organisations should pay attention to, is that building a URL shortener is really, really easy. I spoke to Simon Willison about it and he thinks a URL shortener will soon be the example app that someone builds to learn their way around a new language or web framework, like they currently do with a creating a blog. That way you get to solve many problems at once: control over your own links’ destiny and complete consumer confidence that your own-brand short URLs (gu.com/abcde, nyt.com/vwxyz) won’t take them someone nasty.

And the ability to shorten your own URLs isn’t necessarily restricted to large companies with lots of resources. Many people who want to use this sort of service already have all the tools they need — their blogging software. All that Movable Type or Wordpress, among others, need to do is add an extra database lookup table and pretty soon all their users can take care of their own URL shortening needs.

UPDATE: A useful comparison of existing URL shortening services (even the method of redirection matters: 301 is a permanent redirect, 302 a temporary one, with implications for SEO and link credit).

How to do music lists

Monday, March 16th, 2009

How to do a list of songs on a newspaper or magazine website:

And how not to:

  • The Telegraph’s 100 Greatest Songs Of All Time — in which the adjudicating panel of one – Neil McCormick – hilariously abandons grammar in favour of enigmatic SMS-length capsule reviews. Sample description of The Doors’ Light My Fire, “Provocative, sensual, slinky song weaving erotic desire.” And another one of U2’s Still Haven’t Found What I’m Looking For: “Gospel rock hymn of doubt and spiritual quest.”
  • Esquire’s 50 Songs Every Man Should Be Listening To — let me get this straight: you want me to click through at least 50 times, even more with ads, and if I haven’t heard the band you’re talking about the onus is on me to wade through the internet to find one of their songs to listen to? OK, thanks.

More coverage of the Guardian’s Open Platform API

Thursday, March 12th, 2009

Some more links to coverage of the Guardian’s Open Platform API announcement (some of these found via blogs.journalism.co.uk, thanks):

What people are saying about the Guardian’s Open Platform

Wednesday, March 11th, 2009

I’ve collected together some of the recent articles and blogposts about yesterday’s announcement of the Guardian’s Open Platform strategy and, below, some of the first apps that people have built:

And here are some of the first apps that people have built using the Open Platform API:

Guardian Open Platform

Wednesday, March 11th, 2009

There was exciting news yesterday morning, when we announced the next stage of the Guardian’s stated strategy to be the world’s leading liberal voice. The Guardian is opening out — making our content available for other people to use — and also opening in — allowing developers to build on our platform and deploy applications which extend its functionality.

So, the headlines from the announcement are:

  • Open Platform API
    • search, query, filter and discover content, keywords and tags from the Guardian’s archive
    • contains full textual content of all Guardian articles going back to 1999
    • currently in private beta (apply for a key)
    • free for the first 5000 queries per day
    • can be used for commercial purposes (you can make money by running ads with it)
    • it will at some point in the future be ad-supported on pages using the full content
  • Data Store
    • a curated collection of data sets
    • researched, verified and attributed to its source
    • hosted on Google Docs and free to use
    • covering subjects such as diverse as US economic data, environmental statistics, crime figures and religious information
  • Data Blog
    • accompanies the Data Store
    • will provide information around the raw data: how we sourced it, why we use that particular data set, what the information might mean

This constitutes a wealth of information to announce in one go, and it may take people some time to digest it all. The really exciting thing about this move is that we’re putting the full content of our articles out there for people to use. The implications for data mining, linguistic research and deep textual comparisons are endless, and I’m really looking forward to seeing what people come up with. Having context to the data is really important, so people can do much, much more than just link back to our site using a headline or an excerpt.

The Data Store is also a really bold move. Simon Rogers, one of our News Editors, and the journalists here put amazing amounts of effort into research, and here we are returning the fruit of their labour into the community. Of course, we use that research to report and editorialise, but here we give you the opportunity to derive your own patterns and meaning from the same data. The fact that this stuff has been manually sourced, collated and published makes it mean so much more, and I’m sure people, including other journalists, will find it an increasingly useful source of information for years to come.

I’ve collected some useful links here which are specifically related to the Open Platform:

I’ve also collected some of the news coverage and blogposts about the announcement here:

Forking hell! OSX, PHP, GD, Freetype problems? Read this…

Sunday, November 16th, 2008

So, I was trying to make a set of Moo cards, using the MOO API, as part of The Guardian’s first ever Hack Day. It’s very easy and fun to use, and I enjoyed the learning process of formatting the images and data and submitting the constructed XML to MOO to print the cards. But…

But, the formats available from MOO are quite restrictive. This is understandable, as they want to retain some control over quality and their own branding, which is held in high esteem. For example, you can only ever put an image on the front of the card, and text on the back. I wanted image and text on the front.

I was using PHP to create the XML to postto MOO, so now I needed to learn how to use ImageMagick to merge some text into the image I was using. Unfortunately I’m not a command line geek, so I tend to get stuck when someone tells me to compile PHP. Luckily, someone was on hand to help me install GD, which is considerably easier to use.

I used GD to merge text into the image using imagestring. But I wasn’t able to successfully specify the fonts to use – every image was rendered with the default system font. GD wouldn’t work. Then I tried using imagettftext. This resulted in a blank page. I was using GD 2.3.5 on PHP 5. Eventually I found a link which explains a problem with Apple’s default implementation of Freetype in GD that crashes GD if you try and specify a font.

The result? I installed Macports, and updated PHP and Apache that way, resulting in a new install with GD 2.3.7

And it’s all working now! Now I’ve gone over the problems, I’ll post a bit more about actually creating the cards next.