thedoedoeblog

Musings of a small game development team

Scraping iTunes: Setup

Written by Bill Soistmann on March 17, 2009 at 8:35 pm

With the release of Bailout America we wanted to find a way to display the game’s App Store rating on the website without editing it manually, so we decided to scrape the information from iTunes automatically. I figured if I could pull the number of stars as a numeral I could iterate and build out the star images.

What I’ve ended up with is much more, and I thought someone else might benefit from it. Over the next several days I will post details about how I arrived at our current version, as well as some ideas on how you can put it to use yourself. Today we will start with the basics: how to scrape the iTunes store.

The first step is to decide what data we need and on what page we can find it. We launch iTunes and browse to our application. We find the link that reads 4 Reviews for all versions and click it, which takes us to a page that shows the number of stars near the top right-hand side. If we can grab that, we’ll be in business. So, we click the back button, find that link again, control-click on it and copy the iTunes URL.

[Screenshot: Copy URL]

Then we launch a text editor, paste in the URL, and note that the important part is the application ID, which you find in the query string:

http://itunes.apple.com/WebObjects/MZStore.woa/wa/viewContentsUserReviews?id=304570595&pageNumber=0&sortOrdering=1&type=Purple+Software
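If you would rather not pick the ID out by eye, a few lines of Perl can do it. This is only a minimal sketch: the function name app_id_from_url is made up for illustration, and it assumes the ID always arrives as a plain id=... pair in the query string, as it does in the URL above.

```perl
use strict;
use warnings;

# Return the value of the "id" parameter from an iTunes URL's query string,
# or undef if the URL has no such parameter.
# (app_id_from_url is a hypothetical helper name, not part of any library.)
sub app_id_from_url {
    my ($url) = @_;
    # Everything after the first "?" (up to any "#") is the query string.
    my ($query) = $url =~ /\?([^#]*)/ or return;
    for my $pair (split /[&;]/, $query) {
        my ($key, $value) = split /=/, $pair, 2;
        # Compare the key exactly so that e.g. "width=3" is not mistaken for "id".
        return $value if defined $key && $key eq 'id' && defined $value;
    }
    return;
}

my $url = 'http://itunes.apple.com/WebObjects/MZStore.woa/wa/viewContentsUserReviews'
        . '?id=304570595&pageNumber=0&sortOrdering=1&type=Purple+Software';
print app_id_from_url($url), "\n";   # prints 304570595
```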

Now we use curl and see what kind of data we get from iTunes. We will use two command line options to ensure we can grab the data:

  1. -A to set the User-Agent. I use iTunes/4.2 (Macintosh; U; PPC Mac OS X 10.2)
  2. -H to send another header, X-Apple-Store-Front: 143441-1. Here 143441 is the country code for the U.S.

curl -A "iTunes/4.2 (Macintosh; U; PPC Mac OS X 10.2)" -H "X-Apple-Store-Front: 143441-1" 'http://ax.phobos.apple.com.edgesuite.net/WebObjects/MZStore.woa/wa/viewContentsUserReviews?id=304570595&pageNumber=0&sortOrdering=2&type=Purple+Software' > ~/Desktop/foo.html

Now we open ~/Desktop/foo.html in our favorite text editor and take a look at lines 667 - 674:

<TextView topInset="0" truncation="right" leftInset="0" styleSet="basic13" textJust="left" maxLines="1"><SetFontStyle normalStyle="textColor">Average rating for all versions: </SetFontStyle></TextView>

 <View alt="" stretchiness="1"></View>
</VBoxView>
   <VBoxView alt="">
 <View alt="" stretchiness="1"></View>

     <HBoxView rightInset="6" alt="4 stars" leftInset="6">

What we are after is the part that reads 4 stars on line 674. We could use regular expressions or some other technique to pull out what we need. The easiest thing in this case is to split the file up. Let’s use Perl for this. First we call curl as we did before and assign the output to a variable using backticks. We will also throw in the -s option to suppress the progress indicators:

my $riz = `curl -s -A "iTunes/4.2 (Macintosh; U; PPC Mac OS X 10.2)" -H "X-Apple-Store-Front: 143441-1" 'http://ax.phobos.apple.com.edgesuite.net/WebObjects/MZStore.woa/wa/viewContentsUserReviews?id=304570595&pageNumber=0&sortOrdering=2&type=Purple+Software'`;
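As an aside, backticks hand the whole string to the shell, which is fine for a fixed command but gets risky once any part of it comes from outside. One alternative is the list form of a piped open, which passes each argument to curl directly. The sketch below assumes a Unix-like system with curl on the PATH; the names curl_args and fetch_reviews are invented for this example.

```perl
use strict;
use warnings;

# Build the curl argument list for one app's review page.
# $store defaults to 143441, the U.S. code used in this post.
# (curl_args and fetch_reviews are hypothetical helper names.)
sub curl_args {
    my ($app_id, $store) = @_;
    $store = 143441 unless defined $store;
    die "app id must be numeric\n" unless defined $app_id && $app_id =~ /\A\d+\z/;
    die "store must be numeric\n"  unless $store =~ /\A\d+\z/;
    return (
        'curl', '-s',
        '-A', 'iTunes/4.2 (Macintosh; U; PPC Mac OS X 10.2)',
        '-H', "X-Apple-Store-Front: $store-1",
        'http://ax.phobos.apple.com.edgesuite.net/WebObjects/MZStore.woa/wa/'
          . "viewContentsUserReviews?id=$app_id&pageNumber=0&sortOrdering=2&type=Purple+Software",
    );
}

# Run curl without going through the shell and return the whole response.
sub fetch_reviews {
    my @cmd = curl_args(@_);
    open(my $fh, '-|', @cmd) or die "cannot run curl: $!\n";
    local $/;              # slurp mode: read everything in one go
    my $page = <$fh>;
    close $fh;
    return $page;
}

# Show the command that would be run (no network access here).
print join(' ', curl_args(304570595)), "\n";
```

Calling fetch_reviews(304570595) would then play the same role as the backtick line above.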

Then we split using the section Average rating for. Some applications will have Average rating for all versions: and others will have Average rating for the current version: so by using just Average rating for we will get what we want in all cases:

$riz = substr($riz, index($riz, 'Average rating for', 0), 300);

So, we split it and keep the second half. Then, we split again just before the 4 by using rightInset="6" alt=" and keep the second half; the +20 skips over the 20 characters of that marker. Next, we split at a simple ", keep the first half and then grab the first character:

$riz = substr($riz, index($riz, "rightInset=\"6\" alt=\"", 0)+20, 100);
$riz = substr($riz, 0, index($riz, '"', 0));
my $result = substr($riz, 0, 1);

The only thing to do now is look for any half stars. If this app had 4½ stars, we would have found 4 and a half stars so we need to check for that and then print the result:

if(index($riz, 'and a half', 1) > 0){$result += 0.5;}
print $result;
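For comparison, here is roughly what the regular-expression route mentioned earlier could look like. It is a sketch only: stars_from_page is a made-up name, the sample text is trimmed from the lines shown above, and it assumes the alt text always reads like 4 stars or 4 and a half stars. Unlike the substr version, it does not limit the search to 300 characters after the label.

```perl
use strict;
use warnings;

# Find the rating box that follows the "Average rating for" label and turn
# its alt text ("4 stars", "4 and a half stars") into a number.
# Returns undef when no rating is found.
# (stars_from_page is a hypothetical helper name.)
sub stars_from_page {
    my ($page) = @_;
    return unless $page =~ /Average rating for.*?rightInset="6" alt="(\d)( and a half)? stars?"/s;
    return $1 + (defined $2 ? 0.5 : 0);
}

# Trimmed copy of the lines we looked at in foo.html.
my $sample = <<'XML';
<SetFontStyle normalStyle="textColor">Average rating for all versions: </SetFontStyle></TextView>
 <View alt="" stretchiness="1"></View>
</VBoxView>
   <VBoxView alt="">
     <HBoxView rightInset="6" alt="4 stars" leftInset="6">
XML

print stars_from_page($sample), "\n";   # prints 4
```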

Anytime we run the script we will get a numeral indicating the number of stars. We can throw in a bit more to generalize it so we can grab ratings for other apps or other stores. This should do the trick:

#! /usr/pkg/bin/perl -w
# Needs curl on the PATH (only the first #! line is honored, so curl cannot be named there).

use strict;
use warnings;

my $result = -1;
my $store;
my $coCode;
my $appId;
my $counter = 0;
my $currentSoftware;
my $temp;
my $flag = 0;

print "Content-type: text/html\n\n";
splitVars();
$currentSoftware = $appId;
$store = 143441;
getReviews();
print $result;

sub splitVars{
my $riz = $ENV{'QUERY_STRING'};
if($riz){
	# Break "id=...&country=..." into key=value pairs.
	foreach my $item (split(/&/, $riz)){
		my ($key, $value) = split(/=/, $item, 2);
		next unless defined $key && defined $value;
		# Match keys exactly and accept digits only for the id:
		# it is later interpolated into a shell command.
		if($key eq "id" && $value =~ /\A\d+\z/){
			$appId = $value;
		}
		elsif($key eq "country" && $value =~ /\A\w+\z/){
			$coCode = $value;
		}
	}
}
}

sub getReviews
{
    my $riz = "test";
    $riz = `curl -s -A "iTunes/4.2 (Macintosh; U; PPC Mac OS X 10.2)" -H "X-Apple-Store-Front: $store-1" 'http://ax.phobos.apple.com.edgesuite.net/WebObjects/MZStore.woa/wa/viewContentsUserReviews?id=$currentSoftware&pageNumber=0&sortOrdering=2&type=Purple+Software'`;

   $riz = substr($riz, index($riz, 'Average rating for', 0), 300);
   $riz = substr($riz, index($riz, "rightInset=\"6\" alt=\"", 0)+20, 100);
   $riz = substr($riz, 0, index($riz, '"', 0));
   if(length($riz) != 0){
	if($result == -1){$result = substr($riz, 0, 1);}
	else{$result += int(substr($riz, 0, 1));}
	if(index($riz, 'and a half', 1) > 0){$result += 0.5;}
	$counter++;
   }
}

Now we save it as getstars.cgi, make it executable, and put it in our webspace, and we’re all set. Request http://yourdomain.com/path/getstars.cgi?id=304570595 and we get the number of stars.
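Coming back to the original goal, turning that number into star images is a short step. The following is a rough sketch, not the markup any real site uses: the function name star_markup and the image file names star_full.png and star_half.png are placeholders invented for this example.

```perl
use strict;
use warnings;

# Turn a rating such as 4 or 4.5 into a row of <img> tags.
# (star_markup and both image file names are hypothetical.)
sub star_markup {
    my ($stars) = @_;
    my $full = int($stars);
    # One full-star image per whole star...
    my $html = '<img src="star_full.png" alt="">' x $full;
    # ...plus a half-star image when the rating ends in .5
    $html .= '<img src="star_half.png" alt="">' if $stars - $full >= 0.5;
    return $html;
}

print star_markup(4.5), "\n";
```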

Ignore the country code in the Perl for now. I will add more tomorrow to take advantage of it, and I didn’t feel like stripping it out for today’s post.

This is part of a series of posts which start here. The next post is here.
