Thursday, June 3, 2010

Parsing HTML with Java and the syntactic sugar of (jQuery-like) CSS selectors

In my last post I wrote about how to easily get a website's content using the HttpClient API wrapped by the fluent builder. In most cases you'll probably want to extract (scrape) some data from this content. There are several ways to do it in Java, beginning with a naive approach like

// ...
final String titleAnchor = "<h1 class='title'>";
final int pos = content.indexOf(titleAnchor) + titleAnchor.length();
final String blogTitle = content.substring(pos, content.indexOf("</h1>", pos));

through a pure DOM search using the classic getElementsBy* methods, up to more elegant solutions such as an HTML parser like JTidy combined with XPath to extract the data:

// ...
final XPathExpression expr = xpath.compile("//h1[@class='title']/text()");
final String blogTitle = (String) expr.evaluate(document, XPathConstants.STRING);
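
To make the comparison concrete, here is a minimal, self-contained sketch of the three approaches side by side. It is not the original code behind the fragments above: the class name TitleExtraction, its method names and the sample markup are made up for illustration, and it assumes well-formed XHTML so that the JDK's own DocumentBuilder can stand in for JTidy.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration only.
final class TitleExtraction {

    // Made-up, well-formed sample markup (real pages usually need tidying first).
    static final String SAMPLE =
            "<html><body><h1 class='title'>Example Blog</h1><h1>Other</h1></body></html>";

    // Naive approach: plain string search. Returns null if an anchor is missing.
    static String naive(final String content) {
        final String titleAnchor = "<h1 class='title'>";
        final int start = content.indexOf(titleAnchor);
        if (start < 0) {
            return null;
        }
        final int pos = start + titleAnchor.length();
        final int end = content.indexOf("</h1>", pos);
        return end < 0 ? null : content.substring(pos, end);
    }

    // Parses well-formed XHTML into a W3C DOM document.
    static Document parse(final String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    // DOM search: walk all h1 elements and check the class attribute by hand.
    static String dom(final Document document) {
        final NodeList headings = document.getElementsByTagName("h1");
        for (int i = 0; i < headings.getLength(); i++) {
            final Element h1 = (Element) headings.item(i);
            if ("title".equals(h1.getAttribute("class"))) {
                return h1.getTextContent();
            }
        }
        return null;
    }

    // XPath: one expression replaces the manual loop.
    static String xpath(final Document document) {
        try {
            return (String) XPathFactory.newInstance().newXPath()
                    .compile("//h1[@class='title']/text()")
                    .evaluate(document, XPathConstants.STRING);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(final String[] args) {
        final Document document = parse(SAMPLE);
        System.out.println(naive(SAMPLE));
        System.out.println(dom(document));
        System.out.println(xpath(document));
    }
}
```

Running main prints the same title three times. The string search stops working as soon as the markup uses double quotes or an extra attribute, while the DOM and XPath variants only depend on the document structure.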

Even if using XPath is a really neat approach, some of you, especially those familiar with jQuery, would like to use CSS selectors to extract data from an HTML page. Some months ago I thought about developing a Java API similar to jQuery's selectors API. To be sure not to reinvent the wheel I started some research and came across jsoup. That was it! After a short look into the project and its great documentation I fell in love with jsoup - at this point a big thanks to Jonathan Hedley for beating me to the punch ;-). Jsoup can parse HTML from a String, URL or File and provides "the best of DOM, CSS, and jquery-like methods" - and that's really right! Let's waste no time and look at a simple example:

final URL url = new URL("http://mastacode.blogspot.com");
final Document doc = Jsoup.parse(url, 1000);
final String blogTitle = doc.select("h1.title").text();

That's all! It's really simple, isn't it? :-) Btw, I prefer to use the HttpClient to get the content and then parse it with jsoup (e.g. because of the cookie / session management). If you like my fluent builder - and I hope you do ;-) - here is a Handlers class containing a jsoup ResponseHandler implementation:

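public final class Handlers {

    // Utility class holding constants only; not meant to be instantiated.
    private Handlers() { }

    // Shared handler instances; they are immutable, so they can be reused across requests.
    public static final ResponseHandler<Document> JSOUP = new JsoupHandler();
    public static final ResponseHandler<Document> JSOUP_UTF8 = new JsoupHandler("UTF-8");

    // Reads the response body as a string and parses it into a jsoup Document.
    private static final class JsoupHandler implements ResponseHandler<Document> {
        // Charset name handed to Http.asString; null when none was specified.
        private final String charset;

        JsoupHandler() { this(null); }
        JsoupHandler(final String charset) {
            this.charset = charset;
        }

        @Override
        public Document handleResponse(final HttpResponse response)
                throws IOException {
            return Jsoup.parse(Http.asString(response, charset));
        }
    }
}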

And here is an example of how to use it with the builder:

final Document doc = Http.get("http://mastacode.blogspot.com").use(client).as(Handlers.JSOUP);
final String blogTitle = doc.select("h1.title").text();
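
Since the handler ends up calling Jsoup.parse on a plain string, the selectors are easy to try out without any network access. The following is a small sketch along those lines, not code from the project itself: the class SelectorDemo, its methods and the sample markup are invented for illustration, and it needs the jsoup jar on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class; only the jsoup calls themselves are real API.
final class SelectorDemo {

    // Made-up sample page: a title, two post links and one unrelated link.
    static final String HTML = "<html><body>"
            + "<h1 class='title'>Example Blog</h1>"
            + "<div id='posts'>"
            + "<a class='post' href='/one.html'>First post</a>"
            + "<a class='post' href='/two.html'>Second post</a>"
            + "</div>"
            + "<a href='/about.html'>About</a>"
            + "</body></html>";

    // Tag plus class selector, as in the examples above.
    static String title(final Document doc) {
        return doc.select("h1.title").text();
    }

    // Descendant, class and attribute selectors combined:
    // every a.post with an href attribute inside the element with id "posts".
    static List<String> postLinks(final Document doc) {
        final List<String> hrefs = new ArrayList<String>();
        for (final Element link : doc.select("#posts a.post[href]")) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    public static void main(final String[] args) {
        final Document doc = Jsoup.parse(HTML);
        System.out.println(title(doc));
        System.out.println(postLinks(doc));
    }
}
```

The "About" link is skipped because it is neither inside #posts nor marked with the post class, which is the kind of filtering that takes a loop and several checks with the plain DOM API.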

Although jsoup is still in beta, you should give it a try - I'm sure you'll love it too :-)!
