People like to say that we live in a global village. Internet access has given us many possibilities, but everything comes at a price.
Information on the Internet is not well organised. A typical webmaster focuses on how data is presented, not on how it is stored. Web pages are not easy to parse, and extracting information from them is rarely an easy task.
Sure, more and more websites offer some kind of public API, which makes development easier, but that is only a drop in the ocean. Usually we are forced to work with raw HTML.
HTML is not an easy language to work with. Unlike XML, HTML pages do not have to follow a strict syntax (for example, not all tags have to be closed). Thus we cannot simply run XQuery (which is extremely powerful) against them.
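To make the difference concrete, here is a minimal sketch that feeds two snippets to the XML parser bundled with the JDK. The class and method names are my own, made up for illustration. The only point is that a single unclosed tag, which every browser accepts, is enough to make a strict XML parser give up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class StrictXmlDemo {

    /** Returns true if the markup parses as well-formed XML, false otherwise. */
    public static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the markup (for example, an unclosed tag).
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not expected in this sketch.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormedXml("<p>Hello<br/>world</p>")); // true
        System.out.println(isWellFormedXml("<p>Hello<br>world</p>"));  // false: <br> is never closed
    }
}
```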
So how can we extract data from HTML?
Luckily, there is a way.
Some good souls have created a library called jsoup. It is a "Java library for working with real-world HTML".
I will introduce it through an example.
Let's say I want to show my users information about one of the best films of all time. In order to do this I have to connect to a website about movies and get the data from it.
I will use the following page:
http://www.allmovie.com/movie/the-good-the-bad-and-the-ugly-v20333
1. Prerequisites
First, add the jsoup library to your project. You can do this with a Maven dependency:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.7.2</version>
</dependency>
2. Parsing the web page
In order to extract data we have to parse the web page. It is ridiculously easy with jsoup:
Document doc = Jsoup.connect("http://www.allmovie.com/movie/the-good-the-bad-and-the-ugly-v20333").get();
And that is all. The library does all the necessary work: connecting, downloading and parsing. After that we can process the document.
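If you want to experiment without hitting the network, jsoup can also parse a string that is already in memory. The sketch below is only an illustration: the class name, the sample markup and the user agent and timeout values are my assumptions, not something the site requires. It also shows that jsoup copes with markup where the li tags are never closed.

```java
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseDemo {

    /** Downloads and parses a page. The user agent and timeout are illustrative values. */
    public static Document fetch(String url) throws IOException {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0")
                .timeout(10000) // milliseconds
                .get();
    }

    /** Parses markup that is already in memory, so no network access is needed. */
    public static Document parse(String html) {
        return Jsoup.parse(html);
    }

    public static void main(String[] args) {
        // Deliberately sloppy markup: head, li, body and html are never closed.
        Document doc = parse("<html><head><title>Demo</title><body><ul><li>Western<li>Adventure</ul>");
        System.out.println(doc.title());             // Demo
        System.out.println(doc.select("li").size()); // 2
    }
}
```

Keep in mind that get() throws an IOException when the connection fails, so real code has to handle it somewhere.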
3. Extracting information
To extract information we have to analyse the HTML document. We have to understand its hierarchy to prepare a query (yes, we use queries to select elements).
The most interesting film information is placed inside a div tag whose class attribute is equal to "side-details". Inside this element we have dt tags (descriptions), each immediately followed by a dd tag which contains the data we are searching for.
Basically it looks like this:
<div class="side-details">
  <dt>genres</dt>
  <dd>
    <ul class="warning-list">
      <li><a href="http://www.allmovie.com/genre/western-d656">Western</a></li>
    </ul>
  </dd>
</div>
Here is how we can get the film genre:
Element element = doc.select("* div[class=side-details] dt:contains(genres) + dd").first();
String genre = element.text();
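One caveat: first() returns null when nothing matches, so calling text() on the result blows up with a NullPointerException as soon as the page layout changes. Here is a small, self-contained sketch with a guard. The class and method names are hypothetical, and the markup is the fragment shown above, pasted in as a string so the example runs offline.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class GenreExtractor {

    /** Returns the text of the dd that directly follows the "genres" dt, or null if it is missing. */
    public static String extractGenre(Document doc) {
        Element dd = doc.select("* div[class=side-details] dt:contains(genres) + dd").first();
        return dd == null ? null : dd.text();
    }

    public static void main(String[] args) {
        String html = "<div class=\"side-details\">"
                + "<dt>genres</dt>"
                + "<dd><ul class=\"warning-list\">"
                + "<li><a href=\"http://www.allmovie.com/genre/western-d656\">Western</a></li>"
                + "</ul></dd>"
                + "</div>";
        System.out.println(extractGenre(Jsoup.parse(html)));                     // Western
        System.out.println(extractGenre(Jsoup.parse("<p>no details here</p>"))); // null
    }
}
```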
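The selector syntax is modelled on CSS and jQuery selectors. It is described on the official jQuery website and in the documentation of jsoup's Selector class. I encourage you to read it.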
* matches every element. It is followed by a space and another selector. The space is significant: it means that the second selector (div[class=side-details]) is evaluated against all descendants (at any level of the hierarchy) of the elements matched so far. div[class=side-details] selects all div elements whose class attribute equals "side-details". Among the descendants of that div (remember the space) we look for a dt element containing the text "genres". Then comes the + operator, which selects only the sibling that immediately follows (the ~ operator is the one that selects all following siblings). Here it picks the dd element directly after our dt. Finally, the Java method first() returns the first of the matched elements.
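The difference between the sibling operators is easy to check on a small fragment. In the sketch below the second dt/dd pair ("run time") is invented for illustration and is not copied from the page, and the class and helper names are hypothetical. It also suggests that the leading * can be dropped and that div.side-details is a shorter way to match on the class.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class SelectorDemo {

    // Sample fragment; the "run time" pair is made up for this example.
    static final String HTML = "<div class=\"side-details\">"
            + "<dt>genres</dt><dd><ul><li><a href=\"#\">Western</a></li></ul></dd>"
            + "<dt>run time</dt><dd>161 min</dd>"
            + "</div>";

    /** Number of elements in the sample fragment matched by the query. */
    public static int count(String query) {
        return Jsoup.parse(HTML).select(query).size();
    }

    /** Text of the first match, or null when the query matches nothing. */
    public static String firstText(String query) {
        Element match = Jsoup.parse(HTML).select(query).first();
        return match == null ? null : match.text();
    }

    public static void main(String[] args) {
        // + matches only the dd right after the dt; ~ matches every later dd sibling.
        System.out.println(count("dt:contains(genres) + dd")); // 1
        System.out.println(count("dt:contains(genres) ~ dd")); // 2

        // Both spellings find the same element in this fragment.
        System.out.println(firstText("* div[class=side-details] dt:contains(genres) + dd")); // Western
        System.out.println(firstText("div.side-details dt:contains(genres) + dd"));          // Western
    }
}
```

One difference worth knowing: div[class=side-details] requires the class attribute to be exactly "side-details", while div.side-details also matches when the element carries additional classes.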
Done. With only one short line we were able to get useful information from a real-world website.