Sunday, 17 November 2013

How to extract information from HTML

People like to say that we live in a global village. Internet access has given us many possibilities, but everything comes at a price.

Information on the Internet is not well organised. A typical webmaster focuses on how data is presented, not on how it is stored. Web pages are not easy to parse, and extracting information from them is hardly ever an easy task.

Sure, more and more websites offer some kind of public API that makes development easier, but that is only a drop in the ocean. Usually we are forced to work with raw HTML.

HTML is not an easy language to work with. Unlike XML, HTML pages do not have to follow a strict syntax (for example, not all tags have to be closed). Thus we cannot simply use XQuery (which is extremely powerful).

So how can we extract data from HTML?


Luckily there is a way.

Some good souls have created a library called jsoup. It is a "Java library for working with real-world HTML".

I will introduce it through an example.

Let's say I want to show my users information about one of the best films of all time. To do this I have to connect to a website about movies and get the data from it.

I will use the following page:
http://www.allmovie.com/movie/the-good-the-bad-and-the-ugly-v20333

1. Prerequisites


First, add the jsoup library to your project. You can do this with a Maven dependency:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.7.2</version>
</dependency>

2. Parsing web page

 

In order to extract data we have to parse the web page. It is ridiculously easy with jsoup:

Document doc = Jsoup.connect("http://www.allmovie.com/movie/the-good-the-bad-and-the-ugly-v20333").get();

And that is all. The library will do all the necessary work: connecting, downloading and parsing. After that we can process the document.

3. Extracting information

 

To extract information we have to analyse the HTML document. We have to understand its hierarchy in order to prepare a query (yes, we use queries to select elements).

The most interesting film information is placed inside a div tag whose class attribute is equal to "side-details". Inside this element we have dt tags (descriptions), each immediately followed by a dd tag that contains the data we are searching for.

Basically it looks like this:

<div class="side-details">   
    <dt>genres</dt>
    <dd>
        <ul class="warning-list">
            <li><a href="http://www.allmovie.com/genre/western-d656">Western</a></li>
        </ul>
    </dd>
</div>

Here is how we can get the film genre:

Element element = doc.select("* div[class=side-details] dt:contains(genres) + dd").first();
String genre = element.text();

The selector syntax is modelled on CSS and jQuery selectors. It is described in the jsoup documentation and on the official jQuery website. I encourage you to read it.

* matches any element (the leading * is optional here). Next comes a space and another selector. The space is significant: it is the descendant combinator, so the second selector (div[class=side-details]) is matched against all descendants, at any level of the hierarchy. div[class=side-details] selects all div elements whose class attribute equals "side-details". Among the descendants of that div (remember the space) we search for a dt element containing the text "genres". Then we use the + combinator, which selects only the sibling that immediately follows (to select all following siblings we would use ~). That sibling is accepted only if it is a dd element, and finally the Java method first() gives us the first match.

Done. With only one short line we were able to get useful information from a real-world website.
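The difference between + and ~ is easy to check offline. The sketch below parses a small HTML string with Jsoup.parse instead of downloading the page. The class name SelectorDemo, the helper methods and the extra "director" entry in the markup are made up for illustration and are not taken from the real page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectorDemo {

    // Simplified markup; the "director" entry is an assumption added for the demo.
    static final String HTML =
        "<div class=\"side-details\">"
        + "<dt>genres</dt><dd><ul><li><a href=\"#\">Western</a></li></ul></dd>"
        + "<dt>director</dt><dd>Sergio Leone</dd>"
        + "</div>";

    // Returns the text of the dd that directly follows the dt with the given label,
    // or null when nothing matches (first() returns null for an empty result).
    static String detail(Document doc, String label) {
        Element dd = doc.select(
            "div[class=side-details] dt:contains(" + label + ") + dd").first();
        return dd == null ? null : dd.text();
    }

    // Counts every dd sibling that comes after the dt with the given label (~ combinator).
    static int followingDetails(Document doc, String label) {
        Elements all = doc.select(
            "div[class=side-details] dt:contains(" + label + ") ~ dd");
        return all.size();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(detail(doc, "genres"));
        System.out.println(followingDetails(doc, "genres"));
    }
}
```

Checking the result of first() for null matters on real pages: when the site changes its markup the selector silently matches nothing.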

Monday, 12 August 2013

Enums and interfaces? Interesting connection

We often have to make a decision based on an enum value. Usually it ends in a long switch statement. What if I told you there is a better way to do this?

1. What do we want?


We want our enum to be self-describing, and to achieve this we created an interface:

public interface SelfDescribing {
 String getDescription();
}


2. What do we have?


public enum MessageType implements SelfDescribing {
 INFO, WARNING, ERROR;

 public String getDescription() {
  switch (this) {
  case ERROR:
   return "What do you want? Go fix your problem.";
  case INFO:
   return "Hi i'm INFO Message";
  case WARNING:
   return "Hi i'm WARNING Message";
  default:
   return null;
  }
 }
}


What we see here is a switch statement. It's just ugly, and after you add a new message type you have to remember to extend it. The compiler won't remind you about it.

3. What is a more elegant solution?


public enum MessageType implements SelfDescribing{
 INFO
 {
  public String getDescription() {
   return "Hi i'm INFO Message";
  }
  
 }, WARNING
 {
  public String getDescription() {
   return "Hi i'm WARNING Message";
  }
 }, ERROR
 {
  public String getDescription() {
   return "What do you want? Go fix your problem.";
  }
 };
}


It's cleaner, more focused on the goal and less error-prone: a new constant that does not implement getDescription will not compile.
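When every constant only returns a fixed text, the same interface can also be implemented with a constructor argument instead of a method body per constant. This is a different technique from the one in the listing, sketched here only as an alternative. The class name EnumDemo is made up, and the types are nested just to keep the example in one file.

```java
public class EnumDemo {

    public interface SelfDescribing {
        String getDescription();
    }

    // Each constant passes its description to the constructor,
    // so a new constant cannot be declared without one.
    public enum MessageType implements SelfDescribing {
        INFO("Hi i'm INFO Message"),
        WARNING("Hi i'm WARNING Message"),
        ERROR("What do you want? Go fix your problem.");

        private final String description;

        MessageType(String description) {
            this.description = description;
        }

        @Override
        public String getDescription() {
            return description;
        }
    }

    public static void main(String[] args) {
        // Callers depend only on the interface; no switch is needed.
        for (SelfDescribing type : MessageType.values()) {
            System.out.println(type.getDescription());
        }
    }
}
```

Constant-specific method bodies remain the better fit when the constants differ in behaviour and not just in data.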

Monday, 5 August 2013

Why do we need Mockito?

It's impossible to create a huge project without unit tests. It seems that their popularity started with the xUnit family of libraries. The ability to write hundreds of unit tests and the ease of reading the results ("green bar") changed the way we program. But what if we can't test our class using only JUnit?


1. Let's define the problem

Consider a situation in which our class calls a method of another object, but for some reason we can't use that object in a unit test.

public final class Publisher {
 private final PlagiarismChecker plagiarismChecker;

 public Publisher(PlagiarismChecker plagiarismChecker) {
  this.plagiarismChecker = plagiarismChecker;
 }

 public void publishNewNovel(Writer writer) throws Plagiarism {
  Novel novel = writer.getNovel();
  plagiarismChecker.check(novel);
  publish(novel);
 }

 private void publish(Novel novel) { 
 }
}

public interface PlagiarismChecker {
 void check(Novel novel) throws Plagiarism;
}

Our publisher wants to publish a new novel. Unfortunately, during tests we cannot check whether the novel is plagiarised: the operation takes a very long time and needs access to a database. For these reasons we cannot use the default PlagiarismChecker implementation in the test environment.

2. Solution

2.1 Prerequisites

First we have to design our class to use some kind of dependency injection. I am not telling you to use Spring or Guice. My point is to remove tight coupling between classes: we must be able to swap an object's dependencies.

2.2 Possible solutions

Once we are able to swap an object's dependencies, we have to decide what to inject. There are three basic approaches. We can use:

  • Stubs - the simplest implementation of an interface. They usually return hard-coded values.
    A Dilbert comic is a great example.
  • Fakes - more advanced than stubs. They offer some basic functionality but work in a simplified way. In our example a fake PlagiarismChecker could compare the novel's title with a list of existing titles.
  • Mocks - they help with assertions. They can return hard-coded and calculated values. They are usually created by third-party libraries.
We are going to try mocks using Mockito.
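Before moving on to Mockito, the first two options can be made concrete with a hand-written stub and fake for PlagiarismChecker, with no library involved. Novel, Plagiarism, the class names and the title-based check are minimal stand-ins assumed for this sketch; the real classes are not shown in this post.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TestDoubles {

    // Minimal stand-ins (assumptions): the post does not show these classes.
    public static class Plagiarism extends Exception {
    }

    public static class Novel {
        private final String title;

        public Novel(String title) {
            this.title = title;
        }

        public String getTitle() {
            return title;
        }
    }

    public interface PlagiarismChecker {
        void check(Novel novel) throws Plagiarism;
    }

    // Stub: hard-coded behaviour, every novel passes the check.
    public static class AlwaysOriginalChecker implements PlagiarismChecker {
        @Override
        public void check(Novel novel) {
            // intentionally empty
        }
    }

    // Fake: simplified but working logic, compares the title with a known list.
    public static class TitleListChecker implements PlagiarismChecker {
        private final Set<String> knownTitles;

        public TitleListChecker(String... knownTitles) {
            this.knownTitles = new HashSet<String>(Arrays.asList(knownTitles));
        }

        @Override
        public void check(Novel novel) throws Plagiarism {
            if (knownTitles.contains(novel.getTitle())) {
                throw new Plagiarism();
            }
        }
    }

    public static void main(String[] args) throws Plagiarism {
        PlagiarismChecker stub = new AlwaysOriginalChecker();
        stub.check(new Novel("Anything")); // never throws

        PlagiarismChecker fake = new TitleListChecker("Moby-Dick");
        fake.check(new Novel("My Own Story")); // unknown title, passes
        try {
            fake.check(new Novel("Moby-Dick")); // known title, rejected
        } catch (Plagiarism e) {
            System.out.println("Plagiarism detected");
        }
    }
}
```

Either class can be passed to the Publisher constructor in a test, which is exactly what the dependency injection from section 2.1 makes possible.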

2.3 Adding Mockito to the project

We have to add a dependency to pom.xml (I assume we are using Maven).

<dependency>
 <groupId>org.mockito</groupId>
 <artifactId>mockito-all</artifactId>
 <version>1.9.5</version>
</dependency>

That's all. Now we can use Mockito. In particular, we can create mocks.

2.4 First mock

PlagiarismChecker plagiarismCheckerMock = mock(PlagiarismChecker.class);
doNothing().when(plagiarismCheckerMock).check(Matchers.any(Novel.class));

These two lines create a fully "functional" PlagiarismChecker object. When we invoke the method check (for any Novel) nothing happens and no exception is thrown. Because check returns void, the call cannot be passed to when(...).thenReturn(...); for void methods Mockito offers the doNothing().when(...) form instead. Short explanation:
  1. We used the static Mockito function doNothing. The Mockito library was designed to be easy to read and understand. Consider it the beginning of a sentence: "do nothing when...".
  2. We pass the mock to the when method and then call check on the result. To specify the call arguments we use Matchers.
  3. A mock already does nothing on a void method by default, so the second line only makes that behaviour explicit.

2.5 Final test

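import static org.mockito.Mockito.doNothing;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.when;

import org.junit.Test;
import org.mockito.Matchers;

public final class PublisherTest {

 @Test
 public void testGetNovel() throws Plagiarism {
  // check() returns void, so it is stubbed with doNothing(), not when().thenReturn()
  PlagiarismChecker plagiarismCheckerMock = mock(PlagiarismChecker.class);
  doNothing().when(plagiarismCheckerMock).check(Matchers.any(Novel.class));

  Novel novel = new Novel();
  Writer writerMock = mock(Writer.class);
  when(writerMock.getNovel()).thenReturn(novel);

  Publisher publisher = new Publisher(plagiarismCheckerMock);
  publisher.publishNewNovel(writerMock);

  // the publisher must have passed the writer's novel to the checker
  verify(plagiarismCheckerMock).check(novel);
 }
}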

What's important: we are not testing Writer and PlagiarismChecker. Our test is fully independent.

3. Summary

I showed the simplest usage of Mockito. Remember that it is a very powerful library. It offers counting of method calls, throwing exceptions, different method results for different parameters, and much more.
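Two of these features can be sketched with the classes from this post: doThrow makes a mock throw an exception, and verify with times counts calls. The nested types below are simplified stand-ins assumed for the sketch (in particular, the publisher records novels in a list so that the effect is visible, and the class name MockitoFeaturesDemo is made up). The Mockito calls themselves (mock, when, doThrow, verify, times) are part of its public API.

```java
import static org.mockito.Mockito.doThrow;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.times;
import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.when;

import java.util.ArrayList;
import java.util.List;

public class MockitoFeaturesDemo {

    // Simplified stand-ins (assumptions) for the classes used in the post.
    public static class Plagiarism extends Exception { }
    public static class Novel { }
    public interface Writer { Novel getNovel(); }
    public interface PlagiarismChecker { void check(Novel novel) throws Plagiarism; }

    public static class Publisher {
        private final PlagiarismChecker plagiarismChecker;
        private final List<Novel> published = new ArrayList<Novel>();

        public Publisher(PlagiarismChecker plagiarismChecker) {
            this.plagiarismChecker = plagiarismChecker;
        }

        public void publishNewNovel(Writer writer) throws Plagiarism {
            Novel novel = writer.getNovel();
            plagiarismChecker.check(novel);
            published.add(novel); // stands in for the real publish step
        }

        public int publishedCount() { return published.size(); }
    }

    // Returns true when a rejected novel is reported and not published.
    public static boolean rejectsPlagiarism() throws Plagiarism {
        Novel copied = new Novel();
        Writer writer = mock(Writer.class);
        when(writer.getNovel()).thenReturn(copied);

        PlagiarismChecker checker = mock(PlagiarismChecker.class);
        // Stubbing a void method: nothing is thrown on this line itself.
        doThrow(new Plagiarism()).when(checker).check(copied);

        Publisher publisher = new Publisher(checker);
        boolean rejected = false;
        try {
            publisher.publishNewNovel(writer);
        } catch (Plagiarism expected) {
            rejected = true;
        }

        // Counting calls: each collaborator was used exactly once.
        verify(writer, times(1)).getNovel();
        verify(checker, times(1)).check(copied);
        return rejected && publisher.publishedCount() == 0;
    }

    public static void main(String[] args) throws Plagiarism {
        System.out.println(rejectsPlagiarism());
    }
}
```

If one of the verify calls does not hold, Mockito throws an error describing the expected and actual invocations, which is what turns the bar red in a JUnit run.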