How to Use Jsoup for Web Scraping / Crawling a Website in Java




Introduction

In this article we will learn how to crawl/scrape an HTML web page using jsoup. jsoup is a Java library for working with real-world HTML. It provides a convenient API for fetching URLs and for extracting and manipulating data, using HTML5 DOM methods and CSS selectors.
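To give a first feel for the API before the full example, here is a minimal sketch of selecting elements with CSS selectors. It assumes jsoup is already on the classpath. The class name JsoupQuickLook, the helper methods firstText and firstAttr, and the small inline page are illustrative only and are not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class JsoupQuickLook {

    // Small in-memory page (assumption) so the sketch needs no file or network access.
    static final String SAMPLE_HTML =
            "<html><head><title>Jsoup Example</title></head><body>"
            + "<h2>Person Table</h2>"
            + "<a href=\"https://www.programtown.com\">Link 1</a>"
            + "</body></html>";

    // Returns the text of the first element matching the CSS query, or "" if nothing matches.
    static String firstText(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element match = doc.selectFirst(cssQuery);
        return match == null ? "" : match.text();
    }

    // Returns an attribute of the first element matching the CSS query, or "" if nothing matches.
    static String firstAttr(String html, String cssQuery, String attributeName) {
        Element match = Jsoup.parse(html).selectFirst(cssQuery);
        return match == null ? "" : match.attr(attributeName);
    }

    public static void main(String[] args) {
        System.out.println(firstText(SAMPLE_HTML, "title"));
        System.out.println(firstText(SAMPLE_HTML, "h2"));
        System.out.println(firstAttr(SAMPLE_HTML, "a[href]", "href"));
    }
}
```

selectFirst returns null when nothing matches, which is why both helpers check for it before reading text or attributes.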

Our Example

In our example we will parse/scrape the following example HTML page. It contains an HTML table of persons; we will process that table and print its data to the console.

HTML Code

<!DOCTYPE html>
<html>
<head>
<title>
Jsoup Example
</title>
<style>
table, th, td {
  border: 1px solid black;
}
</style>
</head>
<body>

<h2>Person Table</h2>

<table style="width:100%" id="person_table">
  <tr>
    <th>Name</th>
    <th>Age</th> 
    <th>Website</th>
  </tr>
  <tr>
    <td>Sameer</td>
    <td>25</td>
    <td><a href="https://www.programtown.com">Link 1</a></td>
  </tr>
  <tr>
    <td>Waleed</td>
    <td>30</td>
    <td><a href="https://programtown.com">Link 2</a></td>
  </tr>
  <tr>
    <td>Hamid</td>
    <td>33</td>
    <td><a href="http://www.programtown.com">Link 3</a></td>
  </tr>
</table>

</body>
</html>

Requirements:

  1. Jsoup Library

Steps

Follow the steps below:

1) Download and integrate JSOUP library to your project

2) Add example html file to your project

3) Code for scraping html page

4) Run and test your project

1) Download and integrate JSOUP library to your project

First, go to the jsoup site and download the library file jsoup-1.13.1.jar.

Then, in Eclipse, create a new folder named “lib” in your project.

Copy and paste jsoup-1.13.1.jar into the lib folder.

Select jsoup-1.13.1.jar and add it to the build path (right-click, then Build Path > Add to Build Path).

2) Add example html file to your project

Create a new folder named “temp”, download example.html, and paste it inside the temp folder.

3) Code for scraping html page

First we load the HTML file from the project and parse it into a Document object, doc, using the Jsoup.parse() method. Then, using the select method with a CSS selector, we select the Person table by its id, “#person_table”. Next we get the rows of the table as the Elements collection personRecords using the selector tr.

We loop over the table rows and read the td (column) data of each row. The heading row contains th cells instead of td cells, so the size check skips it. Columns 1 and 2 contain the name and age, so we print their text. Column 3 contains a link, so we select its anchor “a” and read its “href” attribute. After the program runs, the Person table data is printed to the console.

File input = new File("temp/example.html");
try {
	// Parse the file as UTF-8; the empty string is the base URI
	Document doc = Jsoup.parse(input, "UTF-8", "");
	// Select the Person table by its id, then all of its rows
	Elements personTable = doc.select("#person_table");
	Elements personRecords = personTable.select("tr");
	int i = 1;
	for (Element personRecord : personRecords) {
		Elements person = personRecord.select("td");
		// The heading row has <th> cells only, so it is skipped here
		if (person.size() == 3) {
			System.out.print(i);
			System.out.print(" Name " + person.get(0).text());
			System.out.print(" Age " + person.get(1).text());
			// attr() returns the href of the first matched anchor
			System.out.print(" Website " + person.get(2).select("a").attr("href"));
			System.out.println();
			i++;
		}
	}
} catch (IOException e) {
	// Thrown if the file cannot be found or read
	e.printStackTrace();
}
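As a variation on the snippet above, the rows could also be collected into objects instead of being printed directly. The following is only a sketch under some assumptions: the Person class and the parsePeople method are hypothetical names introduced for illustration, the HTML is passed in as a string rather than read from temp/example.html, and the age column is assumed to always hold a whole number. It uses the selector "tr:has(td)" so that the heading row is filtered out by the selector itself.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class PersonTableSketch {

    // Hypothetical value type for one table row; not part of jsoup.
    static final class Person {
        final String name;
        final int age;
        final String website;

        Person(String name, int age, String website) {
            this.name = name;
            this.age = age;
            this.website = website;
        }

        @Override
        public String toString() {
            return name + " (" + age + ") " + website;
        }
    }

    static List<Person> parsePeople(String html) {
        Document doc = Jsoup.parse(html);
        List<Person> people = new ArrayList<>();
        // "tr:has(td)" keeps only rows that contain data cells, so the heading row is skipped.
        for (Element row : doc.select("#person_table tr:has(td)")) {
            Elements cells = row.select("td");
            if (cells.size() != 3) {
                continue; // ignore rows that do not have the expected three columns
            }
            // selectFirst returns null when the cell has no link
            Element link = cells.get(2).selectFirst("a[href]");
            String website = link == null ? "" : link.attr("href");
            // Assumption: the age cell is numeric; parseInt throws NumberFormatException otherwise
            int age = Integer.parseInt(cells.get(1).text());
            people.add(new Person(cells.get(0).text(), age, website));
        }
        return people;
    }

    public static void main(String[] args) {
        String html = "<table id=\"person_table\">"
                + "<tr><th>Name</th><th>Age</th><th>Website</th></tr>"
                + "<tr><td>Sameer</td><td>25</td>"
                + "<td><a href=\"https://www.programtown.com\">Link 1</a></td></tr>"
                + "<tr><td>Waleed</td><td>30</td>"
                + "<td><a href=\"https://programtown.com\">Link 2</a></td></tr>"
                + "</table>";
        for (Person p : parsePeople(html)) {
            System.out.println(p);
        }
    }
}
```

Returning a list keeps the parsing separate from the printing, which makes the extraction logic easier to reuse and to check against a small inline table.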

Whole Code

JSOUPExample.java

package com.programtown.example;

import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JSOUPExample {

	public static void main(String[] args) {
		File input = new File("temp/example.html");
		try {
			// Parse the file as UTF-8; the empty string is the base URI
			Document doc = Jsoup.parse(input, "UTF-8", "");
			// Select the Person table by its id, then all of its rows
			Elements personTable = doc.select("#person_table");
			Elements personRecords = personTable.select("tr");
			int i = 1;
			for (Element personRecord : personRecords) {
				Elements person = personRecord.select("td");
				// The heading row has <th> cells only, so it is skipped here
				if (person.size() == 3) {
					System.out.print(i);
					System.out.print(" Name " + person.get(0).text());
					System.out.print(" Age " + person.get(1).text());
					// attr() returns the href of the first matched anchor
					System.out.print(" Website " + person.get(2).select("a").attr("href"));
					System.out.println();
					i++;
				}
			}
		} catch (IOException e) {
			// Thrown if the file cannot be found or read
			e.printStackTrace();
		}
	}

}
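The example above reads a local file and passes an empty base URI as the third argument of Jsoup.parse(), which is fine here because every link in the sample table is already absolute. When crawling a live site, links are often relative, and jsoup can resolve them against a base URI. The sketch below shows one possible way to do that. The class name PersonLinksSketch, the websiteLinks methods, the user agent string, and the URL https://example.com/example.html are all placeholders chosen for illustration; only the parsing part can be tried without network access.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class PersonLinksSketch {

    // Collects the link targets found in the person table as absolute URLs.
    static List<String> websiteLinks(Document doc) {
        List<String> links = new ArrayList<>();
        for (Element anchor : doc.select("#person_table td a[href]")) {
            // absUrl resolves the href against the document's base URI;
            // it returns "" when no absolute URL can be built.
            links.add(anchor.absUrl("href"));
        }
        return links;
    }

    // Convenience overload for HTML that is already in memory.
    static List<String> websiteLinks(String html, String baseUri) {
        return websiteLinks(Jsoup.parse(html, baseUri));
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL (assumption): pass the page you want to scrape as the first argument.
        String url = args.length > 0 ? args[0] : "https://example.com/example.html";
        Document doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (jsoup tutorial sketch)") // illustrative value
                .timeout(10_000) // milliseconds
                .get();
        // A document fetched with connect() uses the page URL as its base URI.
        for (String link : websiteLinks(doc)) {
            System.out.println(link);
        }
    }
}
```

For example, with the base URI https://example.com/people/ a cell containing href="/about" would be reported as https://example.com/about. Before crawling a real site, check its terms of use and robots.txt.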

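4) Run and test your project

After running the program, the Person table data is printed to the console:

1 Name Sameer Age 25 Website https://www.programtown.com
2 Name Waleed Age 30 Website https://programtown.com
3 Name Hamid Age 33 Website http://www.programtown.com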

Conclusion

In this post we have learned how to parse/scrape an HTML page in Java using jsoup.

 