How to use JSOUP for Web Scraping / Crawling a website in Java
Introduction
In this article we will learn how to scrape an HTML web page using Jsoup. Jsoup is a Java library for working with real-world HTML. It provides a convenient API for fetching URLs and for extracting and manipulating data, using HTML5 DOM methods and CSS selectors.
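To make the selector-based API concrete, here is a minimal sketch that parses an HTML string held in memory instead of a file or URL. The class name SelectorSketch, the helper method textsOf, and the inline HTML are made up for this illustration; only Jsoup.parse, select, and text come from the Jsoup API.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorSketch {

    // Returns the text of every element that matches the given CSS selector.
    public static List<String> textsOf(String html, String cssSelector) {
        Document doc = Jsoup.parse(html); // parse the HTML string into a DOM
        List<String> texts = new ArrayList<>();
        for (Element element : doc.select(cssSelector)) {
            texts.add(element.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Hypothetical inline HTML, used only for this illustration.
        String html = "<ul id=\"langs\"><li>Java</li><li>Kotlin</li></ul>";
        System.out.println(textsOf(html, "#langs li")); // prints [Java, Kotlin]
    }
}
```

The same select call works on a Document parsed from a file, which is what the rest of this article does.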
Our Example
In our example we will parse the following HTML page. It contains an HTML table with the id person_table; we will process this table and print its data to the console.
HTML Code
<!DOCTYPE html>
<html>
<head>
<title>
Jsoup Example
</title>
<style>
table, th, td {
border: 1px solid black;
}
</style>
</head>
<body>
<h2>Person Table</h2>
<table style="width:100%" id="person_table">
<tr>
<th>Name</th>
<th>Age</th>
<th>Website</th>
</tr>
<tr>
<td>Sameer</td>
<td>25</td>
<td><a href="https://www.programtown.com">Link 1</a></td>
</tr>
<tr>
<td>Waleed</td>
<td>30</td>
<td><a href="https://programtown.com">Link 2</a></td>
</tr>
<tr>
<td>Hamid</td>
<td>33</td>
<td><a href="http://www.programtown.com">Link 3</a></td>
</tr>
</table>
</body>
</html>
Requirements:
- Jsoup Library
Steps
Follow the steps below:
1) Download and integrate JSOUP library to your project
2) Add example html file to your project
3) Code for scraping html page
4) Run and test your project
1) Download and integrate JSOUP library to your project
First, go to the jsoup site and download the library file jsoup-1.13.1.jar.
Then, in Eclipse, create a new folder named “lib” in your project.
Copy jsoup-1.13.1.jar and paste it into the lib folder.
Select jsoup-1.13.1.jar and add it to the build path.
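If you want to confirm that the jar really ended up on the build path before writing any scraping code, a small check like the following can help. This is only a sketch: the class name JsoupClasspathCheck and the method isOnClasspath are invented for the example, and it relies on org.jsoup.Jsoup being the library's entry-point class, as imported in the code later in this article.

```java
public class JsoupClasspathCheck {

    // Returns true if the named class can be loaded from the current classpath.
    public static boolean isOnClasspath(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (isOnClasspath("org.jsoup.Jsoup")) {
            System.out.println("jsoup is on the build path");
        } else {
            System.out.println("jsoup is NOT on the build path; re-check the lib folder and build path");
        }
    }
}
```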
2) Add example html file to your project
Create a new folder named “temp” in your project, download example.html, and paste it inside the temp folder.
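The code in the next step refers to the file by the relative path temp/example.html, and relative paths resolve against the working directory of the running program (in Eclipse this is normally the project root). The sketch below, with the made-up class name ExampleFileCheck and helper describe, shows one way to see where the path resolves to and whether the file is there.

```java
import java.io.File;

public class ExampleFileCheck {

    // Reports whether a readable file exists at the path and where the path resolves to.
    public static String describe(String relativePath) {
        File file = new File(relativePath);
        String status = (file.isFile() && file.canRead()) ? "found" : "missing";
        return status + ": " + file.getAbsolutePath();
    }

    public static void main(String[] args) {
        // Same relative path that the scraping code uses.
        System.out.println(describe("temp/example.html"));
    }
}
```

If the result starts with "missing", the absolute path in the message shows where the program was actually looking.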
3) Code for scraping html page
The code first loads the HTML file and parses it into a Document object named doc using the Jsoup.parse() method. It then calls select with the CSS selector “#person_table” to pick the person table by its id, and with the selector “tr” to collect the table's rows in the personRecords Elements object. The loop walks over these rows and selects the td cells of each one. The heading row contains only th cells, so the check person.size()==3 skips it. Columns 1 and 2 hold the name and age, so their text is printed directly. Column 3 holds a link, so the code selects the anchor “a” inside it and reads its “href” attribute. When the program finishes, the person table data has been printed to the console.
File input = new File("temp/example.html");
try {
    // Parse the file as UTF-8; the empty string is the base URI
    Document doc = Jsoup.parse(input, "UTF-8", "");
    // Select the table by id, then all of its rows
    Elements personTable = doc.select("#person_table");
    Elements personRecords = personTable.select("tr");
    int i = 1;
    for (Element personRecord : personRecords) {
        Elements person = personRecord.select("td");
        // The heading row has th cells only, so it is skipped here
        if (person.size() == 3) {
            System.out.print(i);
            System.out.print(" Name " + person.get(0).text());
            System.out.print(" Age " + person.get(1).text());
            System.out.print(" Website " + person.get(2).select("a").attr("href"));
            System.out.println();
            i++;
        }
    }
} catch (IOException e) {
    // Thrown if the file is missing or cannot be read
    e.printStackTrace();
}
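The loop above prints each row as it goes. If you would rather collect the rows for later use, one possible variation is sketched below. It is not part of the original example: the class PersonTableSketch, the nested Person class, and the method extractPersons are names invented here, and the sketch parses an HTML string so that it can run without the example file. It also assumes that the Age column always holds a whole number.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class PersonTableSketch {

    // Hypothetical value class holding the three columns of one table row.
    public static class Person {
        public final String name;
        public final int age;
        public final String website;

        public Person(String name, int age, String website) {
            this.name = name;
            this.age = age;
            this.website = website;
        }
    }

    // Collects every data row of the table with id "person_table".
    public static List<Person> extractPersons(String html) {
        Document doc = Jsoup.parse(html);
        List<Person> persons = new ArrayList<>();
        for (Element row : doc.select("#person_table tr")) {
            Elements cells = row.select("td");
            if (cells.size() != 3) {
                continue; // heading row (th cells only) or an unexpected layout
            }
            String name = cells.get(0).text();
            // Assumption: the age cell is a whole number; otherwise parseInt throws
            int age = Integer.parseInt(cells.get(1).text().trim());
            String website = cells.get(2).select("a").attr("href");
            persons.add(new Person(name, age, website));
        }
        return persons;
    }

    public static void main(String[] args) {
        // Shortened copy of the example table, inlined for illustration.
        String html = "<table id=\"person_table\">"
                + "<tr><th>Name</th><th>Age</th><th>Website</th></tr>"
                + "<tr><td>Sameer</td><td>25</td>"
                + "<td><a href=\"https://www.programtown.com\">Link 1</a></td></tr>"
                + "</table>";
        for (Person p : extractPersons(html)) {
            System.out.println(p.name + " (" + p.age + ") " + p.website);
        }
    }
}
```

Keeping the extraction in a method that returns a list makes it easier to reuse the data, for example to sort or filter it, than printing inside the loop.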
Whole Code
JSOUPExample.java
package com.programtown.example;

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JSOUPExample {

    public static void main(String[] args) {
        File input = new File("temp/example.html");
        try {
            // Parse the file as UTF-8; the empty string is the base URI
            Document doc = Jsoup.parse(input, "UTF-8", "");
            // Select the table by id, then all of its rows
            Elements personTable = doc.select("#person_table");
            Elements personRecords = personTable.select("tr");
            int i = 1;
            for (Element personRecord : personRecords) {
                Elements person = personRecord.select("td");
                // The heading row has th cells only, so it is skipped here
                if (person.size() == 3) {
                    System.out.print(i);
                    System.out.print(" Name " + person.get(0).text());
                    System.out.print(" Age " + person.get(1).text());
                    System.out.print(" Website " + person.get(2).select("a").attr("href"));
                    System.out.println();
                    i++;
                }
            }
        } catch (IOException e) {
            // Thrown if the file is missing or cannot be read
            e.printStackTrace();
        }
    }
}
4) Run and test your project
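After you run the program, the person table data is printed to the console, one line per row:
1 Name Sameer Age 25 Website https://www.programtown.com
2 Name Waleed Age 30 Website https://programtown.com
3 Name Hamid Age 33 Website http://www.programtown.com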
Conclusion
In this post we have learned how to parse and scrape an HTML page in Java using Jsoup.