Integrating jsoup with Spring Boot to parse HTML

HBLOG
3 min read · Jun 25, 2024

1. What is jsoup

jsoup is an HTML parser for Java that can parse HTML directly from a URL or from a string. It provides a convenient API for fetching and manipulating data, using DOM methods, CSS selectors, and jQuery-like methods to work with HTML elements, attributes, and text.

jsoup features

jsoup implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers.

  • Extract and parse HTML from URLs, files, or strings.
  • Find and extract data, using DOM traversal or CSS selectors.
  • Manipulate HTML elements, attributes, and text.
  • Clean user-submitted content against a safe whitelist to prevent XSS attacks.
  • Output clean HTML.
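
To make the "find and extract" feature concrete, here is a minimal sketch that parses an HTML string and pulls out data with a CSS selector and a DOM lookup. The sample markup, the class name SelectorSketch, and its method names are made up for illustration; only the jsoup calls (Jsoup.parse, select, getElementById, text, attr, children) come from the library.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SelectorSketch {

    // Hypothetical sample markup, used only for illustration.
    static final String SAMPLE_HTML =
            "<div id=\"posts\">"
          + "<a class=\"post\" href=\"/a.html\">First post</a>"
          + "<a class=\"post\" href=\"/b.html\">Second post</a>"
          + "<a href=\"/about.html\">About</a>"
          + "</div>";

    // CSS selector: collects "text -> href" for every <a> with the class "post".
    static List<String> postLinks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element link : doc.select("a.post")) {
            result.add(link.text() + " -> " + link.attr("href"));
        }
        return result;
    }

    // DOM traversal: counts the direct children of the element with the given id.
    static int childCount(String html, String id) {
        Element container = Jsoup.parse(html).getElementById(id);
        return container == null ? 0 : container.children().size();
    }

    public static void main(String[] args) {
        for (String line : postLinks(SAMPLE_HTML)) {
            System.out.println(line);
        }
        System.out.println("children of #posts: " + childCount(SAMPLE_HTML, "posts"));
    }
}
```

The selector "a.post" skips the third link because it has no "post" class, while the DOM lookup counts all three children of the container.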

Main jsoup classes

In most cases, the following three classes are the ones to focus on.

Jsoup

The Jsoup class is the entry point to any jsoup program. It provides methods for loading and parsing HTML documents from a variety of sources. Some of its important methods are:

  • static Connection connect(String url): creates a new Connection to the given URL, which can then be used to fetch and parse the page.
  • static Document parse(File in, String charsetName): parses the contents of a file, read with the given character set, into a Document.
  • static Document parse(String html): parses the given HTML string into a Document.
  • static String clean(String bodyHtml, Whitelist whitelist): returns safe HTML from the input HTML by parsing it and keeping only the tags and attributes the whitelist allows. (In jsoup 1.14.2 and later, the Whitelist class is named Safelist.)

Other important methods of the Jsoup class can be found at https://jsoup.org/apidocs/org/jsoup/Jsoup.html
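
As a rough illustration of the connect entry point, the sketch below prepares a request with Jsoup.connect and only sends it when fetch is called. The class name ConnectSketch, the User-Agent string, and the 10-second timeout are assumptions chosen for the example, not values taken from the project in this article.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ConnectSketch {

    // Builds a request but does not send it; nothing touches the network here.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0")   // assumed value, for illustration only
                .timeout(10_000);           // assumed timeout, in milliseconds
    }

    // Sends the GET request and parses the response body into a Document.
    static Document fetch(String url) throws IOException {
        return prepare(url).get();
    }

    public static void main(String[] args) {
        // Only inspects the prepared request, so this runs without network access.
        Connection connection = prepare("http://www.liuhaihua.cn/");
        System.out.println(connection.request().url());
        System.out.println(connection.request().timeout());
    }
}
```

Calling ConnectSketch.fetch("http://www.liuhaihua.cn/").title() would download the page and return its title. The project below takes a different route and downloads the HTML with Apache HttpClient before handing it to Jsoup.parse.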

Document class

This class represents an HTML document loaded through the jsoup library. You can use it to perform actions that apply to the entire HTML document. Important methods of the Document class can be found at http://jsoup.org/apidocs/org/jsoup/nodes/Document.html
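
A small sketch of document-level operations, assuming a made-up page held in a string: it reads the title, runs a query over the whole document, and reads the body text. The class name DocumentSketch and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class DocumentSketch {

    // Hypothetical page, used only for illustration.
    static final String PAGE =
            "<html><head><title>Demo page</title></head>"
          + "<body><p>one</p><p>two</p></body></html>";

    // Returns "title | number of <p> elements | body text".
    static String summarize(String html) {
        Document doc = Jsoup.parse(html);
        String title = doc.title();               // text of the <title> element
        int paragraphs = doc.select("p").size();  // query across the whole document
        String bodyText = doc.body().text();      // combined text of <body>
        return title + " | " + paragraphs + " | " + bodyText;
    }

    public static void main(String[] args) {
        System.out.println(summarize(PAGE));
    }
}
```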

Element class

An HTML element is made up of a tag name, attributes, and child nodes. With the Element class, you can extract data, traverse nodes, and manipulate HTML. Important methods of the Element class can be found at http://jsoup.org/apidocs/org/jsoup/nodes/Element.html
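
The following sketch shows the three kinds of Element operations on a single link: reading an attribute, walking up to the parent, and changing attributes, classes, and text. The class name ElementSketch, the method markExternal, and the sample markup are invented for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class ElementSketch {

    // Hypothetical markup, used only for illustration.
    static final String SNIPPET = "<p><a href=\"https://jsoup.org/\">old text</a></p>";

    // Finds the first <a> element and modifies it in place.
    static Element markExternal(String html) {
        Element link = Jsoup.parse(html).select("a").first();
        if (link == null) {
            throw new IllegalArgumentException("no <a> element found");
        }
        link.attr("rel", "nofollow")   // add or overwrite an attribute
            .addClass("external")      // add a CSS class
            .text("jsoup home");       // replace the element's text
        return link;
    }

    public static void main(String[] args) {
        Element link = markExternal(SNIPPET);
        System.out.println(link.attr("href"));        // extract data
        System.out.println(link.parent().tagName());  // traverse to the parent node
        System.out.println(link.outerHtml());         // modified HTML
    }
}
```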

2. Project code

Goal

Parse the list of posts on the liuhaihua.cn homepage.

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>springboot-demo</artifactId>
        <groupId>com.et</groupId>
        <version>1.0-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>

    <artifactId>jsoup</artifactId>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-autoconfigure</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.12.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
        </dependency>
    </dependencies>
</project>

Controller

package com.et.jsoup;

import java.util.Map;

import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class HelloWorldController {

    // Requests to /hello return the parsed post list plus a greeting message.
    @RequestMapping("/hello")
    public Map<String, Object> showHelloWorld() {
        // parseHtml returns a mutable map, so "msg" can be added to it directly.
        Map<String, Object> map = JsoupUtil.parseHtml("http://www.liuhaihua.cn/");
        map.put("msg", "HelloWorld");
        return map;
    }
}

Utilities

package com.et.jsoup;

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.http.HttpEntity;
import org.apache.http.HttpStatus;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.utils.HttpClientUtils;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Downloads a page with Apache HttpClient and extracts the post list with jsoup.
 *
 * @author liuhaihua
 * @version 1.0
 * @date 2024/06/24 9:16
 */
public class JsoupUtil {

    /**
     * Fetches the page at the given URL and returns a map whose "data" entry
     * holds one row (title, href, summary, views) per post found on the page.
     */
    public static Map<String, Object> parseHtml(String url) {
        Map<String, Object> map = new HashMap<>();
        CloseableHttpClient httpClient = HttpClients.createDefault();
        CloseableHttpResponse response = null;
        HttpGet request = new HttpGet(url);
        // Send a browser-like User-Agent header with the request.
        request.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36");
        // Optional: route the request through a proxy.
        // HttpHost proxy = new HttpHost("60.13.42.232", 9999);
        // RequestConfig config = RequestConfig.custom().setProxy(proxy).build();
        // request.setConfig(config);
        try {
            response = httpClient.execute(request);
            if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                HttpEntity httpEntity = response.getEntity();
                String html = EntityUtils.toString(httpEntity, "utf-8");
                System.out.println(html);

                Document document = Jsoup.parse(html);
                System.out.println(document.getElementsByTag("title").first());

                // CSS selector that requires both classes on the container element.
                Element blogMain = document.select(".col-sm-8.blog-main").first();
                List<Map<String, Object>> list = new ArrayList<>();
                // first() returns null when nothing matches, so guard against it.
                if (blogMain != null) {
                    for (Element postItem : blogMain.getElementsByClass("fade-in")) {
                        Map<String, Object> row = new HashMap<>();
                        Elements titleEle = postItem.select(".entry-title a");
                        row.put("title", titleEle.text());
                        row.put("href", titleEle.attr("href"));
                        Elements footEle = postItem.select(".archive-content");
                        row.put("summary", footEle.text());
                        Elements view = postItem.select(".views");
                        System.out.println(view.text());
                        row.put("views", view.text());
                        System.out.println("*********************************");
                        list.add(row);
                    }
                }
                map.put("data", list);
            } else {
                // Non-200 response: print the body to help with debugging.
                System.out.println(EntityUtils.toString(response.getEntity(), "utf-8"));
            }
        } catch (IOException e) {
            // ClientProtocolException is a subclass of IOException, so one catch covers both.
            e.printStackTrace();
        } finally {
            HttpClientUtils.closeQuietly(response);
            HttpClientUtils.closeQuietly(httpClient);
        }
        return map;
    }

    public static void main(String[] args) {
        parseHtml("http://www.liuhaihua.cn/");
    }
}

DemoApplication.java

package com.et.jsoup;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class DemoApplication {

    public static void main(String[] args) {
        SpringApplication.run(DemoApplication.class, args);
    }
}

The code above covers only the key parts; the complete code is in the repository below.

Code repositories

3. Testing
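
One way to check the parsing logic without depending on the live site is to run the same selectors against a small inline HTML fixture. The sketch below is an assumption about how that could look: the fixture markup only mimics the class names that JsoupUtil selects on, and PostListParserSketch and parsePosts are hypothetical names that are not part of the project.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class PostListParserSketch {

    // Hypothetical fixture that mimics the class names used by JsoupUtil.
    static final String FIXTURE =
            "<div class=\"col-sm-8 blog-main\">"
          + "<article class=\"fade-in\">"
          + "<h2 class=\"entry-title\"><a href=\"/post/1\">First</a></h2>"
          + "<div class=\"archive-content\">Summary one</div>"
          + "<span class=\"views\">12</span>"
          + "</article>"
          + "</div>";

    // Pure function: HTML in, one row per post out. No HTTP involved.
    static List<Map<String, String>> parsePosts(String html) {
        Document doc = Jsoup.parse(html);
        List<Map<String, String>> posts = new ArrayList<>();
        // Matches ".fade-in" elements inside a container that has both classes.
        for (Element item : doc.select(".col-sm-8.blog-main .fade-in")) {
            Elements titleLink = item.select(".entry-title a");
            Map<String, String> row = new LinkedHashMap<>();
            row.put("title", titleLink.text());
            row.put("href", titleLink.attr("href"));
            row.put("summary", item.select(".archive-content").text());
            row.put("views", item.select(".views").text());
            posts.add(row);
        }
        return posts;
    }

    public static void main(String[] args) {
        System.out.println(parsePosts(FIXTURE));
    }
}
```

For an end-to-end check, start DemoApplication and request /hello on whatever port the application is configured to use (Spring Boot defaults to 8080). The JSON response should contain the "data" list and "msg": "HelloWorld".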

4. References
