1. What is jsoup
jsoup is an HTML parser for Java that can parse HTML directly from a URL or from text content. It provides a convenient API for fetching and manipulating data, using DOM traversal, CSS selectors, and jQuery-like methods to work with HTML elements, attributes, and text.
jsoup features
jsoup implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers.
- Extract and parse HTML from URLs, files, or strings.
- Find and extract data, using DOM traversal or CSS selectors.
- Manipulate HTML elements, attributes, and text.
- Clean up user submissions against a secure whitelist to prevent XSS attacks.
- Output clean HTML.
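The first two features can be illustrated with a minimal sketch that parses an HTML string and reads data from it, once with a CSS selector and once with a DOM-style lookup. The class name SelectorSketch and the sample markup are assumptions made up for this illustration; they do not come from any real site.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

class SelectorSketch {

    // Hypothetical sample markup, used only for this illustration.
    static final String HTML =
            "<html><head><title>Sample</title></head><body>"
          + "<ul id=\"links\">"
          + "<li><a class=\"ext\" href=\"https://jsoup.org/\">jsoup</a></li>"
          + "<li><a href=\"/about\">About</a></li>"
          + "</ul></body></html>";

    /** Returns the href of every link inside the #links list, using a CSS selector. */
    static List<String> linkTargets(String html) {
        Document doc = Jsoup.parse(html);
        Elements links = doc.select("#links a[href]");
        List<String> hrefs = new ArrayList<>();
        for (Element link : links) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    /** Returns the text of the first element with class "ext", using a DOM-style lookup. */
    static String firstExternalLinkText(String html) {
        Document doc = Jsoup.parse(html);
        Element first = doc.getElementsByClass("ext").first();
        // first() returns null when nothing matched.
        return first == null ? "" : first.text();
    }

    public static void main(String[] args) {
        System.out.println(linkTargets(HTML));
        System.out.println(firstExternalLinkText(HTML));
    }
}
```

Both lookups work on the same parsed Document; the selector form is usually shorter when the query combines an id, a tag, and an attribute.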
jsoup main classes
In most cases, the three classes below are the ones to focus on.
Jsoup
The Jsoup class is the entry point to any jsoup program and provides methods for loading and parsing HTML documents from a variety of sources. Some of the important methods of the Jsoup class are as follows:
- static Connection connect(String url): creates and returns a Connection to the given URL.
- static Document parse(File in, String charsetName): parses the contents of a file, read with the specified character set, into a Document.
- static Document parse(String html): parses the given HTML string into a Document.
- static String clean(String bodyHtml, Whitelist whitelist): returns safe HTML by parsing the input HTML and keeping only the tags and attributes that the whitelist allows. jsoup 1.14.2 and later provide Safelist as the replacement for Whitelist.
The other methods of the Jsoup class are listed at https://jsoup.org/apidocs/org/jsoup/Jsoup.html
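As a small sketch of the parsing entry points, the example below compares Jsoup.parse(String html) with the overload Jsoup.parse(String html, String baseUri), which records a base URI so that relative links can be resolved later. The class name ParseEntryPointsSketch and the example.com URLs are assumptions chosen for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ParseEntryPointsSketch {

    /** Parses with a base URI and resolves the first link against it. */
    static String firstAbsoluteLink(String html, String baseUri) {
        // The base URI is stored on the document and used by absUrl(...).
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.selectFirst("a[href]");
        return link == null ? "" : link.absUrl("href");
    }

    /** Parses without a base URI; a relative link then cannot be made absolute. */
    static String firstAbsoluteLinkWithoutBase(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.selectFirst("a[href]");
        // absUrl returns an empty string when no absolute URL can be built.
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<a href=\"/archives/1.html\">post</a>";
        System.out.println(firstAbsoluteLink(html, "https://example.com/blog/"));
        System.out.println(firstAbsoluteLinkWithoutBase(html));
    }
}
```

When the HTML was downloaded separately, as in the utility class later in this article, passing the page URL as the base URI is what makes absUrl usable.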
Document class
This class represents an HTML document loaded through the jsoup library. You can use this class to perform actions that apply to the entire HTML document. Important methods of the Document class can be found at http://jsoup.org/apidocs/org/jsoup/nodes/Document.html.
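A few document-level operations can be sketched as follows: reading the title, counting elements in the body, and changing the title. The class name DocumentSketch and the sample markup are assumptions for illustration only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class DocumentSketch {

    // Hypothetical sample markup, used only for this illustration.
    static final String HTML =
            "<html><head><title>My Blog</title></head>"
          + "<body><p>one</p><p>two</p></body></html>";

    /** Reads the text of the document's title element. */
    static String title(String html) {
        return Jsoup.parse(html).title();
    }

    /** Counts the paragraph elements inside the document body. */
    static int paragraphCount(String html) {
        Document doc = Jsoup.parse(html);
        return doc.body().select("p").size();
    }

    /** Replaces the title and reads it back from the same document. */
    static String retitled(String html, String newTitle) {
        Document doc = Jsoup.parse(html);
        doc.title(newTitle);
        return doc.title();
    }

    public static void main(String[] args) {
        System.out.println(title(HTML));
        System.out.println(paragraphCount(HTML));
        System.out.println(retitled(HTML, "New title"));
    }
}
```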
Element class
An HTML element is made up of a tag name, attributes, and child nodes. With the Element class, you can extract data, traverse nodes, and manipulate HTML. Important methods of the Element class can be found at http://jsoup.org/apidocs/org/jsoup/nodes/Element.html.
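The three uses named above (extracting data, traversing nodes, manipulating HTML) can be sketched on a single link element. The class name ElementSketch and the sample markup are assumptions for illustration; the sketch also assumes the markup contains at least one link, since selectFirst returns null otherwise.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ElementSketch {

    // Hypothetical sample markup, used only for this illustration.
    static final String HTML =
            "<div id=\"post\">"
          + "<h2 class=\"entry-title\"><a href=\"/a.html\">First post</a></h2>"
          + "<p>Summary text</p>"
          + "</div>";

    /** Extract: tag name, attribute value, and text of the first link. */
    static String describeLink(String html) {
        Element link = Jsoup.parse(html).selectFirst("a");
        return link.tagName() + " " + link.attr("href") + " " + link.text();
    }

    /** Traverse: walk from the link up to its parent element. */
    static String parentTag(String html) {
        Element link = Jsoup.parse(html).selectFirst("a");
        return link.parent().tagName();
    }

    /** Manipulate: set an attribute, add a class, and replace the text. */
    static Element rewriteLink(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.selectFirst("a");
        // Each of these setters returns the element, so the calls can be chained.
        return link.attr("rel", "nofollow").addClass("seen").text("Read more");
    }

    public static void main(String[] args) {
        System.out.println(describeLink(HTML));
        System.out.println(parentTag(HTML));
        System.out.println(rewriteLink(HTML).outerHtml());
    }
}
```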
2. Code project
Goal
Parse the article list on the liuhaihua.cn homepage.
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>springboot-demo</artifactId>
        <groupId>com.et</groupId>
        <version>1.0-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>
    <artifactId>jsoup</artifactId>
    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-autoconfigure</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.12.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
        </dependency>
    </dependencies>
</project>
controller
package com.et.jsoup;

import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import java.util.Map;

@RestController
public class HelloWorldController {

    @RequestMapping("/hello")
    public Map<String, Object> showHelloWorld() {
        // Parse the homepage, then add a message to the returned map.
        Map<String, Object> map = JsoupUtil.parseHtml("http://www.liuhaihua.cn/");
        map.put("msg", "HelloWorld");
        return map;
    }
}
Utilities
package com.et.jsoup;

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.http.HttpStatus;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Fetches a page with Apache HttpClient and extracts its post list with jsoup.
 *
 * @author liuhaihua
 * @version 1.0
 * @date 2024/06/24 9:16
 */
public class JsoupUtil {

    public static Map<String, Object> parseHtml(String url) {
        Map<String, Object> map = new HashMap<>();
        HttpGet request = new HttpGet(url);
        // Send a browser-like User-Agent with the request.
        request.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36");
        // Optional proxy configuration:
        // HttpHost proxy = new HttpHost("60.13.42.232", 9999);
        // RequestConfig config = RequestConfig.custom().setProxy(proxy).build();
        // request.setConfig(config);

        // try-with-resources closes the response and the client, even on failure.
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(request)) {
            String html = EntityUtils.toString(response.getEntity(), "utf-8");
            System.out.println(html); // debug output: raw response body
            if (response.getStatusLine().getStatusCode() != HttpStatus.SC_OK) {
                return map;
            }
            Document document = Jsoup.parse(html);
            System.out.println(document.getElementsByTag("title").first());

            // Match the main column by both of its classes; a selector does not
            // depend on the exact text of the class attribute.
            Element blogMain = document.selectFirst(".col-sm-8.blog-main");
            if (blogMain == null) {
                return map; // the page does not have the expected layout
            }
            Elements postItems = blogMain.getElementsByClass("fade-in");
            List<Map<String, Object>> list = new ArrayList<>();
            for (Element postItem : postItems) {
                Map<String, Object> row = new HashMap<>();
                Elements titleEle = postItem.select(".entry-title a");
                row.put("title", titleEle.text());
                row.put("href", titleEle.attr("href"));
                row.put("summary", postItem.select(".archive-content").text());
                row.put("views", postItem.select(".views").text());
                list.add(row);
            }
            map.put("data", list);
        } catch (IOException e) {
            e.printStackTrace();
        }
        return map;
    }

    public static void main(String[] args) {
        System.out.println(parseHtml("http://www.liuhaihua.cn/"));
    }
}
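Because the selectors in JsoupUtil depend on the page layout, it can help to try the extraction step on a fixed HTML string, without any network access. The following is a minimal sketch under stated assumptions: the class name PostListSketch and the sample markup are hypothetical. The markup only mirrors the class names that JsoupUtil queries (col-sm-8 blog-main, fade-in, entry-title, archive-content, views); the structure of the real page may differ.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PostListSketch {

    // Hypothetical markup that reuses the class names queried by JsoupUtil.
    static final String SAMPLE =
            "<div class=\"col-sm-8 blog-main\">"
          + "<article class=\"fade-in\">"
          + "<h2 class=\"entry-title\"><a href=\"/archives/1.html\">First</a></h2>"
          + "<div class=\"archive-content\">Summary one</div>"
          + "<span class=\"views\">10</span>"
          + "</article>"
          + "<article class=\"fade-in\">"
          + "<h2 class=\"entry-title\"><a href=\"/archives/2.html\">Second</a></h2>"
          + "<div class=\"archive-content\">Summary two</div>"
          + "<span class=\"views\">20</span>"
          + "</article>"
          + "</div>";

    /** Extracts title, href, summary, and views for each post item. */
    static List<Map<String, String>> extractPosts(String html) {
        Document doc = Jsoup.parse(html);
        List<Map<String, String>> posts = new ArrayList<>();
        // Post items are the .fade-in elements inside the main column.
        for (Element item : doc.select(".col-sm-8.blog-main .fade-in")) {
            Map<String, String> row = new LinkedHashMap<>();
            Element titleLink = item.selectFirst(".entry-title a");
            row.put("title", titleLink == null ? "" : titleLink.text());
            row.put("href", titleLink == null ? "" : titleLink.attr("href"));
            row.put("summary", item.select(".archive-content").text());
            row.put("views", item.select(".views").text());
            posts.add(row);
        }
        return posts;
    }

    public static void main(String[] args) {
        for (Map<String, String> post : extractPosts(SAMPLE)) {
            System.out.println(post);
        }
    }
}
```

Keeping the extraction in a method that takes a plain String separates the parsing from the HTTP call, so a layout change on the site shows up as a failing check on saved HTML rather than as an error at request time.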
DemoApplication.java
package com.et.jsoup;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class DemoApplication {

    public static void main(String[] args) {
        SpringApplication.run(DemoApplication.class, args);
    }
}
Only the key code is shown above; the complete code can be found in the repository below.
Code repositories
3. Testing
- Start the Spring Boot application
- Open http://127.0.0.1:8088/hello; the response contains the parsing result.
4. References
- Official website: https://jsoup.org/
- GitHub: https://github.com/jhy/jsoup/
- http://www.liuhaihua.cn/archives/710776.html