1. Introduction to HanLP
HanLP is an NLP toolkit composed of a series of models and algorithms, with the goal of popularizing natural language processing in production environments. HanLP features complete functionality, efficient performance, a clear architecture, an up-to-date corpus, and customizability. Its algorithms have been vetted by industry and academia, and the companion book “Introduction to Natural Language Processing” has been published. HanLP 2.x, based on deep learning, has now been officially released; this next generation of state-of-the-art NLP technology supports joint tasks across 104 languages, including Simplified and Traditional Chinese, English, Japanese, Russian, French, and German.
HanLP provides the following functions:
Chinese word segmentation
HMM-Bigram (best balance of speed and accuracy; about 100 MB of memory)
- Shortest-path segmentation, N-shortest-path segmentation
Character-based word construction (focuses on accuracy; trained on the world’s largest corpus; can recognize new words; suitable for NLP tasks)
- Perceptron segmentation, CRF segmentation
Dictionary-based segmentation (focuses on speed, tens of millions of characters per second; memory-saving)
- Extremely fast dictionary segmentation
All segmenters support:
- Full-segmentation mode for indexing
- User-defined dictionaries
- Compatibility with Traditional Chinese
- Training your own domain-specific models
Part-of-speech tagging
- HMM POS tagging (fast)
- Perceptron POS tagging, CRF POS tagging (high accuracy)
Named entity recognition
- Named entity recognition based on HMM role annotation (fast): Chinese name recognition, transliterated name recognition, Japanese name recognition, place name recognition, organization name recognition
- Named entity recognition based on linear models (high accuracy): perceptron NER, CRF NER
Keyword extraction
- TextRank keyword extraction
Automatic summarization
- TextRank automatic summarization
Phrase extraction
- Phrase extraction based on mutual information and left/right information entropy
Pinyin conversion
- Polyphonic characters, initials, finals, tones
Conversion between Simplified and Traditional Chinese
- Differing words between Simplified and Traditional Chinese (Simplified Chinese, Traditional Chinese, Taiwan Traditional Chinese, Hong Kong Traditional Chinese)
Text recommendation
- Semantic recommendation, pinyin recommendation, word recommendation
Dependency syntax analysis
- High-performance dependency parser based on neural networks
- Beam-search dependency parser based on the ArcEager transition system
Text classification
- Sentiment analysis
Text clustering
- K-means, repeated bisection, automatic inference of the cluster count k
word2vec
- Word vector training, loading, word similarity computation, semantic operations, querying, K-means clustering
- Document semantic similarity calculation
Corpus tools
- Some default models are trained on small corpora, and users are encouraged to train their own. All modules provide training interfaces; the 1998 People’s Daily corpus can serve as a reference.
While providing rich functionality, HanLP keeps its internal modules loosely coupled, loads models lazily, provides services statically, and publishes dictionaries as plain text, which makes it very convenient to use. The default models are trained on the world’s largest Chinese corpus, and corpus-processing tools are bundled to help users train their own models; a short sketch of the convenience APIs follows.
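As a taste of the API behind this feature list, the following minimal example exercises a few of the one-line convenience methods on com.hankcs.hanlp.HanLP (the class name HanLpQuickTour and the sample sentences are illustrative only; all calls below work with the zero-configuration Portable version):
import com.hankcs.hanlp.HanLP;

public class HanLpQuickTour
{
    public static void main(String[] args)
    {
        String text = "商品和服务";
        // Chinese word segmentation with the default segmenter
        System.out.println(HanLP.segment(text));
        // TextRank keyword extraction: top 3 keywords
        System.out.println(HanLP.extractKeyword("HanLP是一系列模型与算法组成的自然语言处理工具包", 3));
        // Pinyin conversion
        System.out.println(HanLP.convertToPinyinList(text));
        // Simplified-to-Traditional conversion
        System.out.println(HanLP.convertToTraditionalChinese("用笔记本电脑写程序"));
    }
}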
2. Download and Configuration
Method 1: Maven
For user convenience, a Portable version with a built-in data package is provided. Just add the following to pom.xml:
<dependency>
    <groupId>com.hankcs</groupId>
    <artifactId>hanlp</artifactId>
    <version>portable-1.8.4</version>
</dependency>
With zero configuration you can use the basic functions (all functions except character-based word construction and dependency parsing). If you have customization needs, you can refer to method 2 and configure via hanlp.properties (the Portable version also supports hanlp.properties).
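As a quick smoke test of the zero-configuration setup, something like the following (the class name is illustrative) should print a list of segmented terms with their part-of-speech tags:
import com.hankcs.hanlp.HanLP;

public class PortableSmokeTest
{
    public static void main(String[] args)
    {
        // With the Portable artifact on the classpath, no data package
        // or hanlp.properties is needed for basic segmentation
        System.out.println(HanLP.segment("你好，欢迎使用HanLP！"));
    }
}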
Method 2: Download the jar, data, and hanlp.properties
HanLP separates data from program code, giving users the freedom to customize.
1. Download data.zip
After downloading, unzip it to any directory, then tell HanLP the location of the data package through the configuration file.
The data in HanLP is divided into dictionaries and models: dictionaries are required for lexical analysis, and models are required for syntactic analysis.
data
│
├─dictionary
└─model
Users can add, delete, and replace files freely; if functions such as syntactic analysis are not needed, the model folder can be deleted at any time.
There is no absolute boundary between a model and a dictionary: the fact that the hidden Markov model is packaged as a dictionary anyone can edit does not mean it is not a model.
The GitHub code base already contains the dictionaries from data.zip and can be compiled and run directly with automatic caching; the models require a separate download.
2. Download the jar and configuration file: hanlp-release.zip
The configuration file tells HanLP the location of the data package; just modify the first line:
root=D:/JavaProjects/HanLP/
root must be set to the parent directory of data. For example, if the data directory is /Users/hankcs/Documents/data, then root=/Users/hankcs/Documents/.
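In other words, a minimal hanlp.properties can consist of just the root key (every other setting falls back to its default; the path below is illustrative):
# hanlp.properties -- minimal configuration
# root must point at the parent directory of the data folder
root=/Users/hankcs/Documents/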
Finally, put hanlp.properties on the classpath. For most projects it can be placed in the src or resources directory, and the IDE will copy it to the classpath during compilation. Besides the configuration file, you can also set root via the HANLP_ROOT environment variable. For Android projects, please refer to the demo.
If it is misplaced, HanLP will suggest an appropriate path for the current environment and try to read the data package from the project root directory.
3. Code Project
This tutorial uses the first method; you can try the second method if you are interested.
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>springboot-demo</artifactId>
        <groupId>com.et</groupId>
        <version>1.0-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>

    <artifactId>hanlp-demo</artifactId>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-autoconfigure</artifactId>
        </dependency>
        <dependency>
            <groupId>com.hankcs</groupId>
            <artifactId>hanlp</artifactId>
            <version>portable-1.8.4</version>
        </dependency>
    </dependencies>
</project>
application.yaml
server:
  port: 8088
DemoApplication.java
package demo.et.hanlp.demo;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class DemoApplication
{
    public static void main(String[] args)
    {
        SpringApplication.run(DemoApplication.class, args);
    }
}
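The test class in the next section runs as a plain main method, but since spring-boot-starter-web is on the classpath, a small controller such as the following could expose segmentation over HTTP on port 8088 (SegmentController and the /segment endpoint are illustrative additions, not part of the original project):
package demo.et.hanlp.demo;

import com.hankcs.hanlp.HanLP;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class SegmentController
{
    // GET /segment?text=... returns the segmented term list as a string
    @GetMapping("/segment")
    public String segment(@RequestParam String text)
    {
        return HanLP.segment(text).toString();
    }
}
With the application running, a request such as curl "http://localhost:8088/segment?text=商品和服务" would then return the term list.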
code repository
4. Test
package com.et.hanlp.demo;

import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.collection.AhoCorasick.AhoCorasickDoubleArrayTrie;
import com.hankcs.hanlp.dictionary.CoreDictionary;
import com.hankcs.hanlp.dictionary.CustomDictionary;

/**
 * Demonstrates dynamically adding and removing custom dictionary entries
 *
 * @author hankcs
 */
public class DemoCustomDictionary
{
    public static void main(String[] args)
    {
        // Add a word dynamically
        CustomDictionary.add("攻城狮");
        // CustomDictionary.add("小金保");
        // Force-insert a word
        CustomDictionary.insert("白富美", "nz 1024");
        // Remove a word (uncomment to try)
        // CustomDictionary.remove("攻城狮");
        System.out.println(CustomDictionary.add("单身狗", "nz 1024 n 1"));
        System.out.println(CustomDictionary.get("单身狗"));

        String text = "攻城狮逆袭单身狗，迎娶白富美，走上人生巅峰，小金保值得你需要"; // As if! Hahaha
        // Use the AhoCorasickDoubleArrayTrie automaton to scan the text for custom words
        final char[] charArray = text.toCharArray();
        CustomDictionary.parseText(charArray, new AhoCorasickDoubleArrayTrie.IHit<CoreDictionary.Attribute>()
        {
            @Override
            public void hit(int begin, int end, CoreDictionary.Attribute value)
            {
                System.out.printf("[%d:%d]=%s %s\n", begin, end, new String(charArray, begin, end - begin), value);
            }
        });

        System.out.println("########################################");
        // The custom dictionary takes effect in all segmenters
        System.out.println(HanLP.segment(text));
    }
}
The result is as follows:
true
nz 1024 n 1
[0:3]=攻城狮 nz 1
[5:8]=单身狗 nz 1024 n 1
[11:14]=白富美 nz 1024
[0:2]=攻城 vi 15
[3:5]=逆袭 nz 199
########################################
[攻城狮/nz, 逆袭/nz, 单身狗/nz, ,/w, 迎娶/v, 白富美/nz, ,/w, 走/v, 上/f, 人生/n, 巅峰/n]
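The demo distinguishes add (dynamic addition) from insert (forced insertion): per the library’s documentation, insert overwrites an existing entry’s attributes while add leaves them unchanged. That difference can be isolated in a small sketch (assuming the same CustomDictionary API; the word 测试词 is arbitrary):
import com.hankcs.hanlp.dictionary.CustomDictionary;

public class AddVsInsert
{
    public static void main(String[] args)
    {
        // add is non-overwriting: the second call leaves "nz 1" in place
        CustomDictionary.add("测试词", "nz 1");
        CustomDictionary.add("测试词", "n 100");
        System.out.println(CustomDictionary.get("测试词")); // nz 1
        // insert is overwriting: the attribute becomes "n 100"
        CustomDictionary.insert("测试词", "n 100");
        System.out.println(CustomDictionary.get("测试词")); // n 100
    }
}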