1. What is Tika?
Tika is a content analysis toolkit with a comprehensive set of parser classes. It can parse files in nearly all common formats and extract their metadata and text content, so it works well as a general-purpose parsing tool, particularly in the data collection and processing stages of a search engine. Tika began as a sub-project of Apache Lucene and is now a top-level Apache project; it is commonly used to extract content from large numbers of documents for indexing in Lucene applications. The Apache Tika toolkit automatically detects the type of a document (such as Word, PPT, XML, or CSV) and extracts its metadata and text content. Tika integrates existing document parsing libraries and provides a unified interface over them, which makes parsing different document types easier. It is useful for search engine indexing, content analysis, format conversion, and more.
Tika architecture
Application developers can easily integrate Tika into their applications. Tika also offers a command-line interface and a graphical user interface, which make it more user-friendly. In this chapter, we will discuss the four important modules that make up the Tika architecture:
- Language detection mechanism.
- MIME detection mechanism.
- Parser interface.
- Tika Facade class.
Language detection mechanism
Whenever a text document is passed to Tika, it detects the language the text is written in. It accepts documents that carry no language annotation and adds the detected language to the document's metadata. To support language identification, Tika has a class called LanguageIdentifier in the package org.apache.tika.language, together with a language identification repository that contains the algorithms for detecting the language of a given text. Internally, Tika uses an N-gram algorithm for language detection.
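To make the N-gram idea concrete, here is a minimal sketch in plain Java. It is not Tika's implementation: the class NGramLanguageSketch, the two toy training sentences, and the cosine-similarity scoring are all assumptions chosen for illustration. It builds a character trigram profile for the input and picks the reference profile that is most similar.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class NGramLanguageSketch {

    /** Counts overlapping character trigrams; runs of non-letters become "_". */
    public static Map<String, Integer> trigramProfile(String text) {
        String normalized = "_" + text.toLowerCase().replaceAll("[^\\p{L}]+", "_") + "_";
        Map<String, Integer> profile = new HashMap<>();
        for (int i = 0; i + 3 <= normalized.length(); i++) {
            profile.merge(normalized.substring(i, i + 3), 1, Integer::sum);
        }
        return profile;
    }

    /** Cosine similarity between two trigram count vectors. */
    public static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            int count = entry.getValue();
            normA += count * count;
            dot += count * b.getOrDefault(entry.getKey(), 0);
        }
        for (int count : b.values()) {
            normB += count * count;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Toy reference profiles (assumption): one sample sentence per language. */
    public static Map<String, Map<String, Integer>> toyProfiles() {
        Map<String, Map<String, Integer>> profiles = new LinkedHashMap<>();
        profiles.put("en", trigramProfile(
                "the quick brown fox jumps over the lazy dog and this is the way that we were going with them"));
        profiles.put("de", trigramProfile(
                "der schnelle braune fuchs springt und das ist der weg den wir mit ihnen gegangen sind"));
        return profiles;
    }

    /** Returns the language code whose profile is most similar to the text. */
    public static String detect(String text) {
        Map<String, Integer> profile = trigramProfile(text);
        String best = "unknown";
        double bestScore = 0;
        for (Map.Entry<String, Map<String, Integer>> candidate : toyProfiles().entrySet()) {
            double score = similarity(profile, candidate.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = candidate.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(detect("this is the way"));
        System.out.println(detect("das ist der weg"));
    }
}
```

A real detector would use profiles trained on large corpora; the single-sentence profiles here only serve to show the mechanics.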
MIME detection mechanism
Tika can detect document types according to MIME standards. The default MIME type detection in Tika is implemented by org.apache.tika.mime.MimeTypes, which uses the org.apache.tika.detect.Detector interface for most content type detection. Internally, Tika uses several techniques, such as file name globs, content type hints, magic bytes, and character encodings.
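Two of these techniques, magic bytes and file name globs, can be sketched in a few lines of plain Java. This is an illustration only, not how Tika's MimeTypes class is written: the class MimeDetectionSketch and its small hard-coded signature table are assumptions. The sketch checks the leading bytes first and falls back to the file name extension.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class MimeDetectionSketch {

    // Well-known file signatures ("magic bytes") at offset 0.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PNG_MAGIC = {(byte) 0x89, 'P', 'N', 'G'};
    private static final byte[] ZIP_MAGIC = {'P', 'K', 3, 4};

    private static final String DOCX_TYPE =
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document";

    /** Returns true when data begins with the given signature. */
    public static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** Detects a MIME type from the first bytes of a file and its name (name may be null). */
    public static String detect(byte[] head, String fileName) {
        String name = fileName == null ? "" : fileName.toLowerCase(Locale.ROOT);
        if (startsWith(head, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(head, PNG_MAGIC)) {
            return "image/png";
        }
        if (startsWith(head, ZIP_MAGIC)) {
            // .docx files are ZIP containers, so the name refines the result.
            return name.endsWith(".docx") ? DOCX_TYPE : "application/zip";
        }
        // No signature matched: fall back to simple name patterns (*.txt, *.html).
        if (name.endsWith(".txt")) {
            return "text/plain";
        }
        if (name.endsWith(".html") || name.endsWith(".htm")) {
            return "text/html";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.7".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdfHead, "renamed.bin"));
        System.out.println(detect(new byte[0], "notes.txt"));
    }
}
```

Checking content before the name means a renamed file is still identified by its bytes, which is the main reason detectors combine both signals.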
Parser interface
The Parser interface in the org.apache.tika.parser package is the main interface for parsing documents in Tika. It extracts the text and metadata from a document and presents them in a uniform way, which also makes it the extension point for users who want to write parser plugins. Tika supports a large number of file formats through dedicated parser classes, one for each document type. These classes either implement the parsing logic directly or delegate to an external parser library.
Tika Facade class
Using the Tika facade class is the simplest and most direct way to call Tika from Java, and it follows the facade design pattern. The facade is the Tika class in the org.apache.tika package of the Tika API. It covers the basic use cases and acts as a single entry point to the library. It abstracts away the underlying complexity of the Tika library, such as the MIME detection mechanism, the parser interface, and the language detection mechanism, and gives users a simple interface to work with.
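The facade idea can be shown with a small plain-Java sketch. Everything in it is hypothetical: FacadeSketch, MiniFacade, and SimpleParser are stand-ins invented for this illustration and are not Tika classes, although the method names detect and parseToString deliberately mirror the style of the real facade. One object hides both the type detection and the choice of a format-specific parser.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class FacadeSketch {

    /** Hypothetical stand-in for a format-specific parser. */
    public interface SimpleParser {
        String extractText(byte[] content);
    }

    /** Hypothetical facade: callers never see the detector or the parser registry. */
    public static class MiniFacade {
        private final Map<String, SimpleParser> parsers = new HashMap<>();

        public MiniFacade() {
            // Plain text is returned as-is.
            parsers.put("text/plain",
                    content -> new String(content, StandardCharsets.UTF_8));
            // HTML: naive tag stripping, good enough for a sketch only.
            parsers.put("text/html",
                    content -> new String(content, StandardCharsets.UTF_8)
                            .replaceAll("<[^>]*>", "").trim());
        }

        /** Detects the type from the file name only (a simplification). */
        public String detect(String fileName) {
            String name = fileName.toLowerCase(Locale.ROOT);
            if (name.endsWith(".html") || name.endsWith(".htm")) {
                return "text/html";
            }
            if (name.endsWith(".txt")) {
                return "text/plain";
            }
            return "application/octet-stream";
        }

        /** Detects the type, selects the matching parser, and returns the text. */
        public String parseToString(byte[] content, String fileName) {
            SimpleParser parser = parsers.get(detect(fileName));
            if (parser == null) {
                throw new IllegalArgumentException("No parser registered for " + fileName);
            }
            return parser.extractText(content);
        }
    }

    public static void main(String[] args) {
        MiniFacade facade = new MiniFacade();
        byte[] html = "<p>Hello</p>".getBytes(StandardCharsets.UTF_8);
        System.out.println(facade.parseToString(html, "page.html"));
    }
}
```

The caller makes a single call and does not need to know which parser handled the file, which is the convenience the facade pattern is meant to provide.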
2. Code engineering
Objectives of the experiment
Convert a Word document to HTML
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>springboot-demo</artifactId>
        <groupId>com.et</groupId>
        <version>1.0-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>
    <artifactId>tika</artifactId>
    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-autoconfigure</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-parsers</artifactId>
            <version>1.17</version>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
        </dependency>
    </dependencies>
</project>
controller
package com.et.tika.controller;

import com.et.tika.convertor.WordToHtmlConverter;
import com.et.tika.dto.ConvertedDocumentDTO;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.multipart.MultipartFile;

import java.util.HashMap;
import java.util.Map;

@RestController
@Slf4j
public class HelloWorldController {

    @RequestMapping("/hello")
    public Map<String, Object> showHelloWorld() {
        Map<String, Object> map = new HashMap<>();
        map.put("msg", "HelloWorld");
        return map;
    }

    @Autowired
    WordToHtmlConverter converter;

    /**
     * Transforms the uploaded Word document into an HTML document and returns it.
     *
     * @return The content of the uploaded document as HTML.
     */
    @RequestMapping(value = "/api/word-to-html", method = RequestMethod.POST)
    public ConvertedDocumentDTO convertWordDocumentIntoHtmlDocument(@RequestParam(value = "file", required = true) MultipartFile wordDocument) {
        log.info("Converting word document into HTML document");
        ConvertedDocumentDTO htmlDocument = converter.convertWordDocumentIntoHtml(wordDocument);
        log.info("Converted word document into HTML document.");
        log.trace("The created HTML markup looks as follows: {}", htmlDocument);
        return htmlDocument;
    }
}
WordToHtmlConverter
package com.et.tika.convertor;

import com.et.tika.dto.ConvertedDocumentDTO;
import com.et.tika.exception.DocumentConversionException;
import lombok.extern.slf4j.Slf4j;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.springframework.stereotype.Component;
import org.springframework.web.multipart.MultipartFile;
import org.xml.sax.SAXException;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerException;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringWriter;

/**
 * Converts uploaded Word (.docx) documents into HTML markup by using Tika.
 */
@Component
@Slf4j
public class WordToHtmlConverter {

    /**
     * Converts a .docx document into HTML markup. This code
     * is based on <a href="http://stackoverflow.com/a/9053258/313554">this StackOverflow</a> answer.
     *
     * @param wordDocument The .docx document to convert.
     * @return The original file name together with the generated HTML markup.
     * @throws DocumentConversionException if the document cannot be read or parsed.
     */
    public ConvertedDocumentDTO convertWordDocumentIntoHtml(MultipartFile wordDocument) {
        log.info("Converting word document: {} into HTML", wordDocument.getOriginalFilename());
        // try-with-resources closes the upload stream even when parsing fails
        try (InputStream input = wordDocument.getInputStream()) {
            Parser parser = new OOXMLParser();

            // The handler serializes the SAX events emitted by the parser as HTML into sw
            StringWriter sw = new StringWriter();
            SAXTransformerFactory factory = (SAXTransformerFactory)
                    SAXTransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "utf-8");
            handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
            handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
            handler.setResult(new StreamResult(sw));

            Metadata metadata = new Metadata();
            metadata.add(Metadata.CONTENT_TYPE, "text/html;charset=utf-8");
            parser.parse(input, handler, metadata, new ParseContext());
            return new ConvertedDocumentDTO(wordDocument.getOriginalFilename(), sw.toString());
        }
        catch (IOException | SAXException | TransformerException | TikaException ex) {
            log.error("Conversion failed because an exception was thrown", ex);
            throw new DocumentConversionException(ex.getMessage(), ex);
        }
    }
}
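The converter relies on the fact that a Tika parser reports the document as a stream of SAX events, and that a JAXP TransformerHandler can serialize such events as HTML. The following sketch isolates that second half using only the JDK. The class SaxToHtmlSketch and its render method are hypothetical names for this illustration, and the SAX events are fired by hand to stand in for what a parser would emit.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

public class SaxToHtmlSketch {

    /** Serializes a single paragraph as an HTML document via SAX events. */
    public static String render(String paragraphText) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
            handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "utf-8");

            // Everything the handler receives is written to this writer.
            StringWriter out = new StringWriter();
            handler.setResult(new StreamResult(out));

            // Hand-written events standing in for a parser's output.
            AttributesImpl noAttributes = new AttributesImpl();
            handler.startDocument();
            handler.startElement("", "html", "html", noAttributes);
            handler.startElement("", "body", "body", noAttributes);
            handler.startElement("", "p", "p", noAttributes);
            char[] chars = paragraphText.toCharArray();
            handler.characters(chars, 0, chars.length);
            handler.endElement("", "p", "p");
            handler.endElement("", "body", "body");
            handler.endElement("", "html", "html");
            handler.endDocument();
            return out.toString();
        } catch (Exception ex) {
            // Wrapped so that callers of this sketch need no checked exceptions.
            throw new IllegalStateException("Serialization failed", ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(render("Hello"));
    }
}
```

In the converter above, the call to parser.parse plays the role of the hand-written events: the parser drives the handler, and the StringWriter collects the resulting markup.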
dto
package com.et.tika.dto;

import org.apache.commons.lang.builder.ToStringBuilder;

/**
 * Holds the original file name and the HTML markup generated from the uploaded document.
 */
public class ConvertedDocumentDTO {

    private final String contentAsHtml;
    private final String filename;

    public ConvertedDocumentDTO(String filename, String contentAsHtml) {
        this.contentAsHtml = contentAsHtml;
        this.filename = filename;
    }

    public String getContentAsHtml() {
        return contentAsHtml;
    }

    public String getFilename() {
        return filename;
    }

    @Override
    public String toString() {
        return new ToStringBuilder(this)
                .append("filename", this.filename)
                .append("contentAsHtml", this.contentAsHtml)
                .toString();
    }
}
Custom exceptions
package com.et.tika.exception;

/**
 * Thrown when an uploaded document cannot be converted into HTML.
 */
public final class DocumentConversionException extends RuntimeException {

    public DocumentConversionException(String message, Exception ex) {
        super(message, ex);
    }
}
The listings above show only the key parts of the code. The complete code can be found in the repository below.
Code repositories
3. Testing
Start the Spring Boot application