In this tutorial, we will see how to search a Microsoft Word document for a piece of text or a pattern using a Java program. We will use Apache POI to parse the Word document and java.util.Scanner to perform the string search, and report the number of occurrences of the input pattern as the output. This tutorial is aimed at beginners; feel free to extend the code provided in this post to suit your project requirements. The steps for this guide are listed below:
Step by Step Guide to Search Word Document in Java
1. Read Word Document in Java
To search the contents of a Word document, we first need to open the document as an object of type java.io.FileInputStream. The code to do this is straightforward, and is shown below:
/* Create a FileInputStream object to read the input MS Word Document */
FileInputStream input_document = new FileInputStream(new File("test_document.doc"));
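If you want the stream to be closed automatically and a missing file to be reported cleanly, a try-with-resources block is one option. The following is only a minimal sketch using the standard library; the class name OpenDocumentSketch and the helper openAndMeasure are made up for illustration and are not part of the tutorial program.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;

public class OpenDocumentSketch {

    /* Hypothetical helper: opens the file, reads one byte to confirm the
       stream works, and returns the file length in bytes.
       Returns -1 if the file does not exist or cannot be opened. */
    public static long openAndMeasure(File file) throws IOException {
        /* try-with-resources closes the stream even if an exception is thrown */
        try (FileInputStream input = new FileInputStream(file)) {
            input.read();
            return file.length();
        } catch (FileNotFoundException e) {
            return -1;
        }
    }

    public static void main(String[] args) throws IOException {
        /* "test_document.doc" is the file name used throughout this tutorial */
        long size = openAndMeasure(new File("test_document.doc"));
        System.out.println(size < 0 ? "File not found" : "Opened file, " + size + " bytes");
    }
}
```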
2. Parse Document Text Using Apache POI
In this step, we will use the WordExtractor class, defined in org.apache.poi.hwpf.extractor.WordExtractor, to extract the contents of the Word document. To create an object of type WordExtractor, we pass it the FileInputStream object created in Step 1. Apache POI provides this class for Word to text conversion. The code to initialize the WordExtractor object is shown below:
/* Create Word Extractor object */
WordExtractor my_word=new WordExtractor(input_document);
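WordExtractor belongs to the HWPF part of Apache POI, which reads the older binary .doc format (an OLE2 file) and rejects the newer .docx format (a ZIP package). If you are unsure which kind of file you have, you can peek at its first four bytes before handing it to POI. The sketch below uses only the standard library; the class name WordFormatSniffer and the returned labels are assumptions chosen for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class WordFormatSniffer {

    /* First four bytes of an OLE2 compound file (binary .doc) */
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0};
    /* First four bytes of a ZIP archive (.docx files are ZIP packages) */
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /* Returns "doc", "zip" or "unknown" based on the first four bytes */
    public static String sniff(InputStream in) throws IOException {
        byte[] header = new byte[4];
        int total = 0;
        /* read() may return fewer bytes than requested, so loop until full */
        while (total < header.length) {
            int n = in.read(header, total, header.length - total);
            if (n < 0) {
                break;
            }
            total += n;
        }
        if (total < header.length) {
            return "unknown";
        }
        if (matches(header, OLE2_MAGIC)) {
            return "doc";
        }
        if (matches(header, ZIP_MAGIC)) {
            return "zip";
        }
        return "unknown";
    }

    private static boolean matches(byte[] header, int[] magic) {
        for (int i = 0; i < magic.length; i++) {
            /* mask to compare the byte as an unsigned value */
            if ((header[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] fakeDoc = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, 0x00};
        System.out.println(sniff(new ByteArrayInputStream(fakeDoc)));
    }
}
```

Note that sniffing consumes bytes from the stream, so open a fresh FileInputStream for WordExtractor afterwards. A "zip" result only suggests a .docx file; any ZIP archive starts with the same bytes.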
3. Create Scanner / Define Search Pattern
Once you have created the WordExtractor object, you can pass the entire text of the Word document to the Scanner class, defined in java.util.Scanner, as a string obtained from the getText method of WordExtractor. You should also define the search pattern at this stage, using the java.util.regex.Pattern class. This also gives you the power to use regular expressions in your search. For now, let us focus on a simple example: we will count the number of times the word “search” is present in our test document. The Java code is provided below:
/* Create Scanner object */
Scanner document_scanner = new Scanner(my_word.getText());
/* Define Search Pattern - we search for the word "search" */
Pattern words = Pattern.compile("(search)");
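Because the search term is a regular expression, small changes to the pattern change what counts as a match. The sketch below compares a few variants on a sample sentence, using java.util.regex.Matcher directly rather than Scanner to keep it short. The class name PatternSketch, the helper countMatches and the sample text are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternSketch {

    /* Counts non-overlapping matches of the pattern in the text */
    public static int countMatches(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Search the document. A search finds researchers too.";

        /* Plain substring match: also matches inside "researchers" */
        Pattern plain = Pattern.compile("search");
        /* \b restricts the match to whole words */
        Pattern wholeWord = Pattern.compile("\\bsearch\\b");
        /* CASE_INSENSITIVE additionally matches "Search" */
        Pattern anyCase = Pattern.compile("\\bsearch\\b", Pattern.CASE_INSENSITIVE);
        /* Pattern.quote treats user input literally, so "." is not a wildcard */
        Pattern literal = Pattern.compile(Pattern.quote("a.b"));

        System.out.println("plain:     " + countMatches(plain, text));
        System.out.println("wholeWord: " + countMatches(wholeWord, text));
        System.out.println("anyCase:   " + countMatches(anyCase, text));
        System.out.println("literal:   " + countMatches(literal, "a.b axb"));
    }
}
```

Any of these Pattern objects could be passed to the Scanner in place of the simple pattern used in this tutorial.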
4. Perform Search / Find All Matches
Using the Scanner object created earlier, we loop through the document text line by line using the hasNextLine method. For every line, we call the findInLine method with the pattern created earlier to see if the search term is present in the scanned line. We increment the match counter by 1 for each match. The search word can occur more than once on the same line; because findInLine advances the scanner past the text it matched, we simply call it again until it returns null, and then move on to the next line. Here key is a String and count is an int initialised to 0.
/* Scan through every line */
while (document_scanner.hasNextLine()) {
    /* Search for the pattern in the scanned line */
    key = document_scanner.findInLine(words);
    while (key != null) {
        /* Increment counter for the match */
        count++;
        /* Look for further matches in the same line */
        key = document_scanner.findInLine(words);
    }
    /* Scan next line in document, if any input is left */
    if (document_scanner.hasNextLine()) {
        document_scanner.nextLine();
    }
}
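To try the counting loop of this step without a Word file, you can run it against an in-memory string. The following is a minimal sketch that relies only on findInLine to advance through each line; the class name FindInLineSketch, the helper countOccurrences and the sample text are assumptions made for this example.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

public class FindInLineSketch {

    /* Counts matches of the pattern, line by line, using Scanner.findInLine */
    public static int countOccurrences(String text, Pattern pattern) {
        int count = 0;
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNextLine()) {
                /* findInLine moves past each match and returns null
                   once the rest of the line has no further match */
                while (scanner.findInLine(pattern) != null) {
                    count++;
                }
                /* Skip the remainder of the line; the guard covers the case
                   where the last match ended exactly at the end of the input */
                if (scanner.hasNextLine()) {
                    scanner.nextLine();
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "search and search again\n"
                + "no match here\n"
                + "research ends with search";
        System.out.println("Found Input "
                + countOccurrences(text, Pattern.compile("search")) + " times");
    }
}
```

The sample text deliberately ends with a match and has no trailing newline, which is the situation the hasNextLine guard before nextLine is meant to handle.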
5. Output Search Result
You are now ready to output the search result. It is just a print of the number of times the match was found, using a simple System.out.println statement.
/* Print number of times the search pattern was found */
System.out.println("Found Input "+ count + " times");
Search Word Document using Java – Complete Program
The complete code to implement search functionality for Microsoft Word documents in Java is shown below. You can treat this as a prototype and extend it further for any search needs.
import java.io.File;
import java.io.FileInputStream;
import java.util.Scanner;
import java.util.regex.Pattern;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class SearchWord {
    public static void main(String[] args) throws Exception {
        /* Create a FileInputStream object to read the input MS Word Document */
        FileInputStream input_document = new FileInputStream(new File("test_document.doc"));
        /* Create Word Extractor object */
        WordExtractor my_word = new WordExtractor(input_document);
        input_document.close();
        /* Create Scanner object */
        Scanner document_scanner = new Scanner(my_word.getText());
        /* Define Search Pattern - we search for the word "search" */
        Pattern words = Pattern.compile("(search)");
        String key;
        int count = 0;
        /* Scan through every line */
        while (document_scanner.hasNextLine()) {
            /* Search for the pattern in the scanned line */
            key = document_scanner.findInLine(words);
            while (key != null) {
                /* Increment counter for the match */
                count++;
                /* findInLine has moved past the match; look for more in the same line */
                key = document_scanner.findInLine(words);
            }
            /* Scan next line in document, if any input is left */
            if (document_scanner.hasNextLine()) {
                document_scanner.nextLine();
            }
        }
        document_scanner.close();
        /* Print number of times the search pattern was found */
        System.out.println("Found Input " + count + " times");
    }
}
I tried this example on my Word document and it printed the match count accurately on the screen. Give it a try, and let us know if you get stuck. Note that to compile this program, you need poi-3.8.jar or an equivalent version. You also require poi-scratchpad-3.8.jar. You can download both of these from the Apache POI distribution.
Is there a way to count number of words in a word document?
Getting below exception while running this code:
Exception in thread "main" org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents. You need to call a different part of POI to process this data (eg XSSF instead of HSSF)
Do you have complete example of program that can replace a word with another word of unequal size. I program in another language and learning Java. When I tried it using cache, I am change words of the same size in a doc and docx file. When the replacement size is larger, the file gets corrupted.