Search Word Document Using Java Example

In this tutorial, we will look at how to search a Microsoft Word document for a text pattern from a Java program. We will use Apache POI to parse the Word document and java.util.Scanner to perform the search, and report the number of occurrences of the pattern as the output. This tutorial is aimed at beginners; feel free to extend the code in this post to suit your project requirements. The steps in this guide are listed below:

Step by Step Guide to Search Word Document in Java

1. Read Word Document in Java


To search the contents of a Word document, we first open it as an object of type java.io.FileInputStream. The code to do this is straightforward and is shown below:

                /* Create a FileInputStream object to read the input MS Word Document */
                FileInputStream input_document = new FileInputStream(new File("test_document.doc")); 


2. Parse Document Text Using Apache POI


In this step, we will use the WordExtractor class (org.apache.poi.hwpf.extractor.WordExtractor) to extract the contents of the Word document. To create a WordExtractor object, we pass it the FileInputStream object created in Step 1. Apache POI provides this class for converting Word documents to plain text. The code to initialize the WordExtractor object is shown below:


                /* Create Word Extractor object */
                WordExtractor my_word=new WordExtractor(input_document);                
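
Note that WordExtractor from the hwpf package reads only the older binary .doc format; a .docx file is a ZIP archive and needs a different part of POI. As a rough sketch, the first bytes of the file can be checked before parsing. The class and method names below (WordFormatSniffer, readHeader, detectFormat) and the returned labels are illustrative assumptions, not part of Apache POI:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class WordFormatSniffer {
    /* First 8 bytes of an OLE2 compound file (the container used by .doc) */
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    /* First 4 bytes of a ZIP archive (the container used by .docx) */
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    /* Read up to 8 bytes from the start of the stream */
    static byte[] readHeader(InputStream in) throws IOException {
        byte[] header = new byte[8];
        int total = 0;
        while (total < header.length) {
            int n = in.read(header, total, header.length - total);
            if (n < 0) {
                break; // end of stream reached early
            }
            total += n;
        }
        return Arrays.copyOf(header, total);
    }

    /* Return a hypothetical label: "OLE2", "ZIP" or "UNKNOWN" */
    static String detectFormat(byte[] header) {
        if (header.length >= 8 && Arrays.equals(Arrays.copyOf(header, 8), OLE2_MAGIC)) {
            return "OLE2";
        }
        if (header.length >= 4 && Arrays.equals(Arrays.copyOf(header, 4), ZIP_MAGIC)) {
            return "ZIP";
        }
        return "UNKNOWN";
    }

    public static void main(String[] args) throws IOException {
        /* In-memory stand-in for the start of a .doc file */
        byte[] fakeDoc = {
            (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
            (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1, 0, 0
        };
        try (InputStream in = new ByteArrayInputStream(fakeDoc)) {
            System.out.println(detectFormat(readHeader(in)));
        }
    }
}
```

For a real file, you would pass a FileInputStream to readHeader and then open a fresh stream for WordExtractor, since the check consumes the first bytes. A "ZIP" result only suggests a .docx file; other ZIP-based formats start with the same bytes.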

3. Create Scanner / Define Search Pattern


Once you have created the WordExtractor object, you can get the entire document text as a string with its getText method and pass it to the Scanner class, defined in java.util.Scanner. You should also define the search pattern at this stage, using the java.util.regex.Pattern class. This gives you the power to use regular expressions in your search. For now, let us focus on a simple example: we will count the number of times the word “search” appears in our test document. The Java code is provided below:

                /* Create Scanner object */             
                Scanner document_scanner = new Scanner(my_word.getText());
                /* Define Search Pattern - we look for the word "search" */
                Pattern words = Pattern.compile("(search)");
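
Keep in mind that a plain pattern such as "search" is case-sensitive and also matches inside longer words like "research". The sketch below compares three variants on a sample sentence using the standard word-boundary token \b and the Pattern.CASE_INSENSITIVE flag. The class name PatternVariants, the helper countMatches, and the sample text are made up for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternVariants {
    /* Count non-overlapping matches of the pattern in the text */
    static int countMatches(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Search the research notes, then search again.";

        /* Plain substring, case-sensitive: hits "research" and "search" */
        Pattern substring = Pattern.compile("search");
        /* Whole word only, case-sensitive: hits "search" */
        Pattern wholeWord = Pattern.compile("\\bsearch\\b");
        /* Whole word, ignoring case: hits "Search" and "search" */
        Pattern anyCase = Pattern.compile("\\bsearch\\b", Pattern.CASE_INSENSITIVE);

        System.out.println(countMatches(substring, text)); // prints 2
        System.out.println(countMatches(wholeWord, text)); // prints 1
        System.out.println(countMatches(anyCase, text));   // prints 2
    }
}
```

Any of these Pattern objects could be used in place of the simple pattern in this tutorial, depending on what you want to count as a match.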


4. Perform Search / Find All Matches


Using the Scanner object created earlier, we loop through the document text with the hasNextLine method. For every line, we call the findInLine method with the pattern created earlier to see whether the search term is present in that line. For each match we increment the match counter by 1; otherwise we move on to the next line. The search word can occur more than once in the same line. Because findInLine advances the scanner past each match it returns, we call it again until it returns null to count all occurrences in the line.

                /* Holder for the current match and the match counter */
                String key;
                int count = 0;
                /* Scan through every line */
                while (document_scanner.hasNextLine()) {
                        /* Search for the pattern in the scanned line */
                        key = document_scanner.findInLine(words);
                        while (key != null) {
                                /* Increment counter for the match */
                                count++;
                                /* The scanner is now past the match; look for the next one in the same line */
                                key = document_scanner.findInLine(words);
                        }
                        /* Move to the next line, if any input is left */
                        if (document_scanner.hasNextLine()) {
                                document_scanner.nextLine();
                        }
                }
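
To try the loop without a Word document, the same technique can be run on an in-memory string. This is a minimal sketch: the class name FindInLineDemo, the method countWithScanner, and the sample text are assumptions made for illustration.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

class FindInLineDemo {
    /* Count matches line by line using Scanner.findInLine */
    static int countWithScanner(String text, Pattern pattern) {
        int count = 0;
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNextLine()) {
                /* Each successful call moves the scanner past the match */
                while (scanner.findInLine(pattern) != null) {
                    count++;
                }
                /* Skip the rest of the line, if any input is left */
                if (scanner.hasNextLine()) {
                    scanner.nextLine();
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "search and search again\nnothing here\nlast search";
        Pattern words = Pattern.compile("search");
        System.out.println("Found Input " + countWithScanner(text, words) + " times");
    }
}
```

The sample text contains two matches on the first line, none on the second, and one on the last line, which has no trailing line break. The second hasNextLine check is what keeps nextLine from failing in that last case.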

5. Output Search Result


You are now ready to output the search result. This is simply a print of the number of times a match was found, using a System.out.println statement.

                        /* Print number of times the search pattern was found */
                        System.out.println("Found Input "+ count + " times");


Search Word Document using Java – Complete Program


The complete code to implement a search functionality in Microsoft Word documents using Java language is shown below. You can treat this as a prototype and extend it further for any search needs.

import java.io.File;
import java.io.FileInputStream;
import java.util.Scanner;
import java.util.regex.Pattern;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class SearchWord {
        public static void main(String[] args) throws Exception {
                /* Create a FileInputStream object to read the input MS Word Document */
                FileInputStream input_document = new FileInputStream(new File("test_document.doc"));
                /* Create Word Extractor object */
                WordExtractor my_word = new WordExtractor(input_document);
                input_document.close();
                /* Create Scanner object */
                Scanner document_scanner = new Scanner(my_word.getText());
                /* Define Search Pattern - we look for the word "search" */
                Pattern words = Pattern.compile("(search)");
                String key;
                int count = 0;
                /* Scan through every line */
                while (document_scanner.hasNextLine()) {
                        /* Search for the pattern in the scanned line */
                        key = document_scanner.findInLine(words);
                        while (key != null) {
                                /* Increment counter for the match */
                                count++;
                                /* The scanner is now past the match; look for the next one in the same line */
                                key = document_scanner.findInLine(words);
                        }
                        /* Move to the next line, if any input is left */
                        if (document_scanner.hasNextLine()) {
                                document_scanner.nextLine();
                        }
                }
                document_scanner.close();
                /* Print number of times the search pattern was found */
                System.out.println("Found Input " + count + " times");
        }
}

I tried this example on my Word document and it printed the match count accurately on the screen. Give it a try, and let us know if you get stuck. Note that to compile this program, you need poi-3.8.jar or an equivalent version. You also require poi-scratchpad-3.8.jar. You can download both from the Apache POI distribution.

3 comments:

  1. Is there a way to count number of words in a word document?

  2. Getting below exception while running this code:
    Exception in thread "main" org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents. You need to call a different part of POI to process this data (eg XSSF instead of HSSF)

    1. Do you have a complete example of a program that can replace a word with another word of unequal size? I program in another language and am learning Java. When I tried it using cache, I could change words of the same size in a doc and docx file. When the replacement size is larger, the file gets corrupted.
