This article is about indexing and searching documents with Apache Lucene version 4.7. Before jumping to the example and its explanation, let's see what Apache Lucene is.
Introduction to Apache Lucene
Lucene is a high-performance, scalable information retrieval (IR) library. IR refers to the process of searching for documents, information within documents, or metadata about documents. Lucene lets you add searching capabilities to your application. [ref. Lucene in Action, Second Edition, which covers Apache Lucene v3.0]
The main reason for Lucene's popularity is its simplicity. You don't need in-depth knowledge of the indexing and searching process to get started with Lucene. You can start by learning a handful of classes that actually do the indexing and searching. At the time of writing, the latest released version is 4.7, while books are available only for v3.0.
Important note
Lucene is not a ready-to-use application like a file-search program, a web crawler, or a search engine. It is a software toolkit, or library, with which you can build your own search applications or libraries. There are many frameworks built on top of the Lucene Core API for searching.
The examples in this article were built with the following tools and libraries:
- Eclipse Kepler
- JDK 1.7
- lucene-core-4.7.2.jar
- lucene-queryparser-4.7.2.jar
- lucene-demo-4.7.2.jar
- lucene-analyzers-common-4.7.2.jar
Indexing with Lucene
Let's jump into the indexing process in Lucene with an example, and then explain the classes that are used and their purpose.
1. IndexerTest is the class used to run the demo.
package lucene.indexer;

import java.io.File;
import java.io.FileFilter;

/**
 * @author Gaurav Rai Mazra
 */
public class IndexerTest {
    public static void main(String[] args) throws Exception {
        String indexDir = "index";
        String dataDir = "dir";
        long start = System.currentTimeMillis();
        final IndexingHelper indexHelper = new IndexingHelper(indexDir);
        int numIndexed;
        try {
            numIndexed = indexHelper.index(dataDir, new TextFilesFilter());
        } finally {
            indexHelper.close();
        }
        long end = System.currentTimeMillis();
        System.out.println("Indexing " + numIndexed + " files took " + (end - start) + " milliseconds");
    }
}

// class filters only .txt files for indexing
class TextFilesFilter implements FileFilter {
    @Override
    public boolean accept(File pathname) {
        return pathname.getName().toLowerCase().endsWith(".txt");
    }
}
2. IndexingHelper is the class that shows how to do the indexing.
package lucene.indexer;

import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * @author Gaurav Rai Mazra
 */
public class IndexingHelper {
    // class which actually creates and maintains the index on disk
    private IndexWriter indexWriter;

    public IndexingHelper(String indexDir) throws Exception {
        // represents the actual directory that holds the index
        Directory directory = FSDirectory.open(new File(indexDir));
        // holds the configuration required to create the IndexWriter object
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47));
        indexWriter = new IndexWriter(directory, indexWriterConfig);
    }

    public void close() throws IOException {
        indexWriter.close();
    }

    // exposed method to index files
    public int index(String dataDir, FileFilter fileFilter) throws Exception {
        File[] files = new File(dataDir).listFiles();
        if (files == null) {
            // listFiles() returns null when dataDir is not a readable directory
            throw new IOException("Cannot list files in " + dataDir);
        }
        for (File f : files) {
            if (!f.isDirectory() && !f.isHidden() && f.exists() && f.canRead() && (fileFilter == null || fileFilter.accept(f)))
                indexFile(f);
        }
        return indexWriter.numDocs();
    }

    private void indexFile(File f) throws Exception {
        System.out.println(" " + f.getCanonicalPath());
        Document doc = getDocument(f);
        indexWriter.addDocument(doc);
    }

    private Document getDocument(File f) throws Exception {
        // class used by Lucene's IndexWriter and IndexReader to store and retrieve indexed data
        Document document = new Document();
        // file contents are tokenized and indexed, but not stored
        document.add(new TextField("contents", new FileReader(f)));
        // file name and path are each indexed as a single token and stored
        document.add(new StringField("filename", f.getName(), Field.Store.YES));
        document.add(new StringField("fullpath", f.getCanonicalPath(), Field.Store.YES));
        return document;
    }
}
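The file-selection test inside index() can be tried on its own, without Lucene. The following is a minimal sketch that uses only the JDK. FileSelectionSketch and selectFiles are hypothetical names introduced here for illustration, and TextFilesFilter is repeated from IndexerTest so that the example is self-contained.

```java
import java.io.File;
import java.io.FileFilter;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

// Same filter as in IndexerTest: accepts names ending in ".txt", ignoring case
class TextFilesFilter implements FileFilter {
    @Override
    public boolean accept(File pathname) {
        return pathname.getName().toLowerCase().endsWith(".txt");
    }
}

// Hypothetical helper class; not part of the article's code or of Lucene
class FileSelectionSketch {
    // Returns the files that index() would hand over to indexFile()
    static List<File> selectFiles(File dataDir, FileFilter fileFilter) {
        List<File> selected = new ArrayList<File>();
        File[] files = dataDir.listFiles();
        if (files == null) {
            return selected; // not a directory, or it could not be read
        }
        for (File f : files) {
            if (!f.isDirectory() && !f.isHidden() && f.exists() && f.canRead()
                    && (fileFilter == null || fileFilter.accept(f))) {
                selected.add(f);
            }
        }
        return selected;
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("lucene-data").toFile();
        new File(dir, "notes.txt").createNewFile();
        new File(dir, "README.TXT").createNewFile();
        new File(dir, "image.png").createNewFile();
        new File(dir, "nested.txt").mkdir(); // a directory, so it is skipped

        // Lists the two text files; the order depends on the file system
        for (File f : selectFiles(dir, new TextFilesFilter())) {
            System.out.println(f.getName());
        }
    }
}
```

Note that sub-directories are skipped rather than descended into, so only files directly inside the data directory are indexed.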
In the IndexingHelper class, we have used the following classes of the Lucene library for indexing .txt files.
- IndexWriter class.
- IndexWriterConfig class.
- Directory class.
- FSDirectory class.
- Document class.
Explanation
1. IndexWriter: It is the central component of the indexing process. This class creates a new index or opens an existing one, and adds, removes, and updates documents in the index. It has one public constructor, which takes a Directory object and an IndexWriterConfig object as parameters. It exposes several methods for adding Document objects to the index, methods for deleting documents from the index, and informational methods such as numDocs(), which returns the number of documents in the index, including documents that have not yet been flushed to disk. Deletions that are still buffered are not reflected in that count.
2. IndexWriterConfig: It holds the configuration required to create an IndexWriter object. It has one public constructor, which takes two parameters. The first is a value of the Version enum, i.e. the Lucene version to match for compatibility. The second is an Analyzer object. Analyzer is itself an abstract class with many implementations, such as WhitespaceAnalyzer and StandardAnalyzer, which turn text into tokens during the analysis process.
3. Directory: The Directory class represents the location of a Lucene index. It is an abstract class with several concrete implementations, and no single implementation is best for every platform. Hence, use the static FSDirectory.open() method of the abstract FSDirectory class, which picks the best available concrete implementation for your environment.
4. Analyzer: Before any text is indexed, it is passed to an Analyzer, which extracts the tokens that should be indexed and discards the rest.
5. Document: The Document class represents a collection of fields. It is a chunk of data that we want to index and make retrievable at a later time.
6. Field: Each document has one or more fields. Each field has a name and a corresponding value. Many of the Field class's constructors are deprecated, so it is preferable to use one of its subclasses, such as IntField, LongField, FloatField, DoubleField, BinaryDocValuesField, NumericDocValuesField, SortedDocValuesField, StringField, TextField, or StoredField.
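To make the roles of Analyzer, Document and IndexWriter more concrete, here is a toy, in-memory sketch written with plain JDK classes. It is not how Lucene is implemented and it uses none of Lucene's API: ToyIndexSketch, analyze, addDocument, docsContaining and the tiny stop-word list are all assumptions made for illustration. The sketch lowercases the text, splits it on non-alphanumeric characters, drops stop words, and records for each remaining token the ids of the documents that contain it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy illustration only; Lucene's real index structures are far more elaborate
class ToyIndexSketch {
    // Assumed stop-word list for this example
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "the", "is", "of", "and"));

    // token -> ids of the documents containing that token (an inverted index)
    private final Map<String, Set<Integer>> postings = new TreeMap<String, Set<Integer>>();
    private int nextDocId = 0;

    // Plays the role of an analyzer: text in, index tokens out
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    // Plays the role of addDocument: analyzes the text and updates the postings
    int addDocument(String contents) {
        int docId = nextDocId++;
        for (String token : analyze(contents)) {
            Set<Integer> docs = postings.get(token);
            if (docs == null) {
                docs = new TreeSet<Integer>();
                postings.put(token, docs);
            }
            docs.add(docId);
        }
        return docId;
    }

    // Number of documents added so far (this toy has no deletions)
    int numDocs() {
        return nextDocId;
    }

    Set<Integer> docsContaining(String token) {
        Set<Integer> docs = postings.get(token);
        return docs == null ? new TreeSet<Integer>() : docs;
    }

    public static void main(String[] args) {
        ToyIndexSketch index = new ToyIndexSketch();
        index.addDocument("The direwolf is a symbol of House Stark");
        index.addDocument("A direwolf and the raven");
        System.out.println(analyze("The direwolf is a symbol")); // prints [direwolf, symbol]
        System.out.println(index.docsContaining("direwolf"));    // prints [0, 1]
        System.out.println(index.numDocs());                     // prints 2
    }
}
```

Because only the analyzed tokens end up in the index, a query has to go through the same kind of analysis as the indexed text in order to match it.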
Searching with Lucene
Let's jump into searching with Lucene, and then explain the classes used.
package lucene.searcher;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * @author Gaurav Rai Mazra
 */
public class SearcherTest {
    public static void main(String[] args) throws IOException, ParseException {
        String indexDir = "index";
        String q = "direwolf";
        search(indexDir, q);
    }

    // search in the Lucene index
    private static void search(String indexDir, String q) throws IOException, ParseException {
        // get the directory to search in
        Directory directory = FSDirectory.open(new File(indexDir));
        // get a reader to read the directory
        IndexReader indexReader = DirectoryReader.open(directory);
        try {
            // create the IndexSearcher
            IndexSearcher is = new IndexSearcher(indexReader);
            // create the analyzer used to analyse the query text
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_47);
            // create the query parser
            QueryParser queryParser = new QueryParser(Version.LUCENE_47, "contents", analyzer);
            // get the query
            Query query = queryParser.parse(q);
            //Query query1 = new TermQuery(new Term("contents", q));
            long start = System.currentTimeMillis();
            // run the query, asking for the 10 best hits
            TopDocs hits = is.search(query, 10);
            long end = System.currentTimeMillis();
            System.err.println("Found " + hits.totalHits + " document(s) in " + (end - start) + " milliseconds");
            for (ScoreDoc scoreDoc : hits.scoreDocs) {
                Document document = is.doc(scoreDoc.doc);
                System.out.println(document.get("fullpath"));
            }
        } finally {
            // release the files held open by the reader
            indexReader.close();
        }
    }
}

Explanation
1. IndexReader: This is an abstract class providing an interface for accessing an index. To obtain a concrete implementation, the helper class DirectoryReader is used: its static open method takes a Directory reference and returns an IndexReader object.
2. IndexSearcher: IndexSearcher is used to search the data indexed by IndexWriter. You can think of IndexSearcher as a class that opens the index in read-only mode. It requires an IndexReader instance to be created, and it has methods for searching and for retrieving documents.
3. QueryParser: This class parses a query string and generates a Query object from it.
4. Query: It is an abstract class that represents the query to be used in searching. It has many concrete subclasses, such as TermQuery, BooleanQuery, and PhraseQuery. It also has several utility methods, one of which is setBoost(float).
5. TopDocs: It represents the hits returned by the search method of IndexSearcher. It has one public constructor, which takes three parameters: int totalHits, ScoreDoc[] scoreDocs, float maxScore. Each ScoreDoc contains the score and the document id of a matching document.
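The relationship between totalHits and scoreDocs can be illustrated without Lucene. The following toy sketch uses only the JDK. The class ToySearchSketch, its nested Hit and Result types, and the scoring rule (score = number of occurrences of the query term) are assumptions chosen for simplicity and are not Lucene's scoring model. As with search(query, 10), the caller asks for at most n hits, while the total count covers every matching document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Toy illustration only; none of these types belong to Lucene
class ToySearchSketch {
    // Counterpart of ScoreDoc: a document id and its score
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // Counterpart of TopDocs: total number of matches plus the best n hits
    static final class Result {
        final int totalHits;
        final List<Hit> hits;
        Result(int totalHits, List<Hit> hits) { this.totalHits = totalHits; this.hits = hits; }
    }

    // Assumed scoring rule: how often the (lowercase) term occurs in the document
    static float score(String document, String term) {
        int count = 0;
        for (String token : document.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    static Result search(List<String> documents, String term, int n) {
        List<Hit> matches = new ArrayList<Hit>();
        for (int doc = 0; doc < documents.size(); doc++) {
            float s = score(documents.get(doc), term);
            if (s > 0) {
                matches.add(new Hit(doc, s));
            }
        }
        // Highest score first; ties keep ascending document order (stable sort)
        Collections.sort(matches, new Comparator<Hit>() {
            @Override
            public int compare(Hit a, Hit b) {
                return Float.compare(b.score, a.score);
            }
        });
        List<Hit> top = new ArrayList<Hit>(matches.subList(0, Math.min(n, matches.size())));
        return new Result(matches.size(), top);
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "direwolf",
                "the raven",
                "direwolf direwolf direwolf",
                "a direwolf and another direwolf");
        Result result = search(docs, "direwolf", 2);
        System.out.println("Found " + result.totalHits + " document(s)"); // prints Found 3 document(s)
        for (Hit hit : result.hits) {
            // prints doc=2 score=3.0, then doc=3 score=2.0
            System.out.println("doc=" + hit.doc + " score=" + hit.score);
        }
    }
}
```

Three documents match, but only the two best are returned, which mirrors how hits.totalHits in SearcherTest can be larger than the length of hits.scoreDocs.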