To update a document you must first delete it, close the index, and add the new version again; there is no in-place update.
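As a rough illustration, here is a minimal sketch of that delete-then-add cycle, written against the Lucene 3.x-era API (constructor and Field signatures differ between Lucene versions). The index path and the "id"/"contents" field names and values are made up for the example; newer writers also offer updateDocument, which performs the two steps atomically.

```java
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class UpdateSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical index location; Lucene 3.x-style constructors.
        Directory dir = FSDirectory.open(new File("/tmp/index"));
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);

        // The replacement document, keyed by an application-chosen "id" field.
        Document doc = new Document();
        doc.add(new Field("id", "42", Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("contents", "new text of the document",
                Field.Store.YES, Field.Index.ANALYZED));

        // "Update" = delete every document matching the term, then add the new one.
        writer.deleteDocuments(new Term("id", "42"));
        writer.addDocument(doc);
        // writer.updateDocument(new Term("id", "42"), doc) does both steps atomically.

        writer.close(); // commits the changes
    }
}
```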
The Analyzer, specified in the IndexWriter, extracts the tokens to be indexed. There is a default analyzer for English text; for multilingual content, custom analyzers are needed. Before analysis can be done, documents in formats like PDF and DOC have to be parsed into plain text.

A Term is the basic unit of searching. Similar to the Field object, it consists of a pair of string elements: the name of the field and the value of that field, i.e. a term is defined as the pair <fieldname, text>. A term vector is a collection of terms.

The inverted index maps terms to documents: for each term T, it stores the set of all documents containing that term. The duty of the analyzer is therefore to find the terms in the documents and create a token stream from which this mapping can be built (a toy version of such an index is sketched below). Terms are stored in segments, and they are kept sorted.

The term frequency tells how well a term describes the document's contents, but a term that appears in many documents is not very useful for filtering. By Zipf's law, the Kth most frequent term has a frequency of approximately 1/K; so, for example, for 100 tokens the index will contain about 50% of the text.
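To make the term-to-documents mapping concrete, here is a toy in-memory inverted index in plain Java. It is not Lucene code: the class and method names are invented, and "analysis" is reduced to lowercasing and whitespace splitting.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/**
 * A toy, in-memory inverted index (illustrative only, not Lucene code).
 * Each token of a document's contents becomes a posting.
 */
public class ToyInvertedIndex {
    // term text -> ids of documents containing that term;
    // TreeMap keeps the terms sorted, as Lucene does within a segment
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    public void addDocument(int docId, String contents) {
        // crude stand-in for the analyzer: lowercase + split on whitespace
        for (String token : contents.toLowerCase().split("\\s+")) {
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    /** For a term T, return the set of all documents containing T. */
    public Set<Integer> documentsContaining(String term) {
        return postings.getOrDefault(term, Collections.emptySet());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument(1, "Lucene is a search library");
        index.addDocument(2, "Lucene indexes documents into segments");
        System.out.println(index.documentsContaining("lucene"));   // [1, 2]
        System.out.println(index.documentsContaining("segments")); // [2]
    }
}
```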
Indexing strategies can be chosen from:
- Batch based: like simple file parsing and sorting
- B-tree based: similar to the indexing done by file systems and databases; as it is a tree, updates can be done in place
- Segment based: the common approach, in which the index is created from lots of small indexes
The algorithm used for Lucene indexing can be:
- indexing a single document and merging a set of indexes
- an incremental algorithm in which there is a stack of segments and new indexes are pushed onto the stack (segment based); a toy sketch of this scheme follows
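Here is a rough sketch of that stack-based scheme, in the spirit of Doug Cutting's lecture rather than Lucene's actual code. The class and constants are invented, segments are modeled only by their size in documents, and the real writer's in-RAM buffering of documents before flushing a segment is ignored: whenever mergeFactor equal-sized segments accumulate on top of the stack, they are merged into one segment mergeFactor times as large.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of incremental, stack-based segment merging (not Lucene code). */
public class SegmentStack {
    private static final int MERGE_FACTOR = 10; // Lucene calls this mergeFactor
    private final Deque<Integer> segments = new ArrayDeque<>(); // sizes, top first

    /** Index one document: push a size-1 segment, then merge as needed. */
    public void addDocument() {
        segments.push(1);
        maybeMerge();
    }

    /** While the top MERGE_FACTOR segments have equal size, merge them. */
    private void maybeMerge() {
        while (topRunLength() >= MERGE_FACTOR) {
            int size = segments.peek();
            for (int i = 0; i < MERGE_FACTOR; i++) {
                segments.pop();
            }
            segments.push(size * MERGE_FACTOR); // one merged, larger segment
        }
    }

    /** Length of the run of equal-sized segments at the top of the stack. */
    private int topRunLength() {
        if (segments.isEmpty()) return 0;
        int top = segments.peek(), run = 0;
        for (int size : segments) {
            if (size != top) break;
            run++;
        }
        return run;
    }

    public static void main(String[] args) {
        SegmentStack index = new SegmentStack();
        for (int i = 0; i < 1234; i++) index.addDocument();
        // Prints the segment sizes, top of stack first:
        // [1, 1, 1, 1, 10, 10, 10, 100, 100, 1000]
        System.out.println(index.segments);
    }
}
```

Note how the final stack mirrors the base-10 digits of the document count: merging keeps the number of segments logarithmic in the number of documents, while each new document only ever touches the small segments near the top.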
References
- Doug Cutting (the creator of Lucene), lecture
- http://docs.huihoo.com/apache/apachecon/us2007/AdvancedIndexingLucene.ppt
- http://www.im.ntu.edu.tw/~b90003/Lucene.ppt
- Lucene in Action