Package org.apache.lucene.classification
Class BM25NBClassifier
java.lang.Object
    org.apache.lucene.classification.BM25NBClassifier

All Implemented Interfaces:
Classifier<BytesRef>
public class BM25NBClassifier extends java.lang.Object implements Classifier<BytesRef>
A classifier that approximates a naive Bayes classifier by running plain queries scored with BM25.
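For orientation, the naive Bayes decision rule that this class approximates can be sketched in plain Java. This is a textbook multinomial naive Bayes with add-one smoothing, not the BM25-based computation this class performs; `NaiveBayesSketch` and all of its members are hypothetical names that do not exist in Lucene.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical textbook naive Bayes scorer; NOT part of Lucene. */
final class NaiveBayesSketch {
  // class label -> number of training documents with that label
  private final Map<String, Integer> docsPerClass = new HashMap<>();
  // class label -> (token -> occurrences of the token in that class)
  private final Map<String, Map<String, Integer>> tokenCounts = new HashMap<>();
  // class label -> total number of tokens seen for that class
  private final Map<String, Integer> tokensPerClass = new HashMap<>();
  private final Set<String> vocabulary = new HashSet<>();
  private int totalDocs = 0;

  /** Records one already-tokenized training document for the given label. */
  void train(String label, List<String> tokens) {
    totalDocs++;
    docsPerClass.merge(label, 1, Integer::sum);
    Map<String, Integer> counts = tokenCounts.computeIfAbsent(label, k -> new HashMap<>());
    for (String token : tokens) {
      counts.merge(token, 1, Integer::sum);
      tokensPerClass.merge(label, 1, Integer::sum);
      vocabulary.add(token);
    }
  }

  /**
   * Returns log P(label) + sum of log P(token | label), with add-one smoothing.
   * The label must have been passed to train() at least once.
   */
  double logScore(String label, List<String> tokens) {
    double score = Math.log((double) docsPerClass.get(label) / totalDocs);
    Map<String, Integer> counts = tokenCounts.get(label);
    double denominator = tokensPerClass.getOrDefault(label, 0) + vocabulary.size();
    for (String token : tokens) {
      score += Math.log((counts.getOrDefault(token, 0) + 1) / denominator);
    }
    return score;
  }

  /** Returns the trained label with the highest log score, or null if untrained. */
  String classify(List<String> tokens) {
    String best = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (String label : docsPerClass.keySet()) {
      double score = logScore(label, tokens);
      if (score > bestScore) {
        bestScore = score;
        best = label;
      }
    }
    return best;
  }
}
```

Working in log space avoids numeric underflow when many small per-token probabilities are multiplied together.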
-
-
Field Summary
private Analyzer analyzer
    Analyzer to be used for tokenizing unseen input text
private java.lang.String classFieldName
    name of the field to be used as a class / category output
private IndexReader indexReader
    IndexReader used to access the Classifier's index
private IndexSearcher indexSearcher
    IndexSearcher to run searches on the index for retrieving frequencies
private Query query
    Query used to optionally filter the set of documents used for classification
private java.lang.String[] textFieldNames
    names of the fields to be used as input text
-
Constructor Summary
BM25NBClassifier(IndexReader indexReader, Analyzer analyzer, Query query, java.lang.String classFieldName, java.lang.String... textFieldNames)
    Creates a new NaiveBayes classifier.
-
Method Summary
ClassificationResult<BytesRef> assignClass(java.lang.String inputDocument)
    Assign a class (with score) to the given text String
private java.util.List<ClassificationResult<BytesRef>> assignClassNormalizedList(java.lang.String inputDocument)
    Calculate probabilities for all classes for a given input text
private double calculateLogLikelihood(java.lang.String[] tokens, Term term)
private double calculateLogPrior(Term term)
java.util.List<ClassificationResult<BytesRef>> getClasses(java.lang.String text)
    Get all the classes (sorted by score, descending) assigned to the given text String.
java.util.List<ClassificationResult<BytesRef>> getClasses(java.lang.String text, int max)
    Get the first max classes (sorted by score, descending) assigned to the given text String.
private double getTermProbForClass(Term classTerm, java.lang.String... words)
private java.util.ArrayList<ClassificationResult<BytesRef>> normClassificationResults(java.util.List<ClassificationResult<BytesRef>> assignedClasses)
    Normalize the classification results based on the max score available
private java.lang.String[] tokenize(java.lang.String text)
    tokenize a String on this classifier's text fields and analyzer
-
-
-
Field Detail
-
indexReader
private final IndexReader indexReader
IndexReader used to access the Classifier's index
-
textFieldNames
private final java.lang.String[] textFieldNames
names of the fields to be used as input text
-
classFieldName
private final java.lang.String classFieldName
name of the field to be used as a class / category output
-
indexSearcher
private final IndexSearcher indexSearcher
IndexSearcher to run searches on the index for retrieving frequencies
-
-
Constructor Detail
-
BM25NBClassifier
public BM25NBClassifier(IndexReader indexReader, Analyzer analyzer, Query query, java.lang.String classFieldName, java.lang.String... textFieldNames)
Creates a new NaiveBayes classifier.
Parameters:
indexReader - the reader on the index to be used for classification
analyzer - an Analyzer used to analyze unseen text
query - a Query to optionally filter the docs used for training the classifier, or null if all the indexed docs should be used
classFieldName - the name of the field used as the output for the classifier. NOTE: this field must not be heavily analyzed, as the returned class will be a token indexed for this field
textFieldNames - the names of the fields used as the inputs for the classifier; per-field boosting is NOT supported
-
-
Method Detail
-
assignClass
public ClassificationResult<BytesRef> assignClass(java.lang.String inputDocument) throws java.io.IOException
Description copied from interface: Classifier
Assign a class (with score) to the given text String
Specified by:
assignClass in interface Classifier<BytesRef>
Parameters:
inputDocument - a String containing text to be classified
Returns:
a ClassificationResult holding assigned class of type T and score
Throws:
java.io.IOException - If there is a low-level I/O error.
-
getClasses
public java.util.List<ClassificationResult<BytesRef>> getClasses(java.lang.String text) throws java.io.IOException
Description copied from interface: Classifier
Get all the classes (sorted by score, descending) assigned to the given text String.
Specified by:
getClasses in interface Classifier<BytesRef>
Parameters:
text - a String containing text to be classified
Returns:
the whole list of ClassificationResult, the classes and scores. Returns null if the classifier can't make lists.
Throws:
java.io.IOException - If there is a low-level I/O error.
-
getClasses
public java.util.List<ClassificationResult<BytesRef>> getClasses(java.lang.String text, int max) throws java.io.IOException
Description copied from interface: Classifier
Get the first max classes (sorted by score, descending) assigned to the given text String.
Specified by:
getClasses in interface Classifier<BytesRef>
Parameters:
text - a String containing text to be classified
max - the maximum number of elements in the returned list
Returns:
the list of ClassificationResult, the classes and scores, truncated to at most max elements. Returns null if the classifier can't make lists.
Throws:
java.io.IOException - If there is a low-level I/O error.
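The "sorted by score, descending, first max" contract can be illustrated with a small stand-alone sketch. It only demonstrates the ordering and truncation described above; `TopClassesSketch` and its nested `Scored` pair are hypothetical stand-ins and are not how this class is known to be implemented.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper illustrating "top max by score"; NOT part of Lucene. */
final class TopClassesSketch {
  /** Minimal stand-in for a (class, score) result pair. */
  static final class Scored {
    final String label;
    final double score;

    Scored(String label, double score) {
      this.label = label;
      this.score = score;
    }
  }

  /** Returns a new list with at most max results, highest score first. */
  static List<Scored> top(List<Scored> results, int max) {
    List<Scored> sorted = new ArrayList<>(results); // leave the input untouched
    sorted.sort(Comparator.comparingDouble((Scored s) -> s.score).reversed());
    int limit = Math.min(Math.max(max, 0), sorted.size()); // clamp to a valid range
    return new ArrayList<>(sorted.subList(0, limit));
  }
}
```

Asking for more elements than exist simply returns every result, so callers do not need to know the number of classes in advance.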
-
assignClassNormalizedList
private java.util.List<ClassificationResult<BytesRef>> assignClassNormalizedList(java.lang.String inputDocument) throws java.io.IOException
Calculate probabilities for all classes for a given input text
Parameters:
inputDocument - the input text as a String
Returns:
a List of ClassificationResult, one for each existing class
Throws:
java.io.IOException - if assigning probabilities fails
-
normClassificationResults
private java.util.ArrayList<ClassificationResult<BytesRef>> normClassificationResults(java.util.List<ClassificationResult<BytesRef>> assignedClasses)
Normalize the classification results based on the max score available
Parameters:
assignedClasses - the list of assigned classes
Returns:
the normalized results
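One common way to normalize scores relative to the maximum is the log-sum-exp trick, sketched below. It assumes the raw scores are log-probabilities, which this page does not state, so treat it as an illustration of the idea and not as the method's actual body; `NormalizeSketch` is a hypothetical name.

```java
/** Hypothetical max-based normalization of log scores; NOT part of Lucene. */
final class NormalizeSketch {
  /**
   * Converts log scores into values that sum to 1.
   * Subtracting the maximum before exponentiating keeps exp() from underflowing.
   */
  static double[] normalize(double[] logScores) {
    if (logScores.length == 0) {
      return new double[0];
    }
    double max = Double.NEGATIVE_INFINITY;
    for (double s : logScores) {
      max = Math.max(max, s);
    }
    double sum = 0.0;
    for (double s : logScores) {
      sum += Math.exp(s - max); // each term is in (0, 1]
    }
    double logTotal = max + Math.log(sum); // equals log(sum(exp(s)))
    double[] normalized = new double[logScores.length];
    for (int i = 0; i < logScores.length; i++) {
      normalized[i] = Math.exp(logScores[i] - logTotal);
    }
    return normalized;
  }
}
```

With this approach even very negative log scores, such as -1000, normalize cleanly, whereas exponentiating them directly would underflow to zero.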
-
tokenize
private java.lang.String[] tokenize(java.lang.String text) throws java.io.IOException
tokenize a String on this classifier's text fields and analyzer
Parameters:
text - the String representing an input text (to be classified)
Returns:
a String array of the resulting tokens
Throws:
java.io.IOException - if tokenization fails
-
calculateLogLikelihood
private double calculateLogLikelihood(java.lang.String[] tokens, Term term) throws java.io.IOException
Throws:
java.io.IOException
-
getTermProbForClass
private double getTermProbForClass(Term classTerm, java.lang.String... words) throws java.io.IOException
Throws:
java.io.IOException
-
calculateLogPrior
private double calculateLogPrior(Term term) throws java.io.IOException
Throws:
java.io.IOException
-
-