(How) Can I use Bigram Features with the OpenNLP Document Classifier?
I have a collection of short documents (titles, phrases, and sentences), and I would like to add bigram features of the kind used in the tool LibShortText:
http://www.csie.ntu.edu.tw/~cjlin/libshorttext/
Is this possible?
The documentation only explains how to do this for the name finder, using BigramNameFeatureGenerator(), and not for the document classifier.
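I believe the trainer and the classifier both allow custom FeatureGenerators in their methods, but each one must be an implementation of the doccat FeatureGenerator interface, and BigramNameFeatureGenerator is not an implementation of that. So I made a quick implementation as an inner class below. Try this (untested) code when you get a chance.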
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Collection;
    import java.util.List;
    import opennlp.tools.doccat.DoccatModel;
    import opennlp.tools.doccat.DocumentCategorizerME;
    import opennlp.tools.doccat.DocumentSample;
    import opennlp.tools.doccat.DocumentSampleStream;
    import opennlp.tools.doccat.FeatureGenerator;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;

    public class DoccatUsingBigram {

        public static void main(String[] args) throws IOException {
            InputStream dataIn = new FileInputStream(args[0]);
            try {
                ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
                // Here you can use it as part of building the model
                ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
                DoccatModel model = DocumentCategorizerME.train("en", sampleStream, 10, 100, new MyBigramFeatureGenerator());

                // Now use it. The categorizer gets the same feature generator, so that
                // classification sees the same features as training (OpenNLP 1.5.x API).
                DocumentCategorizerME classifier = new DocumentCategorizerME(model, new MyBigramFeatureGenerator());
                String[] someData = "whatever you are trying to classify".split(" ");
                double[] categorize = classifier.categorize(someData);
            } catch (IOException e) {
                // Failed to read or parse training data, training failed
                e.printStackTrace();
            } finally {
                dataIn.close();
            }
        }

        public static class MyBigramFeatureGenerator implements FeatureGenerator {

            @Override
            public Collection<String> extractFeatures(String[] text) {
                return generate(Arrays.asList(text), 2, "");
            }

            // Builds every run of n consecutive tokens, joined with the separator
            private List<String> generate(List<String> input, int n, String separator) {
                List<String> outGrams = new ArrayList<String>();
                for (int i = 0; i < input.size() - (n - 2); i++) {
                    String gram = "";
                    if ((i + n) <= input.size()) {
                        for (int x = i; x < (n + i); x++) {
                            gram += input.get(x) + separator;
                        }
                        // Drop the trailing separator
                        gram = gram.substring(0, gram.lastIndexOf(separator));
                        outGrams.add(gram);
                    }
                }
                return outGrams;
            }
        }
    }
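To see what the bigram step produces without pulling in OpenNLP, here is a minimal standalone sketch that uses only the JDK. The class name BigramSketch, the sample tokens, and the "_" separator are assumptions for illustration and are not part of OpenNLP or LibShortText. A visible separator keeps token pairs such as ("ab", "c") and ("a", "bc") from collapsing into the same feature string.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, independent of OpenNLP.
class BigramSketch {

    // Returns every run of n consecutive tokens, joined with the separator.
    static List<String> ngrams(List<String> tokens, int n, String separator) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add(String.join(separator, tokens.subList(i, i + n)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("cheap", "flights", "to", "paris");
        // prints [cheap_flights, flights_to, to_paris]
        System.out.println(ngrams(tokens, 2, "_"));
    }
}
```

The same loop could sit inside extractFeatures if you want the separator in your feature names. Inputs shorter than n simply yield an empty list.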
Hope this helps...