(How) Can I use Bigram Features with the OpenNLP Document Classifier?



I have a collection of short documents (titles, phrases, and sentences), and I would like to add bigram features of the kind used in the tool LibShortText:

http://www.csie.ntu.edu.tw/~cjlin/libshorttext/

Is this possible?

The documentation only explains how to do this for the name finder, using

    BigramNameFeatureGenerator()

and not for the document classifier.

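I believe the trainer and the classifier both accept custom feature generators in their methods. However, each one must be an implementation of FeatureGenerator, and BigramNameFeatureGenerator is not an implementation of that interface. So I made a quick implementation as an inner class below. Try this (untested) code when you get a chance: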

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Collection;
    import java.util.List;
    import opennlp.tools.doccat.DoccatModel;
    import opennlp.tools.doccat.DocumentCategorizerME;
    import opennlp.tools.doccat.DocumentSample;
    import opennlp.tools.doccat.DocumentSampleStream;
    import opennlp.tools.doccat.FeatureGenerator;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;

    // Untested; written against the OpenNLP 1.5.x doccat API.
    public class DoccatUsingBigram {

      public static void main(String[] args) {
        try (InputStream dataIn = new FileInputStream(args[0])) {
          ObjectStream<String> lineStream =
              new PlainTextByLineStream(dataIn, "UTF-8");
          // Here the generator is used as part of building the model
          ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
          DoccatModel model = DocumentCategorizerME.train(
              "en", sampleStream, 10, 100, new MyBigramFeatureGenerator());

          // Now use it. The classifier gets the same generator, so the
          // features at prediction time match the ones used in training.
          DocumentCategorizerME classifier =
              new DocumentCategorizerME(model, new MyBigramFeatureGenerator());
          String[] someData = "whatever you are trying to classify".split(" ");
          double[] outcomes = classifier.categorize(someData);
          System.out.println(classifier.getBestCategory(outcomes));

        } catch (IOException e) {
          // Failed to read or parse the training data, so training failed
          e.printStackTrace();
        }
      }

      public static class MyBigramFeatureGenerator implements FeatureGenerator {

        @Override
        public Collection<String> extractFeatures(String[] text) {
          // A non-empty separator keeps "ab c" and "a bc" distinct
          return generate(Arrays.asList(text), 2, "_");
        }

        // Joins every run of n adjacent tokens into one feature string
        private List<String> generate(List<String> input, int n, String separator) {
          List<String> outGrams = new ArrayList<String>();
          for (int i = 0; i + n <= input.size(); i++) {
            StringBuilder gram = new StringBuilder(input.get(i));
            for (int x = i + 1; x < i + n; x++) {
              gram.append(separator).append(input.get(x));
            }
            outGrams.add(gram.toString());
          }
          return outGrams;
        }
      }
    }
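
If you want to check what such a generator produces before wiring it into OpenNLP, here is a minimal standalone sketch of the bigram step on its own. It needs no OpenNLP on the classpath. The class name `BigramSketch`, the sample title, and the `_` separator are made up for illustration and are not part of any library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not an OpenNLP class.
class BigramSketch {

    // Joins each pair of adjacent tokens into one feature string.
    static List<String> bigrams(String[] tokens, String separator) {
        List<String> features = new ArrayList<String>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            features.add(tokens[i] + separator + tokens[i + 1]);
        }
        return features;
    }

    public static void main(String[] args) {
        String[] title = "cheap flights to paris".split(" ");
        System.out.println(bigrams(title, "_"));
        // prints [cheap_flights, flights_to, to_paris]
    }
}
```

Note that a one-word document yields an empty list, so for very short titles you may want to emit the single tokens as features as well.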

Hope this helps...

