RAKE-Java

A Java 8 implementation of the Rapid Automatic Keyword Extraction (RAKE) algorithm as described in: Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic Keyword Extraction from Individual Documents. In M. W. Berry & J. Kogan (Eds.), Text Mining: Theory and Applications: John Wiley & Sons.

The implementation is based on the Python one from https://github.com/aneesha/RAKE (although some changes have been made). The source code is released under the GPL v3 license.
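For orientation, here is a minimal sketch of the scoring step of plain RAKE as the paper describes it: text is split into candidate phrases at stop words and punctuation, each word is scored as degree/frequency, and a phrase scores the sum of its word scores. This is not the code of this library (which additionally relies on POS tags); the class `RakeSketch`, its method names and the tiny stop word set are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; it is not part of RAKE-Java.
class RakeSketch {

    // Split the text into candidate phrases: punctuation and stop words act as delimiters.
    public static List<List<String>> candidatePhrases(String text, Set<String> stopWords) {
        List<List<String>> phrases = new ArrayList<>();
        // Anything that is not a letter, digit, whitespace or hyphen ends a phrase.
        for (String fragment : text.toLowerCase().split("[^\\p{L}\\p{N}\\s-]+")) {
            List<String> current = new ArrayList<>();
            for (String word : fragment.trim().split("\\s+")) {
                if (word.isEmpty() || stopWords.contains(word)) {
                    // A stop word closes the phrase collected so far.
                    if (!current.isEmpty()) {
                        phrases.add(current);
                        current = new ArrayList<>();
                    }
                } else {
                    current.add(word);
                }
            }
            if (!current.isEmpty()) {
                phrases.add(current);
            }
        }
        return phrases;
    }

    // Score each phrase as the sum of deg(w) / freq(w) over its words.
    public static Map<String, Double> scorePhrases(List<List<String>> phrases) {
        Map<String, Integer> freq = new HashMap<>();
        Map<String, Integer> degree = new HashMap<>();
        for (List<String> phrase : phrases) {
            for (String word : phrase) {
                freq.merge(word, 1, Integer::sum);
                // Each occurrence adds the phrase length: co-occurring words plus the word itself.
                degree.merge(word, phrase.size(), Integer::sum);
            }
        }
        Map<String, Double> scores = new LinkedHashMap<>();
        for (List<String> phrase : phrases) {
            double score = 0.0;
            for (String word : phrase) {
                score += (double) degree.get(word) / freq.get(word);
            }
            scores.put(String.join(" ", phrase), score);
        }
        return scores;
    }

    public static void main(String[] args) {
        // Toy stop word set; real runs would use a full stop list.
        Set<String> stop = new HashSet<>(Arrays.asList("is", "a", "of"));
        String text = "Keyword extraction is a task. Automatic keyword extraction of documents";
        System.out.println(scorePhrases(candidatePhrases(text, stop)));
    }
}
```

With this toy input, "automatic keyword extraction" scores 8.0 and "keyword extraction" scores 5.0, because words that appear in longer phrases receive a higher degree.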

If you want to use it with Maven, add this repository to your pom.xml:

    <repository>
        <id>galan-maven-repo</id>
        <name>galan-maven-repo-releases</name>
        <url>http://galan.ehu.es/artifactory/ext-release-local</url>
    </repository>

This implementation requires a POS tagger in order to work. For English, for example, the Illinois POS Tagger could be used:

http://cogcomp.cs.illinois.edu/page/software_view/POS

For Spanish or other languages:

FreeLing --> http://nlp.lsi.upc.edu/freeling/

or the Stanford POS Tagger --> http://nlp.stanford.edu/software/tagger.shtml

The implementation is in a beta state.

TODO:

 - More testing 

An example parser for English that provides the required data (using the Illinois POS Tagger):


    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;
    import LBJ2.nlp.SentenceSplitter;
    import LBJ2.nlp.WordSplitter;
    import LBJ2.nlp.seg.PlainToTokenParser;
    import LBJ2.parse.Parser;
    import edu.illinois.cs.cogcomp.lbj.chunk.Chunker;
    import edu.illinois.cs.cogcomp.lbj.pos.POSTagger;
    import edu.ehu.galan.cvalue.model.Token;
     ......

    // pFile is the plain-text input handed to the sentence splitter (defined elsewhere).
    List<LinkedList<Token>> tokenizedSentenceList = new ArrayList<>();
    List<String> sentenceList = new ArrayList<>();
    POSTagger tagger = new POSTagger();
    Chunker chunker = new Chunker();
    boolean first = true;
    Parser parser = new PlainToTokenParser(new WordSplitter(new SentenceSplitter(pFile)));
    String sentence = "";
    LinkedList<Token> tokenList = null;
    for (LBJ2.nlp.seg.Token word = (LBJ2.nlp.seg.Token) parser.next(); word != null;
            word = (LBJ2.nlp.seg.Token) parser.next()) {
        // Chunk and tag the current word.
        String chunked = chunker.discreteValue(word);
        tagger.discreteValue(word);
        if (first) {
            // Start a new token list for each new sentence.
            tokenList = new LinkedList<>();
            tokenizedSentenceList.add(tokenList);
            first = false;
        }
        tokenList.add(new Token(word.form, word.partOfSpeech, null, chunked));
        sentence = sentence + " " + (word.form);
        if (word.next == null) {
            // Last word of the sentence: store the sentence and reset the state.
            sentenceList.add(sentence);
            first = true;
            sentence = "";
        }
    }
    parser.reset();
     

RAKE can then be run on the parsed document:


    // Build the document from the sentence and token lists produced by the parser above.
    Document doc = new Document(full_path, name);
    doc.setSentenceList(sentenceList);
    doc.setTokenList(tokenizedSentenceList);
    RakeAlgorithm ex = new RakeAlgorithm();
    // Load the stop word and punctuation lists.
    ex.loadStopWordsList("resources/lite/stopWordLists/RakeStopLists/SmartStopListEn");
    ex.loadPunctStopWord("resources/lite/stopWordLists/RakeStopLists/RakePunctDefaultStopList");
    // The project's own reader for English plain text.
    PlainTextDocumentReaderLBJEn parser = new PlainTextDocumentReaderLBJEn();
    parser.readSource("testCorpus/textAstronomy");
    // Run RAKE on the document and retrieve the extracted terms.
    ex.init(doc);
    ex.runAlgorithm();
    doc.getTermList();

README source: Neuw84/RAKE-Java