Jumanpp Versions

Juman++ (a Morphological Analyzer Toolkit)

v2.0.0-rc4

7 months ago

Highlights

  • Improved hash function with better IPC (instructions per cycle)
  • Fixes for modern compilers/distributions

v2.0.0-rc3

4 years ago

Juman++ Version: 2.0.0-rc3 / Dictionary: 20190731-356e143 / LM: K:20190430-7d143fb L:20181122-b409be68 F:20171214-9d125cb

What's new

  • WARNING: models are not compatible with binaries of previous versions. On the other hand, they are compatible with the master branch now.
  • Check that statically-generated inference code uses a compatible model
  • Protobuf-based output formats (optional, requires protobuf 3.0+ installed)
  • Use https://github.com/s-yata/darts-clone as the trie implementation; the trie index is now half its previous size
  • Model definitions can now be written in text files, not just the C++ DSL

Jumandic-specific

  • Escape bad characters for JUMAN/lattice output formats
  • Fix kaomoji problem breaking brackets (#97)
  • Corpus fixes
  • Analysis fixes by partial annotations
  • Added the reading field to the aliasing set (but don't trust the reading results in analysis very much: our corpora are not clean for those annotations)

The JUMAN output format now escapes the following characters: <, >, ", the half-width space, and the tab character. Tabs are replaced by \t; the other characters are replaced by their full-width counterparts. For the replaced characters we output the 元半角 tag in the feature field. The lattice output format escapes only tabs. Protobuf output formats don't escape anything.

Example:

スペース が好きだ
スペース すぺーす スペース 名詞 6 普通名詞 1 * 0 * 0 "代表表記:スペース/すぺーす カテゴリ:場所-その他"
　 　 　 特殊 1 空白 6 * 0 * 0 "代表表記:S/* 元半角"
が が が 助詞 9 格助詞 1 * 0 * 0 NIL
好きだ すきだ 好きだ 形容詞 3 * 0 ナ形容詞 21 基本形 2 "代表表記:好きだ/すきだ 反義:形容詞:嫌いだ/きらいだ 動詞派生:好く/すく"
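
The escaping rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described in these notes, not the actual Juman++ implementation: the function name and the returned flag are made up, and it assumes that only the full-width replacements (not the tab replacement) trigger the 元半角 tag.

```python
from typing import Tuple

# Half-width characters and their full-width counterparts.
FULL_WIDTH = {
    "<": "\uFF1C",  # full-width less-than sign
    ">": "\uFF1E",  # full-width greater-than sign
    '"': "\uFF02",  # full-width quotation mark
    " ": "\u3000",  # ideographic (full-width) space
}


def escape_for_juman(surface: str) -> Tuple[str, bool]:
    """Return (escaped surface, whether a full-width replacement happened).

    Hypothetical helper: the boolean stands in for "emit the 元半角 tag".
    """
    out = []
    replaced = False
    for ch in surface:
        if ch == "\t":
            out.append("\\t")  # tab becomes the two characters backslash + t
        elif ch in FULL_WIDTH:
            out.append(FULL_WIDTH[ch])
            replaced = True
        else:
            out.append(ch)
    return "".join(out), replaced
```

Consumers of the JUMAN format can use the 元半角 tag to map a full-width surface back to the original half-width character.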

When is the final release?

  • We need to clean up training corpora somewhat and update our RNN model

v2.0.0-rc2

6 years ago

This is the second pre-release of Juman++ V2. The main focus was to make analysis of non-core corpora (e.g. web blog text) more stable.

No further major features or modifications are planned before the next non-rc release, but we want to fix some dictionary inconsistencies before making the final release.

New Features

  • Windows support! Big thanks to @DoumanAsh! Windows Vista and later are supported; XP is NOT supported. Builds with MSVC 2017 and gcc-mingw64 (we test those platforms on the internal CI). It should probably build with MSVC 2015 as well, but I haven't tried. There are no binaries yet, but you can help us by creating an installer.
  • Can now write output to a file with -o or --output.
  • --segment now outputs a space-delimited segmentation result without other information. You can also change the delimiter with the --segment-separator flag.
  • --partial-input treats the input as partially annotated and tries to produce an analysis result that respects the restrictions specified by the partial annotation.
  • --auto-nbest automatically adjusts beam widths (local, global left) and lattice output size depending on the input length.
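
As a rough sketch of how the new flags might be driven from a script, the Python below builds a command line and splits the --segment output. Several things here are assumptions, not confirmed by these notes: that the executable is on PATH as jumanpp, that it reads input from stdin, that flag values are passed as separate arguments, and that one output line is produced per input line. The helper names are hypothetical.

```python
import subprocess
from typing import List, Optional


def build_command(segment: bool = False,
                  separator: Optional[str] = None,
                  output: Optional[str] = None,
                  executable: str = "jumanpp") -> List[str]:
    """Assemble a jumanpp invocation from the flags listed above."""
    cmd = [executable]
    if segment:
        cmd.append("--segment")
    if separator is not None:
        # Assumption: the value follows the flag as a separate argument.
        cmd += ["--segment-separator", separator]
    if output is not None:
        cmd += ["--output", output]
    return cmd


def segment_text(text: str, separator: str = " ") -> List[List[str]]:
    """Run jumanpp in segmentation mode and split each output line."""
    result = subprocess.run(
        build_command(segment=True, separator=separator),
        input=text,
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise if jumanpp exits with an error
    )
    return [line.split(separator) for line in result.stdout.splitlines()]
```

Choosing a separator that cannot occur in the input (instead of the default space) avoids ambiguity when the text itself contains spaces.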

Model Stability

Models should be significantly more robust than before when analyzing random web text.

v2.0.0-rc1

6 years ago

A first public preview of Juman++v2

Notable changes

  • Complete rewrite of Juman++
  • Improved analysis speed (>100x) versus v1; RNN models should take about 1.8 times as long as plain JUMAN.
  • Improved model accuracy on Kyoto Corpus and KWDLC
  • Reduced model size
  • Reduced memory usage at analysis time
  • Juman++ can now be used as a library (examples will come later)
  • Improved emoji support
% jumanpp
おめでとう🎉㊗️23歳かぁ〜若い〜✧
おめでとう おめでとう おめでとう 感動詞 12 * 0 * 0 * 0 "代表表記:おめでとう/おめでとう"
🎉 🎉 🎉 特殊 1 記号 5 * 0 * 0 "代表表記:🎉/* 絵文字種類:ACTIVITIES:EVENT 絵文字:PARTY_POPPER"
㊗️ ㊗️ ㊗️ 特殊 1 記号 5 * 0 * 0 "代表表記:㊗️/* 絵文字種類:SYMBOLS:ALPHANUM 絵文字:JAPANESE_CONGRATULATIONS_BUTTON"
23 23 23 名詞 6 数詞 7 * 0 * 0 "カテゴリ:数量 未知語:数字"
歳 さい 歳 接尾辞 14 名詞性名詞助数辞 3 * 0 * 0 "代表表記:歳/さい 準内容語"
かぁ〜 か か 助詞 9 接続助詞 3 * 0 * 0 "非標準表記:DPSL"
若い わかい 若い 形容詞 3 * 0 イ形容詞アウオ段 18 基本形 2 "代表表記:若い/わかい"
〜 〜 〜 特殊 1 記号 5 * 0 * 0 NIL
✧ ✧ ✧ 未定義語 15 その他 1 * 0 * 0 "未知語:その他 品詞推定:特殊"
EOS
  • Improved kaomoji support (thanks to neologd/unidic for this)
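
The JUMAN-format lines in the emoji sample above are space-separated with a trailing quoted feature string (or NIL). A minimal parsing sketch in Python follows. The field names are labels chosen for this example, not identifiers from Juman++, and the parser assumes that surfaces contain no spaces (see the notes on escaping elsewhere in this document).

```python
from typing import Dict, List, Optional

# Illustrative names for the 12 space-separated columns.
FIELDS: List[str] = [
    "surface", "reading", "baseform",
    "pos", "pos_id", "subpos", "subpos_id",
    "conj_type", "conj_type_id", "conj_form", "conj_form_id",
    "features",
]


def parse_line(line: str) -> Optional[Dict[str, object]]:
    """Parse one JUMAN-format line; return None for EOS or blank lines."""
    line = line.rstrip("\n")
    if not line or line == "EOS":
        return None
    # Lines starting with "@ " are alternative analyses of the previous node.
    is_alias = line.startswith("@ ")
    if is_alias:
        line = line[2:]
    # Split at most 11 times so the quoted feature string stays in one piece.
    parts = line.split(" ", len(FIELDS) - 1)
    if len(parts) != len(FIELDS):
        raise ValueError(f"unexpected number of fields: {line!r}")
    token: Dict[str, object] = dict(zip(FIELDS, parts))
    raw = parts[-1]
    token["features"] = [] if raw == "NIL" else raw.strip('"').split(" ")
    token["is_alias"] = is_alias
    return token
```

For example, parsing the line for 23 yields the surface "23", the part of speech 名詞, and the two features カテゴリ:数量 and 未知語:数字.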

Breaking changes

  • In the lattice output format, nodes have continuous numbering.
  • Score values are considerably higher than in V1 (can see them in lattice output)
  • V2 doesn’t escape tabs and (half-width) spaces in all output formats (WONTFIX)
    • Generally, text-based output formats require your input not to contain half-width characters
    • There will be protobuf-based binary output formats which can handle such cases and they should be preferred for general text analysis
  • Juman++v2 can lie about readings if the nodes are indistinguishable (WONTFIX until the Kyoto corpus is reading-annotated); 代表表記 are always correct.
辛い からい 辛い 形容詞 3 * 0 イ形容詞アウオ段 18 基本形 2 "代表表記:辛い/からい 反義:形容詞:甘い/あまい"
@ 辛い *からい* 辛い 形容詞 3 * 0 イ形容詞アウオ段 18 基本形 2 "代表表記:辛い/*つらい*"
こと こと こと 名詞 6 形式名詞 8 * 0 * 0 NIL
だ だ だ 判定詞 4 * 0 判定詞 25 基本形 2 NIL
EOS
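
Because readings may be wrong for indistinguishable nodes while 代表表記 stays correct, downstream code may prefer to read the 代表表記 feature rather than the reading column. The Python sketch below collects the 代表表記 of each node together with those of its "@" alias lines. The helper names are hypothetical, and the sample lines in the usage note omit the emphasis markers shown in the listing above.

```python
import re
from typing import Iterable, List, Optional

# Matches the value of the 代表表記 feature up to the next space or quote.
REP_RE = re.compile(r'代表表記:([^\s"]+)')


def representative_form(line: str) -> Optional[str]:
    """Return the 代表表記 value of a JUMAN-format line, if present."""
    match = REP_RE.search(line)
    return match.group(1) if match else None


def candidate_forms(lines: Iterable[str]) -> List[List[str]]:
    """Group 代表表記 values per node; '@' lines extend the previous node."""
    groups: List[List[str]] = []
    for line in lines:
        if line == "EOS":
            break
        rep = representative_form(line)
        if line.startswith("@ ") and groups:
            if rep is not None:
                groups[-1].append(rep)
        else:
            groups.append([rep] if rep is not None else [])
    return groups
```

For the 辛い node and its alias above, this yields both 辛い/からい and 辛い/つらい as candidates, which makes the ambiguity explicit instead of relying on the single reading column.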

Known issues

  • The provided model is not robust enough when analyzing spoken language with default settings. We hope to fix this problem before the main release. Please report such cases on Twitter with the #jumanpp hashtag.
% jumanpp
いろいろカスタマイズできてよさそうです
いろいろ いろいろ いろいろ 副詞 8 * 0 * 0 * 0 "代表表記:色々/いろいろ"
カスタマイズ カスタマイズ カスタマイズ 名詞 6 普通名詞 1 * 0 * 0 "自動獲得:Wikipedia Wikipediaリダイレクト:カスタム"
できて できて できる 動詞 2 * 0 母音動詞 1 タ系連用テ形 14 "代表表記:出来る/できる"
よ よ る 接尾辞 14 動詞性接尾辞 7 母音動詞 1 文語命令形 18 "代表表記:る/る"
さ さ する 接尾辞 14 動詞性接尾辞 7 サ変動詞 16 未然形 3 "代表表記:する/する"
そうです そうです そうだ 助動詞 5 * 0 助動詞そうだ型 29 デス列基本形 5 NIL
EOS
% jumanpp --global-beam 15
いろいろカスタマイズできてよさそうです
いろいろ いろいろ いろいろ 副詞 8 * 0 * 0 * 0 "代表表記:色々/いろいろ"
カスタマイズ カスタマイズ カスタマイズ 名詞 6 普通名詞 1 * 0 * 0 "自動獲得:Wikipedia Wikipediaリダイレクト:カスタム"
できて できて できる 動詞 2 * 0 母音動詞 1 タ系連用テ形 14 "代表表記:出来る/できる"
よ よ よい 形容詞 3 * 0 イ形容詞アウオ段 18 語幹 1 "代表表記:良い/よい 反義:形容詞:悪い/わるい"
さ さ さ 接尾辞 14 名詞性述語接尾辞 1 * 0 * 0 "代表表記:さ/さ カテゴリ:抽象物;数量 準内容語"
そうです そうです そうだ 接尾辞 14 形容詞性述語接尾辞 5 ナ形容詞 21 デス列基本形 29 "代表表記:そうだ/そうだ"
EOS