Hyunwoongko Kss Versions

KSS: Korean String processing Suite

v6.0.4

2 weeks ago
  • Reimplement the hanja module because it could not be installed in the Colab environment.
  • Fix information about the morpheme analyzer backend in the README and docs.

v6.0.2

2 weeks ago
  • Add alias() function and fix some docs.

v6.0.1

2 weeks ago
  • [hotfix] Rename idiom.txt in MANIFEST.in to idioms.txt

v6.0.0

2 weeks ago

KSS is a Korean string processing suite that provides various functions for processing Korean strings. It is designed to be simple and easy to use in fields such as natural language processing, data preprocessing, and data analysis.

Usage

1. Basic Usage

All functions can be used by creating an instance of the Kss class and calling the instance with the inputs.

from kss import Kss

module = Kss("MODULE_NAME")
output = module("YOUR_INPUT_STRING", **kwargs)

2. Available Modules

If you want to check the available modules, you can use the available() function.

from kss import Kss

Kss.available()
['augment', 'collocate', 'g2p', 'hangulize', 'split_hanja', 'is_hanja', 'hanja2hangul', 'h2j', 'h2hcj', 'j2h', 'j2hcj', 'hcj2h', 'hcj2j', 'is_jamo', 'is_jamo_modern', 'is_hcj', 'is_hcj_modern', 'is_hangul_char', 'select_josa', 'combine_josa', 'extract_keywords', 'split_morphemes', 'paradigm', 'anonymize', 'clean_news', 'is_completed_form', 'get_all_completed_form_hangul_chars', 'get_all_incompleted_form_hangul_chars', 'filter_out', 'half2full', 'reduce_char_repeats', 'reduce_emoticon_repeats', 'remove_invisible_chars', 'normalize', 'preprocess', 'qwerty', 'romanize', 'is_unsafe', 'split_sentences', 'correct_spacing', 'summarize_sentences']

3. Checking the usage of each module

If you want to check the usage of each module, you can use the help() function.

from kss import Kss

module = Kss("split_sentences")
module.help()
Split texts into sentences.

Args:
    text (Union[str, List[str], Tuple[str]]): single text or list/tuple of texts
    backend (str): morpheme analyzer backend. 'mecab', 'pecab', 'punct' are supported
    num_workers (Union[int, str]): the number of multiprocessing workers
    strip (bool): strip all sentences or not
    return_morphemes (bool): whether to return morphemes or not
    ignores (List[str]): list of strings to ignore

Returns:
    Union[List[str], List[List[str]]]: outputs of sentence splitting

Examples:
    >>> from kss import Kss
    >>> split_sentences = Kss("split_sentences")
    >>> text = "회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요 다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다 강남역 맛집 토끼정의 외부 모습."
    >>> split_sentences(text)
    ['회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요', '다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다', '강남역 맛집 토끼정의 외부 모습.']

4. Multiprocessing

If you pass a list of strings, Kss automatically uses multiprocessing to process them in parallel. You can set the number of worker processes with the num_workers parameter. If num_workers is less than 2, Kss does not use multiprocessing.

from kss import Kss

module = Kss("MODULE_NAME")

# using all cores
output = module(["YOUR_INPUT_STRING1", "YOUR_INPUT_STRING2", ...], **kwargs)
# using 4 cores
output = module(["YOUR_INPUT_STRING1", "YOUR_INPUT_STRING2", ...], num_workers=4, **kwargs)
# using 1 core (no multiprocessing)
output = module(["YOUR_INPUT_STRING1", "YOUR_INPUT_STRING2", ...], num_workers=1, **kwargs)
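
The dispatch rule described above can be illustrated with a small standard-library sketch. This is not Kss's actual implementation: the helper name run_module and the "auto" sentinel are hypothetical, chosen only to show how a single string, a list with num_workers<2, and a list with several workers might be routed differently.

```python
import os
from multiprocessing import Pool
from typing import Callable, List, Sequence, Union


def run_module(
    func: Callable[[str], str],
    inputs: Union[str, Sequence[str]],
    num_workers: Union[int, str] = "auto",
) -> Union[str, List[str]]:
    """Apply func to one string, or to every string in a list or tuple."""
    # A single string is processed directly; there is nothing to parallelize.
    if isinstance(inputs, str):
        return func(inputs)

    # "auto" is a hypothetical sentinel meaning "use every available core".
    if num_workers == "auto":
        num_workers = os.cpu_count() or 1

    # Fewer than two workers: plain sequential loop, no extra processes.
    if num_workers < 2:
        return [func(text) for text in inputs]

    # Otherwise spread the inputs over a pool of worker processes.
    # func must be picklable, i.e. defined at the top level of a module.
    with Pool(processes=num_workers) as pool:
        return pool.map(func, inputs)
```

With this sketch, run_module(str.upper, ["a", "b"], num_workers=1) stays in the current process, while num_workers=4 would hand the list to four worker processes.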

5. Backward Compatibility

Older versions of Kss exposed each feature as a plain function. Kss still supports this functional style for backward compatibility.

from kss import split_sentences

output = split_sentences("YOUR_INPUT_STRING", **kwargs)

Supported Modules

See here for more details.

v5.2.0

1 month ago
  • Add is_compilable() function to check whether the Cython implementation can be built in the user's environment.
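from distutils.ccompiler import new_compiler
from distutils.sysconfig import customize_compiler

def is_compilable():
    # get_extra_compile_args() is a helper defined elsewhere in the build script.
    try:
        # 1. Try to compile csrc/sentence_splitter.cpp
        extra_compile_args, extra_link_args = get_extra_compile_args()
        compiler = new_compiler()
        customize_compiler(compiler)
        compiler.compile(['csrc/sentence_splitter.cpp'], extra_postargs=extra_compile_args)
        return True
    except Exception:
        # 2. Cannot compile csrc/sentence_splitter.cpp
        return False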

v5.1.0

1 month ago

The fast backend

If you want to split sentences quickly, you can call the split_sentences function with the backend='fast' option, available since Kss 5.0.0. This method is based on the fast algorithm used in Kss versions prior to 3.0. It is significantly faster than the mecab backend but less accurate, so it is useful when you need to split sentences very quickly and do not need high accuracy. The fast backend is implemented in both Python and Cython:

  • If your environment supports the installation of Cython, Kss uses the Cython implementation, which is the fastest option (roughly 600x faster than mecab).
  • Otherwise, it uses the Python implementation, which is slower than the Cython version but still faster than the mecab backend (roughly 4x faster than mecab).

Given the substantial speed advantage of the Cython implementation, it is strongly recommended over the Python alternative. Kss automatically detects the availability of Cython in your environment and will install it if feasible, so you don't need to worry about Cython and C++ dependencies.
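
The "use the compiled version if it is there, otherwise fall back to Python" behaviour can be sketched as follows. This is only an illustration under stated assumptions: the module name hypothetical_cython_splitter, the loader load_fast_splitter, and the deliberately naive split_pure_python stand-in are all hypothetical and do not reflect how Kss itself locates or implements its fast backend.

```python
import importlib
from typing import Callable, List, Tuple


def split_pure_python(text: str) -> List[str]:
    """Deliberately naive stand-in splitter: cut after '.', '!' or '?'."""
    sentences, current = [], ""
    for ch in text:
        current += ch
        if ch in ".!?":
            sentences.append(current.strip())
            current = ""
    if current.strip():
        sentences.append(current.strip())
    return sentences


def load_fast_splitter(
    module_name: str = "hypothetical_cython_splitter",
) -> Tuple[Callable[[str], List[str]], str]:
    """Return (splitter, implementation_name), preferring a compiled module."""
    try:
        # Hypothetical compiled extension; importing fails if it was never built.
        compiled = importlib.import_module(module_name)
        return compiled.split_sentences, "cython"
    except (ImportError, AttributeError):
        # No compiled module (or it lacks the function): use pure Python.
        return split_pure_python, "python"


splitter, implementation = load_fast_splitter()
print(implementation, splitter("First sentence. Second one!"))
```

The caller never needs to know which implementation was picked, which is the same convenience the release note describes.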

Accuracy (Normalized F1)

Backend        blogs_ko  blogs_lee  nested  sample  tweets  v_ending  wikipedia
mecab          0.8860    0.8887     0.9206  0.9682  0.8137  0.4815    1.0000
fast (Python)  0.6281    0.7899     0.6899  0.7482  0.5315  0.1596    0.7358
fast (Cython)  0.6545    0.8132     0.6372  0.8407  0.5892  0.1596    0.9566

Speed (msec)

Backend        blogs_ko  blogs_lee  nested  sample  tweets  v_ending  wikipedia
mecab          538.10    293.31     225.05  56.35   184.91  20.55     899.99
fast (Python)  146.75    70.94      52.84   12.11   37.80   4.69      255.90
fast (Cython)  0.91      0.55       0.46    0.09    0.40    0.05      1.12
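
To see where the "roughly 600x" and "roughly 4x" figures come from, the per-dataset speedups can be recomputed from the timings in the speed table. The snippet below is a small arithmetic check using only the numbers above; the names SPEED_MS and speedup_over_mecab are illustrative.

```python
# Timings in milliseconds, copied from the speed table above.
DATASETS = ["blogs_ko", "blogs_lee", "nested", "sample",
            "tweets", "v_ending", "wikipedia"]
SPEED_MS = {
    "mecab": [538.10, 293.31, 225.05, 56.35, 184.91, 20.55, 899.99],
    "fast (Python)": [146.75, 70.94, 52.84, 12.11, 37.80, 4.69, 255.90],
    "fast (Cython)": [0.91, 0.55, 0.46, 0.09, 0.40, 0.05, 1.12],
}


def speedup_over_mecab(backend):
    """Return {dataset: mecab_time / backend_time} for one backend."""
    return {
        name: base / elapsed
        for name, base, elapsed in zip(DATASETS, SPEED_MS["mecab"], SPEED_MS[backend])
    }


for backend in ("fast (Python)", "fast (Cython)"):
    ratios = speedup_over_mecab(backend)
    print(f"{backend}: {min(ratios.values()):.1f}x to {max(ratios.values()):.1f}x")
```

On these numbers the Python implementation is between about 3.5x and 4.9x faster than mecab, and the Cython implementation between about 410x and 800x, depending on the dataset.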

Please note that while the core algorithm in the fast backend mirrors that of Kss C++ 1.3.1, several bugs identified in the original implementation have been rectified in Kss 5.0.0.

v4.5.4

10 months ago

v4.5.3

1 year ago

v4.5.2

1 year ago

v4.5.1

1 year ago
  • Hotfix for some bugs in 4.5.0