Accurately generate all possible forms of an English word, e.g. "election" --> "elect", "electoral", "electorate" etc.
Hello again!
@chrisjbryant suggested using python-Levenshtein to get similarity ratios between strings in the comments of pull request #10, and though I've been busy until now, I finally got around to testing it. This pull request includes some scripts in the Markdown, in case you want to reproduce my tests and results. However, I would recommend not giving the scripts themselves much attention, and focusing on the explanations and outputs instead.
I created the following file in the root folder of the project, where it could read from `test_values.py`:
```python
#!/usr/bin/env python
# encoding: utf-8

from difflib import SequenceMatcher
from Levenshtein import ratio
import unittest


class TestWordForms(unittest.TestCase):
    """
    Simple TestCase for a specific input to output, one instance generated
    per test case for use in a TestSuite.
    """

    def __init__(self, text_input: str, expected_output: str, description: str = ""):
        super().__init__()
        self.text_input = text_input
        self.expected_output = expected_output
        self.description = description

    def setUp(self):
        pass

    def tearDown(self):
        pass

    def runTest(self):
        self.assertEqual(
            SequenceMatcher(None, self.text_input, self.expected_output).ratio(),
            ratio(self.text_input, self.expected_output),
            self.description,
        )


if __name__ == "__main__":
    from test_values import test_values

    suite = unittest.TestSuite()
    suite.addTests(
        TestWordForms(
            inp,
            value,
            f"difflib.SequenceMatcher(None, {repr(inp)}, {repr(value)}) ~= Levenshtein.ratio({repr(inp)}, {repr(value)})",
        )
        for inp, out in test_values
        for values in out.values()
        for value in values
    )
    unittest.TextTestRunner().run(suite)
```
In short, this takes all input values from `test_values.py`, and all individual outputs for these inputs, and checks whether the similarity ratio between the two is identical when using `difflib.SequenceMatcher().ratio()` vs. `Levenshtein.ratio()`. The output is the following:
```
.............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
----------------------------------------------------------------------
Ran 621 tests in 0.080s

OK
```
So, `Levenshtein.ratio()` is indeed equivalent to `difflib.SequenceMatcher().ratio()` for these test cases.
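For readers without python-Levenshtein installed, the quantity it computes can be sketched in pure Python: the ratio is `(lensum - distance) / lensum`, where the distance counts insertions and deletions as 1 and substitutions as 2. The function name `levenshtein_ratio` below is my own, and this is a sketch for intuition only; the two metrics are not guaranteed to coincide on arbitrary inputs, they were merely observed to agree on this project's test values.

```python
from difflib import SequenceMatcher

def levenshtein_ratio(a: str, b: str) -> float:
    """Sketch of what Levenshtein.ratio computes: (lensum - dist) / lensum,
    with insertions/deletions costing 1 and substitutions costing 2."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                               # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                               # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 2
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # (mis)match
    lensum = m + n
    return (lensum - d[m][n]) / lensum if lensum else 1.0

for a, b in [("elect", "election"), ("run", "ran")]:
    print(levenshtein_ratio(a, b), SequenceMatcher(None, a, b).ratio())
```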
Next, performance. Again, I wrote a quick script for this. None of these scripts are included in the actual commits, as they should not be packaged with `word_forms`. The script is:
```python
#!/usr/bin/env python
# encoding: utf-8

from timeit import timeit
from difflib import SequenceMatcher
from Levenshtein import ratio
from test_values import test_values

test_cases = [
    (inp, value)
    for inp, out in test_values
    for values in out.values()
    for value in values
]

n = 100
ratio_list = []
for str_one, str_two in test_cases:
    diff = timeit(lambda: SequenceMatcher(None, str_one, str_two).ratio(), number=n)
    leven = timeit(lambda: ratio(str_one, str_two), number=n)
    ratio_list.append(diff / leven)
    # print(f"Levenshtein.ratio() is {ratio_list[-1]:.4f} times as fast as difflib.SequenceMatcher().ratio() for {repr(str_one)} to {repr(str_two)}")

print(f"Minimum performance gain (ratio): {min(ratio_list):.4f} times as fast")
print(f"Maximum performance gain (ratio): {max(ratio_list):.4f} times as fast")
print(f"Median performance gain (ratio) : {sorted(ratio_list)[round(len(ratio_list)/2)]:.4f} times as fast")
```
Which outputted:

```
Minimum performance gain (ratio): 21.2509 times as fast
Maximum performance gain (ratio): 194.4625 times as fast
Median performance gain (ratio) : 78.2975 times as fast
```
So, yes, it is much faster. Will it be a noticeable performance increase when implemented in `word_forms`? Well, I made a quick script for that too.
```python
#!/usr/bin/env python
# encoding: utf-8

from timeit import timeit
from word_forms.word_forms import get_word_forms
from test_values import test_values

n = 200
speed_list = []
for test_case in test_values:
    str_one = test_case[0]
    speed = timeit(lambda: get_word_forms(str_one), number=n)
    speed_list.append(speed)
    # print(f"Took {speed:.8f}s")

print(f"Minimum execution time (seconds): {min(speed_list):.8f}s")
print(f"Maximum execution time (seconds): {max(speed_list):.8f}s")
print(f"Median execution time (seconds) : {sorted(speed_list)[round(len(speed_list)/2)]:.8f}s")
```
Which outputted the following when using `difflib.SequenceMatcher().ratio()` (note that each execution time is for calling the function 200 times):

```
Minimum execution time (seconds): 0.01940580s
Maximum execution time (seconds): 1.16317950s
Median execution time (seconds) : 0.07265300s
```

and the following for `Levenshtein.ratio()`:

```
Minimum execution time (seconds): 0.01647300s
Maximum execution time (seconds): 1.23246420s
Median execution time (seconds) : 0.05827050s
```
Considering the median, there is a noticeable performance increase of roughly 20%, though likely not enough for any user to actually notice. Regardless, using `Levenshtein.ratio()` is definitely preferable, unless you desperately want to keep the number of dependencies to a minimum.

Thank you for the recommendation, @chrisjbryant.
All tests still pass, as expected.
- A `lemmatize()` function has been added in `word_forms.lemmatizer`. It uses `get_word_forms()` to generate all forms of the word and then picks the shortest form that appears first in the dictionary (i.e. in alphabetically sorted order).
- `Unipath` has been replaced by Python 3's `pathlib`.
- `NLTK` and `inflect` versions have been bumped.

This release is a result of the awesome work done by @CubieDev.
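That selection rule can be sketched in a few lines. Note that `pick_lemma` is a hypothetical helper written for this illustration, not the actual function in `word_forms.lemmatizer`, and the `get_word_forms("death")` output shown elsewhere on this page is hardcoded so the sketch runs without `word_forms` installed:

```python
def pick_lemma(forms: dict) -> str:
    """Pick the shortest form; break length ties alphabetically."""
    candidates = set().union(*forms.values())
    return min(candidates, key=lambda word: (len(word), word))

# Output of get_word_forms("death"), as quoted on this page (noun forms only).
death_forms = {
    "n": {"death", "dying", "dice", "Death", "dyings", "die", "deaths", "Deaths"},
}

print(pick_lemma(death_forms))  # "die" is the shortest form in this set
```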
All tests pass when running `python test_word_forms.py`.

`get_word_forms("death")` now also returns capitalized forms:

```
get_word_forms("death") => {
    'n': {'death', 'dying', 'dice', 'Death', 'dyings', 'die', 'deaths', 'Deaths'},
    ...
}
```
"politicss"
.get_word_forms("verb") => {
'a': {'verbal'},
'n': {'wordings', 'wording', 'word', 'verbs', 'verb', 'words'},
'r': {'verbally'},
'v': {'verbified',
'verbifies',
'verbify',
'verbifying',
'word',
'worded',
'wording',
'words'}
}
This is now:
get_word_forms("verb") => {
"a": {"verbal"},
"n": {"verbs", "verb"},
"r": {"verbally"},
"v": {"verbifying", "verbified", "verbify", "verbifies"},
}
"ran"
, "am"
and "was"
not returning any values. The old system returns:get_word_forms("ran") => {'n': set(), 'a': set(), 'v': set(), 'r': set()}
get_word_forms("run") => {
'n': {'runner', 'runners', 'run', 'runniness', 'runnings', 'running', 'runninesses', 'runs'},
'a': {'running', 'runny'},
'v': {'running', 'run', 'runs', 'ran'},
'r': set()
}
This is now:

```
get_word_forms("ran") => {
    'n': {'runner', 'runners', 'run', 'runniness', 'runnings', 'running', 'runninesses', 'runs'},
    'a': {'running', 'runny'},
    'v': {'running', 'run', 'runs', 'ran'},
    'r': set()
}
get_word_forms("run") => {
    'n': {'runner', 'runners', 'run', 'runniness', 'runnings', 'running', 'runninesses', 'runs'},
    'a': {'running', 'runny'},
    'v': {'running', 'run', 'runs', 'ran'},
    'r': set()
}
```
Module loading has also become much faster. The script:

```python
from time import time

t = time()
from word_forms.word_forms import get_word_forms
print(f"It took {time() - t:.4f}s to load the module")
```

used to output

```
It took 10.1868s to load the module
```

and now it outputs

```
It took 2.7437s to load the module
```
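The same measurement pattern can be reproduced with the standard library alone; here `perf_counter` is used instead of `time()` for a monotonic clock, and the module (a stdlib one, for the demo) is first evicted from `sys.modules` so the import is actually re-executed rather than served from the per-process cache:

```python
import sys
from time import perf_counter

sys.modules.pop("json", None)  # force a cold (re-)import for the measurement

t = perf_counter()
import json                    # the import being timed
elapsed = perf_counter() - t

print(f"It took {elapsed:.4f}s to load the module")
```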
In addition to these fixes, this pull request solves issues #2 and #3.

I've split up my work into small commits, so that with each commit the reasoning and changes are concise and hopefully easy to follow. I would heavily recommend going through each commit in order to see what I've done, rather than immediately jumping to the overall effect on the two main files of your project. I'll go through each of my commits and explain my reasoning and what they did.
"politicss"
.nltk.corpus.words.words()
, which gives you (if I recall correctly) some 240000 words rather than the previous 180000, and in considerably less time. This is responsible for improvement vi.get_related_lemmas()
is more intuitively implemented as a recursive function. I keep track of a list of known lemmas. When the function is called, it will take the word
parameter, get the lemmas for that word exactly, add those to known_lemmas
if not already present, and recursively call get_related_lemmas()
again with lemmas related to the original word
. Each of these recursive calls will add more entries to known_lemmas
, which is then returned.
Note that at this time (this will change in a later commit), this function is identical in functionality, it just has a different implementation..copy()
on a list rather than using a list comprehension to copy a list.CONJUGATED_VERB_LIST = [
['convoluted', 'convoluting', 'convolute', 'convolutes'],
['fawn', 'fawned', 'fawns', 'fawning']
]
v1 = Verbs({'convoluted', 'convoluting', 'convolute', 'convolutes'})
v2 = Verbs({'convoluted', 'convoluting', 'convolute', 'convolutes'})
CONJUGATED_VERB_DICT = {
'convoluted': v1,
'convoluting': v1,
'convolute': v1,
'convolutes': v1,
'fawn': v2,
'fawned': v2,
'fawns': v2,
'fawning': v2
}
In the old system, we needed:

```python
for verb in verb_set:
    for word_list in CONJUGATED_VERB_LIST:
        if verb in word_list:
            for word in word_list:
                related_word_dict["v"].add(word)
```

You can count the number of nested loops for yourself. Now we only need:

```python
for verb in verb_set:
    if verb in CONJUGATED_VERB_DICT:
        related_word_dict["v"] |= CONJUGATED_VERB_DICT[verb].verbs
```

This is considerably faster. Note that `|=` is an in-place set union.
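To see why the dict lookup wins, here is a self-contained comparison of both approaches. `Verbs` below is a minimal stand-in for the class of the same name in `word_forms`, and the data is the two-group example from above:

```python
class Verbs:
    """Minimal stand-in for word_forms' Verbs: just holds a set of forms."""
    def __init__(self, verbs):
        self.verbs = set(verbs)

CONJUGATED_VERB_LIST = [
    ['convoluted', 'convoluting', 'convolute', 'convolutes'],
    ['fawn', 'fawned', 'fawns', 'fawning'],
]

# Build the dict once: every conjugation points at one shared Verbs
# instance per group, so membership is a single hash probe.
CONJUGATED_VERB_DICT = {}
for group in CONJUGATED_VERB_LIST:
    shared = Verbs(group)
    for word in group:
        CONJUGATED_VERB_DICT[word] = shared

verb_set = {'fawning', 'convolute'}

# Old approach: scan every group for every verb.
old_result = set()
for verb in verb_set:
    for word_list in CONJUGATED_VERB_LIST:
        if verb in word_list:
            for word in word_list:
                old_result.add(word)

# New approach: one dict lookup per verb, then a set union.
new_result = set()
for verb in verb_set:
    if verb in CONJUGATED_VERB_DICT:
        new_result |= CONJUGATED_VERB_DICT[verb].verbs

print(old_result == new_result)  # both approaches agree
```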
The construction of `word_forms` is slightly optimized so that it no longer needs a list comprehension.

New lemmas found in `get_related_lemmas` are now filtered using the same similarity metric that `difflib.get_close_matches()` (used in `constants.py`) is built on: `difflib.SequenceMatcher`. A new lemma is only considered if it is deemed at least 40% similar to the previous lemma. This avoids jumps like `"verbal"` -> `"word"` and `"genetic"` -> `"origin"`.

Plurals ending in `"css"` are avoided, so we no longer get `"geneticss"` and `"politicss"`. We override this change later in commit 2ea150e.

Rather than only converting the input of `get_word_forms` to singular, we now use the NLTK `WordNetLemmatizer()` to lemmatize the input word as a noun, verb, adjective and adverb. With this change, we get `{"run"}` as words when the input was `"ran"`, or `{"be", "am"}` if we input `"am"`. Then, for each of these words, we call `get_related_lemmas`, and duplicate lemmas are removed. This is responsible for improvement v.

Tests have been added for `"am"` and `"ran"`. The Contribution section is also updated to reflect that this bug is now fixed. The modified program passes all provided tests; the original program fails 9 of them. The output of the tests on the original program is provided here: original_test_output.txt
I've tested my changes with Python 3.7.4. Whether the program is still Python 2 compatible, I do not know.

Note that `inflect` can still cause issues, as it comes up with words like `"runninesses"` as the plural of `"runniness"`.