Textgain Grasp

Essential NLP & ML, short & fast pure Python code

Grasp.py – Explainable AI

Grasp is a lightweight AI toolkit for Python, with tools for data mining, natural language processing (NLP), machine learning (ML) and network analysis. It has 300+ fast and essential algorithms, with ~25 lines of code per function, self-explanatory function names, no dependencies, bundled into one well-documented file: grasp.py (250KB). Or install with pip, including language models (25MB):

$ pip install git+https://github.com/textgain/grasp

Tools for Data Mining

Download stuff with download(url) (or dl), with built-in caching and logging:

src = dl('https://www.textgain.com', cached=True)

Parse HTML with dom(html) into an Element tree and search it with CSS Selectors:

for e in dom(src)('a[href^="http"]'): # external links
    print(e.href)
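
To make the selector above concrete: a[href^="http"] matches every <a> element whose href attribute starts with "http". The sketch below reproduces that one filter with Python's standard html.parser, as an illustration only. It is not how Grasp implements dom(), and LinkCollector and external_links are hypothetical names.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href values of <a> tags that start with a given prefix."""

    def __init__(self, prefix='http'):
        super().__init__()
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        if tag == 'a':
            href = dict(attrs).get('href') or ''
            if href.startswith(self.prefix):
                self.links.append(href)

def external_links(html):
    """Returns the hrefs that a[href^="http"] would select."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links

html = ('<p><a href="https://www.textgain.com">home</a> '
        '<a href="/about">about</a></p>')
print(external_links(html))
```

Only the absolute link is kept, the relative /about link is skipped.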

Strip HTML with plain(Element) to get a plain text string:

for word, count in wc(plain(dom(src))).items():
    print(word, count)
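
As a rough idea of what stripping HTML and counting words involves, here is a standard-library sketch. It assumes that text inside script and style tags should be dropped and that words are lowercased runs of word characters. Grasp's plain() and wc() may make different choices, and strip_tags and word_count are hypothetical names.

```python
import re
from collections import Counter
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Keeps text nodes, ignoring the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip:
            self.chunks.append(data)

def strip_tags(html):
    """Returns the visible text with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ' '.join(' '.join(parser.chunks).split())

def word_count(text):
    """Returns a {word: count} Counter of lowercased words."""
    return Counter(re.findall(r'\w+', text.lower()))

text = strip_tags('<p>The cat sat on <b>the</b> mat.</p><script>var x;</script>')
print(text)
print(word_count(text).most_common(1))
```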

Find articles with wikipedia(str), in HTML:

for e in dom(wikipedia('cat', language='en'))('p'):
    print(plain(e))

Find opinions with twitter.search(str):

for tweet in first(10, twitter.search('from:textgain')): # latest 10
    print(tweet.id, tweet.text, tweet.date)

Deploy APIs with App. Works with WSGI and Nginx:

app = App()
@app.route('/')
def index(*path, **query):
    return 'Hi! %s %s' % (path, query)
app.run('127.0.0.1', 8080, debug=True)

Once this app is up, go check http://127.0.0.1:8080/app?q=cat.
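
For readers new to WSGI, the sketch below shows how a URL like the one above can be split into path segments and query parameters before reaching a handler. It is a bare WSGI callable written with the standard library, not Grasp's App, and the exact values App passes to index() may differ. The call() helper is a hypothetical stand-in for a server so the example runs offline.

```python
from urllib.parse import parse_qs

def application(environ, start_response):
    """A bare WSGI app that echoes the request path and query."""
    # '/app' -> ('app',)
    path = tuple(p for p in environ.get('PATH_INFO', '/').split('/') if p)
    # 'q=cat' -> {'q': 'cat'} (first value of each parameter)
    query = {k: v[0] for k, v in
             parse_qs(environ.get('QUERY_STRING', '')).items()}
    body = ('Hi! %s %s' % (path, query)).encode('utf-8')
    start_response('200 OK', [
        ('Content-Type', 'text/plain; charset=utf-8'),
        ('Content-Length', str(len(body)))])
    return [body]

def call(app, path, query=''):
    """Invokes a WSGI app directly, without starting a server."""
    status = []
    def start_response(s, headers):
        status.append(s)
    environ = {'REQUEST_METHOD': 'GET',
               'PATH_INFO': path,
               'QUERY_STRING': query}
    body = b''.join(app(environ, start_response))
    return status[0], body.decode('utf-8')

print(call(application, '/app', 'q=cat'))
```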

Tools for Natural Language Processing

Get language with lang(str) for 40+ languages and ~92.5% accuracy:

print(lang('The cat sat on the mat.')) # {'en': 0.99}

Get locations with loc(str) for 25K+ EU cities:

print(loc('The cat lives in Catena.')) # {('Catena', 'IT', 43.8, 11.0): 1}

Get words & sentences with tok(str) (tokenize) at ~125K words/sec:

print(tok("Mr. etc. aren't sentence breaks! ;) This is:.", language='en'))

Get word polarity with pov(str) (point-of-view). Is it a positive or negative opinion?

print(pov(tok('Nice!', language='en'))) # +0.6
print(pov(tok('Dumb.', language='en'))) # -0.4
  • For de, en, es, fr, nl, with ~75% accuracy.
  • You'll need the language models in grasp/lm.

Tag word types with tag(str) in 10+ languages using robust ML models from UD:

for word, pos in tag(tok('The cat sat on the mat.'), language='en'):
    print(word, pos)
  • Parts-of-speech include NOUN, VERB, ADJ, ADV, DET, PRON, PREP, ...
  • For ar, da, de, en, es, fr, it, nl, no, pl, pt, ru, sv, tr, with ~95% accuracy.
  • You'll need the language models in grasp/lm.

Tag keywords with trie, a compiled dict that scans ~250K words/sec:

t = trie({'cat*': 1, 'mat' : 2})
for i, j, k, v in t.search('Cats love catnip.', etc='*'):
    print(i, j, k, v)
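
The idea behind a keyword trie can be sketched in a few lines: keywords are stored character by character in nested dicts, so each word in the text is checked in a single pass. This is a simplified illustration, not Grasp's trie. It assumes case-insensitive matching on whole words, with a trailing * meaning "any word starting with this prefix", and it yields (start, end, keyword, value), which may not be what Grasp's search() returns. build_trie, search and END are hypothetical names.

```python
import re

END = ''  # terminal marker; never collides with a single-character key

def build_trie(keywords):
    """Compiles {keyword: value} into nested {char: {...}} dicts."""
    root = {}
    for k, v in keywords.items():
        node = root
        for ch in k.rstrip('*').lower():
            node = node.setdefault(ch, {})
        node[END] = (k, v, k.endswith('*'))  # (keyword, value, is_prefix)
    return root

def search(root, text):
    """Yields (start, end, keyword, value) for each matching word."""
    for m in re.finditer(r'\w+', text):
        word, node = m.group().lower(), root
        for n, ch in enumerate(word, 1):
            node = node.get(ch)
            if node is None:
                break  # no keyword continues with this character
            hit = node.get(END)
            # Prefix keywords match early, exact ones only at the word end.
            if hit and (hit[2] or n == len(word)):
                yield m.start(), m.end(), hit[0], hit[1]
                break

t = build_trie({'cat*': 1, 'mat': 2})
for i, j, k, v in search(t, 'Cats love catnip.'):
    print(i, j, k, v)
```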

Get answers with gpt(). You'll need an OpenAI API key.

print(gpt("Why do cats sit on mats? (you're a psychologist)", key='...'))

Tools for Machine Learning

Machine Learning (ML) algorithms learn by example. If you show them 10K spam and 10K real emails (i.e., train a model), they can predict whether other emails are also spam or not.

Each training example is a {feature: weight} dict with a label. For text, the features could be words, the weights could be word count, and the label might be real or spam.

Quantify text with vec(str) (vectorize) into a {feature: weight} dict:

v1 = vec('I love cats! 😀', features=('c3', 'w1'))
v2 = vec('I hate cats! 😡', features=('c3', 'w1'))
  • c1, c2, c3 count consecutive characters. For c2, cats → 1x ca, 1x at, 1x ts.
  • w1, w2, w3 count consecutive words.
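
The counting described in the bullets can be sketched with collections.Counter. This is an illustration of character and word n-grams, not Grasp's vec(): the lowercasing, the word pattern and the 'c3:...' feature naming are assumptions, and char_ngrams, word_ngrams and vectorize are hypothetical names.

```python
import re
from collections import Counter

def char_ngrams(s, n):
    """Counts runs of n consecutive characters."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def word_ngrams(s, n):
    """Counts runs of n consecutive (lowercased) words."""
    words = re.findall(r'\w+', s.lower())
    return Counter(' '.join(words[i:i + n])
                   for i in range(len(words) - n + 1))

def vectorize(s, features=('c3', 'w1')):
    """Returns a {feature: weight} dict, e.g. {'w1:cats': 1, ...}."""
    v = Counter()
    for f in features:
        kind, n = f[0], int(f[1:])
        if kind == 'c':
            grams = char_ngrams(s.lower(), n)
        else:
            grams = word_ngrams(s, n)
        for g, count in grams.items():
            v['%s:%s' % (f, g)] += count  # hypothetical feature naming
    return dict(v)

print(char_ngrams('cats', 2))
print(vectorize('I love cats!', features=('w1',)))
```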

Train models with fit(examples), save as JSON, predict labels:

m = fit([(v1, '+'), (v2, '-')], model=Perceptron) # DecisionTree, KNN, ...
m.save('opinion.json')
m = fit(open('opinion.json'))
print(m.predict(vec('She hates dogs.'))) # {'+': 0.4, '-': 0.6}

Once trained, Model.predict(vector) returns a dict with label probabilities (0.0–1.0).
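
To show what training on {feature: weight} dicts can look like, here is a minimal multiclass perceptron in plain Python. It is a sketch, not Grasp's Perceptron: the update rule is the textbook one, and turning scores into probabilities with a softmax is an assumption made here for illustration. train, score and predict are hypothetical names.

```python
import math
from collections import defaultdict

def score(weights, v, label):
    """Dot product of the vector with the label's feature weights."""
    return sum(weights[label].get(f, 0.0) * w for f, w in v.items())

def train(examples, epochs=10):
    """Learns {label: {feature: weight}} from (vector, label) pairs."""
    weights = defaultdict(lambda: defaultdict(float))
    labels = sorted({label for _, label in examples})
    for _ in range(epochs):
        for v, label in examples:
            guess = max(labels, key=lambda l: score(weights, v, l))
            if guess != label:
                # Reward the right label, penalize the wrong guess.
                for f, w in v.items():
                    weights[label][f] += w
                    weights[guess][f] -= w
    return weights, labels

def predict(model, v):
    """Returns {label: probability}, using a softmax over the scores."""
    weights, labels = model
    scores = [score(weights, v, l) for l in labels]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    return {l: e / sum(exps) for l, e in zip(labels, exps)}

examples = [({'love': 1, 'cats': 1}, '+'),
            ({'hate': 1, 'cats': 1}, '-')]
model = train(examples)
print(predict(model, {'hate': 1, 'dogs': 1}))
```

The unseen feature dogs has no weight yet, so the prediction rests entirely on hate.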

Tools for Network Analysis

Map networks with Graph, a {node1: {node2: weight}} dict subclass:

g = Graph(directed=True)
g.add('a', 'b') # a → b
g.add('b', 'c') # b → c
g.add('b', 'd') # b → d
g.add('c', 'd') # c → d
print(g.sp('a', 'd')) # shortest path: a → b → d
print(top(pagerank(g))) # strongest node: d, 0.8
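
Both calls above can be approximated on a plain {node1: {node2: weight}} dict. The sketch below uses Dijkstra's algorithm for the shortest path and power iteration for PageRank. It is an illustration under assumptions (edge weight 1.0, damping 0.85, dangling nodes spread evenly), not Grasp's implementation, so the scores will not necessarily match the 0.8 shown above. shortest_path is a hypothetical name.

```python
import heapq

def shortest_path(g, a, b):
    """Dijkstra's algorithm; returns a list of nodes, or None."""
    queue, seen = [(0.0, a, [a])], set()
    while queue:
        cost, n, path = heapq.heappop(queue)
        if n == b:
            return path
        if n in seen:
            continue
        seen.add(n)
        for m, w in g.get(n, {}).items():
            if m not in seen:
                heapq.heappush(queue, (cost + w, m, path + [m]))
    return None

def pagerank(g, damping=0.85, iterations=50):
    """Power iteration; returns {node: score}, with scores summing to 1."""
    nodes = set(g) | {m for n in g for m in g[n]}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Nodes without outgoing edges share their rank with everyone.
        dangling = sum(rank[n] for n in nodes if not g.get(n))
        new = {n: (1 - damping + damping * dangling) / len(nodes)
               for n in nodes}
        for n in nodes:
            out = g.get(n, {})
            total = sum(out.values())
            for m, w in out.items():
                new[m] += damping * rank[n] * w / total
        rank = new
    return rank

g = {'a': {'b': 1.0}, 'b': {'c': 1.0, 'd': 1.0}, 'c': {'d': 1.0}, 'd': {}}
print(shortest_path(g, 'a', 'd'))
pr = pagerank(g)
print(max(pr, key=pr.get))
```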

See networks with viz(graph):

with open('g.html', 'w') as f:
    f.write(viz(g, src='graph.js'))

You'll need to set src to the grasp/graph.js lib.

Tools for Comfort

Easy date handling with date(v), where v is an int, a str, or another date:

print(date('Mon Jan 31 10:00:00 +0000 2000', format='%Y-%m-%d'))
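
For comparison, the same conversion with the standard datetime module needs an explicit input format. This sketch assumes an English locale for the day and month names. reformat_date is a hypothetical helper, not part of Grasp.

```python
from datetime import datetime

def reformat_date(s, input_format='%a %b %d %H:%M:%S %z %Y',
                  output_format='%Y-%m-%d'):
    """Parses a date string and formats it again."""
    return datetime.strptime(s, input_format).strftime(output_format)

print(reformat_date('Mon Jan 31 10:00:00 +0000 2000'))  # 2000-01-31
```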

Easy path handling with cd(...), which always points to the script's folder:

print(cd('kb', 'en-loc.csv'))

Easy CSV handling with csv([path]), a list of lists of values:

for code, country, _, _, _, _, _ in csv(cd('kb', 'en-loc.csv')):
    print(code, country)
data = csv()
data.append(('cat', 'Kitty'))
data.append(('cat', 'Simba'))
data.save(cd('cats.csv'))

Tools for Good

A challenge in AI is bias introduced by human trainers. Remember the Model trained earlier? Grasp has tools to explain how & why it makes decisions:

print(explain(vec('She hates dogs.'), m)) # why so negative?

In the returned dict, the model's explanation is: “you wrote hat + ate (hate)”.
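
One common way to produce such an explanation for a linear model is to rank the features of the input by how much each one pushes the score up or down. The sketch below shows that idea only. It is not Grasp's explain(), the weights are made up for illustration, and contributions is a hypothetical name.

```python
def contributions(weights, v, top=3):
    """Ranks features by the size of their contribution to the score."""
    scored = {f: weights.get(f, 0.0) * w for f, w in v.items()}
    return sorted(scored.items(),
                  key=lambda kv: abs(kv[1]), reverse=True)[:top]

# Made-up weights: negative values push toward '-', positive toward '+'.
weights = {'hat': -0.6, 'ate': -0.5, 'dog': 0.1}
v = {'hat': 1, 'ate': 1, 'dog': 1, 'she': 1}
print(contributions(weights, v))
```

With these weights, hat and ate come out on top, which matches the "hat + ate (hate)" reading above.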

README source: textgain/grasp