Natural Language Processing of Chicago news articles
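tagnews is a Python library that can tag the text of news articles with type-of-crime categories, extract the location strings mentioned in the text, geocode those strings to latitude/longitude pairs, and look up the community area each pair falls in, using the shapely Python library. Sound interesting? There's example usage below!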
You can find the source code on GitHub.
You can install tagnews with pip:
pip install tagnews
NOTE: You will need to install some NLTK packages as well:
>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('wordnet')
Beware, tagnews requires Python >= 3.5.
The main classes are tagnews.CrimeTags and tagnews.GeoCoder.
>>> import tagnews
>>> crimetags = tagnews.CrimeTags()
>>> article_text = ('The homicide occurred at the 1700 block of S. Halsted Ave.'
... ' It happened just after midnight. Another person was killed at the'
... ' intersection of 55th and Woodlawn, where a lone gunman')
>>> crimetags.tagtext_proba(article_text)
HOMI 0.739159
VIOL 0.146943
GUNV 0.134798
...
>>> crimetags.tagtext(article_text, prob_thresh=0.5)
['HOMI']
>>> geoextractor = tagnews.GeoCoder()
>>> prob_out = geoextractor.extract_geostring_probs(article_text)
>>> list(zip(*prob_out))
[..., ('at', 0.0044685714), ('the', 0.005466637), ('1700', 0.7173856),
('block', 0.81395197), ('of', 0.82227415), ('S.', 0.7940061),
('Halsted', 0.70529455), ('Ave.', 0.60538065), ...]
>>> geostrings = geoextractor.extract_geostrings(article_text, prob_thresh=0.5)
>>> geostrings
[['1700', 'block', 'of', 'S.', 'Halsted', 'Ave.'], ['55th', 'and', 'Woodlawn,']]
>>> coords, scores = geoextractor.lat_longs_from_geostring_lists(geostrings)
>>> coords
         lat       long
0  41.859021 -87.646934
1  41.794816 -87.597422
>>> scores # confidence in the lat/longs as returned by pelias, higher is better
array([0.878, 1. ])
>>> geoextractor.community_area_from_coords(coords)
['LOWER WEST SIDE', 'HYDE PARK']
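The prob_thresh argument in the example above can be pictured with a small standalone sketch. This is not the tagnews implementation: the function names tags_above_threshold and group_tokens_above_threshold are hypothetical, and treating the threshold as inclusive (>=) is an assumption. It only illustrates how per-tag probabilities could be cut down to a tag list, and how per-token probabilities could be grouped into geostrings, using the numbers printed above.

```python
from typing import Dict, List, Sequence


def tags_above_threshold(tag_probs: Dict[str, float],
                         prob_thresh: float = 0.5) -> List[str]:
    """Return the tags whose probability is at least prob_thresh,
    ordered from most to least probable. (Hypothetical helper.)"""
    kept = [(tag, p) for tag, p in tag_probs.items() if p >= prob_thresh]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [tag for tag, _ in kept]


def group_tokens_above_threshold(tokens: Sequence[str],
                                 probs: Sequence[float],
                                 prob_thresh: float = 0.5) -> List[List[str]]:
    """Group runs of consecutive tokens whose probability is at least
    prob_thresh; each run becomes one candidate geostring.
    (Hypothetical helper; inclusive threshold is an assumption.)"""
    groups: List[List[str]] = []
    current: List[str] = []
    for token, p in zip(tokens, probs):
        if p >= prob_thresh:
            current.append(token)
        elif current:
            # A low-probability token ends the current run.
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


# Values copied from the example session above.
tag_probs = {'HOMI': 0.739159, 'VIOL': 0.146943, 'GUNV': 0.134798}
print(tags_above_threshold(tag_probs, prob_thresh=0.5))
# ['HOMI']

tokens = ['at', 'the', '1700', 'block', 'of', 'S.', 'Halsted', 'Ave.']
probs = [0.0044685714, 0.005466637, 0.7173856, 0.81395197,
         0.82227415, 0.7940061, 0.70529455, 0.60538065]
print(group_tokens_above_threshold(tokens, probs, prob_thresh=0.5))
# [['1700', 'block', 'of', 'S.', 'Halsted', 'Ave.']]
```

Lowering prob_thresh keeps more tags and produces longer or more numerous geostrings, at the cost of more false positives; raising it does the opposite.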
This project uses Machine Learning to automate data cleaning and preparation tasks that would be prohibitively expensive and slow to perform by hand. Like all Machine Learning projects, the results are not perfect, and in some cases may look just plain bad.
We strived to build the best models possible, but perfect accuracy is rarely possible. If you have thoughts on how to do better, please consider reporting an issue, or better yet contributing.
To contribute, please see CONTRIBUTING.md.
If you have problems, please report an issue. Anything that behaves unexpectedly is an issue and should be reported, including bad or unexpected results. We may not be able to do anything about it, but more data rarely degrades performance.
We want to compare how often different types of crime are reported in certain areas with how often they actually occur in those areas. In essence, are some crimes under-represented in certain areas but over-represented in others? This is the main question driving the analysis.
This question came from the Chicago Justice Project. They have been interested in answering this question for quite a while, and have been collecting the data necessary to have a data-backed answer. Their efforts include
Most of the code for those components can be found here.
A group actively working on this project meets every Tuesday at Chi Hack Night.