Speedparser is a black-box "style" reimplementation of the
`Universal Feed Parser <http://code.google.com/p/feedparser/>`_. It uses some
feedparser code for date and authors, but mostly re-implements its data
normalization algorithms based on feedparser output. It uses lxml for feed
parsing and for optional HTML cleaning. Its compatibility with feedparser is
very good for a strict subset of fields, but poor for fields outside that
subset. See ``tests/speedparsertests.py`` for more information on which fields
are more or less compatible and which are not.

On an Intel(R) Core(TM) i5 750, running on only one core, feedparser managed
2.5 feeds/sec on the test feed set (roughly 4200 "feeds"), while speedparser
manages 200 feeds/sec with HTML cleaning off and a lower rate with cleaning
on.

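
To put those rates in perspective, the short calculation below converts them
into wall-clock time for the whole test set. It is only a back-of-the-envelope
sketch: it uses the two rates and the approximate feed count quoted above,
reads the 2.5 feeds/sec figure as feedparser's, and the helper name
``seconds_for`` is made up for illustration.

```python
FEED_COUNT = 4200  # approximate size of the test feed set


def seconds_for(feeds_per_sec, feed_count=FEED_COUNT):
    """Return the wall-clock seconds needed to parse feed_count feeds."""
    return feed_count / feeds_per_sec


slow = seconds_for(2.5)    # the slower figure, read here as feedparser's
fast = seconds_for(200.0)  # speedparser with HTML cleaning off
speedup = slow / fast      # how many times faster the second rate is
```

At these rates the full set takes about 28 minutes in the slow case and about
21 seconds in the fast case, a factor of 80.
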

Install with pip::

    pip install speedparser

Usage is similar to feedparser::

    >>> import speedparser
    >>> result = speedparser.parse(feed)
    >>> result = speedparser.parse(feed, clean_html=False)

There are a few interface differences and many result differences between
speedparser and feedparser. The biggest similarities are that they both return
a ``FeedParserDict()`` object (with keys accessible as attributes), they both
set the ``bozo`` key when an error is encountered, and various aspects of the
``feed`` and ``entries`` keys are likely to be identical or very similar.

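
The shape of such a result can be pictured with a small stand-in. The
``AttrDict`` class below is a hypothetical illustration written for this
README, not the actual ``FeedParserDict`` implementation, and the sample
values are invented; it only shows what "keys accessible as attributes" and
checking ``bozo`` look like in practice.

```python
class AttrDict(dict):
    """Hypothetical stand-in for a FeedParserDict-like result object."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so dict methods
        # such as .get() and .keys() keep working.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)


result = AttrDict(
    bozo=0,  # a parser would set this to a truthy value on error
    feed=AttrDict(title="Example feed"),
    entries=[AttrDict(title="First post", link="http://example.com/1")],
)

# Check bozo before trusting the result, then read keys as attributes.
if not result.bozo:
    titles = [entry.title for entry in result.entries]
```
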
speedparser uses different (and in some cases fewer or none; buyer beware)
data cleaning algorithms than feedparser. When cleaning is enabled, lxml's
``html.clean`` module will be used to clean HTML and give similar but not
identical protection against various attributes and elements. If you supply
your own ``Cleaner`` instance as the ``clean_html`` kwarg, it will be used by
speedparser to clean the various attributes of the feed and entries.

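
A custom cleaner might be built along these lines. This is a sketch that
assumes lxml's ``Cleaner`` class is importable from ``lxml.html.clean``; the
options chosen are only an example, and ``make_cleaner`` is a made-up helper
name. The import is guarded because recent lxml releases ship the cleaner as a
separate package.

```python
try:
    # Recent lxml releases ship the cleaner separately (lxml_html_clean),
    # so this import can fail even when lxml itself is installed.
    from lxml.html.clean import Cleaner
except ImportError:
    Cleaner = None


def make_cleaner():
    """Build a Cleaner with a few explicit options, or None if unavailable."""
    if Cleaner is None:
        return None
    return Cleaner(
        scripts=True,     # drop <script> elements
        javascript=True,  # drop javascript, e.g. on* attributes
        comments=True,    # drop HTML comments
        style=True,       # drop <style> elements (and, by default, style attributes)
    )


cleaner = make_cleaner()
if cleaner is not None:
    cleaned = cleaner.clean_html("<p>ok<script>alert(1)</script></p>")
    # Per the text above, the same object could then be handed to the parser:
    # result = speedparser.parse(feed, clean_html=cleaner)
```
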
speedparser does not attempt to fix character encoding by default because
this processing can take a long time for large feeds. If the encoding value of
the feed is wrong, or if you want this extra level of error tolerance, you
can either use the ``chardet`` module to detect the encoding based on the
document, or pass the corresponding option to ``speedparser.parse``, which
then falls back to encoding detection if it encounters encoding errors.

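
The "only detect when something goes wrong" idea can be sketched in plain
Python. This is a generic illustration, not speedparser's own code:
``decode_with_fallback`` and its fixed candidate list are assumptions made for
the example, and a detector such as ``chardet`` could take the place of the
candidate loop.

```python
def decode_with_fallback(data, declared, candidates=("utf-8", "latin-1")):
    """Decode bytes with the declared encoding, guessing only on failure.

    Returns a (text, encoding_used) tuple.
    """
    try:
        # Fast path: trust the declared encoding.
        return data.decode(declared), declared
    except (UnicodeDecodeError, LookupError):
        pass  # wrong or unknown encoding name; fall through to guessing

    # Slow path, taken only after an error. A detector such as chardet
    # could replace this fixed candidate list.
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("could not decode data with any candidate encoding")


raw = "caf\u00e9".encode("latin-1")  # bytes that are not valid UTF-8
text, used = decode_with_fallback(raw, "utf-8")
```

The cost of guessing is paid only for feeds whose declared encoding fails,
which is the trade-off described above.
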
If your application is using feedparser to consume many feeds at once and
CPU is becoming a bottleneck, you might want to try out speedparser as an
alternative (using feedparser as a backup). If you are writing an
application that does not ingest many feeds, or where CPU is not a problem,
you should use feedparser, as it is flexible with bad or malformed data and
has a much better test suite.
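
One way to arrange the "fast parser first, tolerant parser as backup" setup is
sketched below. The function ``parse_with_fallback`` and the two stand-in
parsers are hypothetical and exist only so the example runs without either
library installed; the only behaviour assumed of a real parser is that it
returns a dict-like result with a ``bozo`` key.

```python
def parse_with_fallback(feed, primary, backup):
    """Try the fast parser first; use the backup if it fails or flags bozo."""
    try:
        result = primary(feed)
    except Exception:
        # The fast parser could not cope with this feed at all.
        return backup(feed)
    if result.get("bozo"):
        # The fast parser reported a problem; let the tolerant one retry.
        return backup(feed)
    return result


# Stand-in parsers so the sketch runs without either library installed.
def fake_fast(feed):
    return {"bozo": 0 if feed.startswith("<") else 1, "parser": "fast"}


def fake_tolerant(feed):
    return {"bozo": 0, "parser": "tolerant"}


good = parse_with_fallback("<rss/>", fake_fast, fake_tolerant)
bad = parse_with_fallback("not xml", fake_fast, fake_tolerant)
```

With both libraries installed, the call would presumably look like
``parse_with_fallback(data, speedparser.parse, feedparser.parse)``.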