Query language for efficient data extraction from Wikipedia
WikipediaQL is an experimental query language and Python library for querying structured data from Wikipedia. It looks like this:
$ wikipedia_ql --page "Guardians of the Galaxy (film)" \
'{
page@title as "title";
section[heading="Cast"] as "cast" >> {
li >> text:matches("^(.+?) as (.+?):") >> {
text-group[group=1] as "actor";
text-group[group=2] as "character"
}
};
section[heading="Critical response"] >> {
sentence:contains("Rotten Tomatoes") as "RT ratings" >> {
text:matches("\d+%") as "percent";
text:matches("(\d+) (critic|review)") >> text-group[group=1] as "reviews";
text:matches("[\d.]+/10") as "overall"
}
}
}'
RT ratings:
overall: 7.8/10
percent: 92%
reviews: '334'
cast:
- actor: Chris Pratt
character: Peter Quill / Star-Lord
- actor: Zoe Saldaña
character: Gamora
- actor: Dave Bautista
character: Drax the Destroyer
- actor: Vin Diesel
character: Groot
- actor: Bradley Cooper
character: Rocket
- actor: Lee Pace
character: Ronan the Accuser
- actor: Michael Rooker
character: Yondu Udonta
- actor: Karen Gillan
character: Nebula
- actor: Djimon Hounsou
character: Korath
- actor: John C. Reilly
character: Rhomann Dey
- actor: Glenn Close
character: Irani Rael
- actor: Benicio del Toro
character: Taneleer Tivan / The Collector
title: Guardians of the Galaxy (film)
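To make the text:matches(...) >> text-group[...] step of the query above more concrete, here is a minimal sketch that applies the same regular expression with Python's standard re module. The sample list-item text and the parse_cast_item helper are assumptions made for illustration; this is not how the library itself is implemented.

```python
import re

# Pattern taken from the query above: lazily capture the actor and the
# character, stopping at the first colon.
CAST_PATTERN = re.compile(r"^(.+?) as (.+?):")


def parse_cast_item(text):
    """Return {'actor': ..., 'character': ...}, or None if the text doesn't match."""
    match = CAST_PATTERN.search(text)
    if match is None:
        return None
    # group(1) and group(2) play the role of text-group[group=1] / [group=2].
    return {"actor": match.group(1), "character": match.group(2)}


# Assumed sample text, shaped like an item of the Cast list.
sample = "Chris Pratt as Peter Quill / Star-Lord: The half-human leader of the Guardians."
print(parse_cast_item(sample))
```

The lazy quantifiers (.+?) matter here: a greedy pattern would run to the last " as " and the last colon in the item instead of the first ones.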
WikipediaQL-the-tool does roughly this:
The development of WikipediaQL is covered in an ongoing series of articles (newest first):
“Sometimes magic is just someone spending more time on something than anyone else might reasonably expect.” — Raymond Joseph Teller
Wikipedia is the most important open knowledge project: basically, the "table of contents" of all the human data. While it might be incomplete or misleading in details, the amount of data is incredible, and its organization makes all the data accessible to humans.
OTOH, the data is semi-structured and quite hard to extract automatically. This project is an experiment in making this data accessible to machines, or, rather, to humans with programming languages. The main goal is to develop a query language that is easy to use and memorize, unambiguous, and powerful, and to support it with a reference implementation.
See FAQ below for justifications of parsing Wikipedia instead of just using already formalized Wikidata (and of parsing HTML instead of Wikipedia markup).
$ pip install wikipedia_ql
$ wikipedia_ql --page "Page name" query_text
# or
$ wikipedia_ql query_text_with_page
Usage as Python library:
from wikipedia_ql import media_wiki
wikipedia = media_wiki.Wikipedia()
data = wikipedia.query(query_text)
A full WikipediaQL query looks like this:
from <source> {
<selectors>
}
When using the --page parameter of the executable, you only need to pass selectors in the query text.
Source is a Wikipedia article name, or a category name, or (in the future) other ways of specifying multiple pages. Selectors are similar to CSS; they are nested in one another with selector { other; selectors }, or (shortcut) selector >> other_selector. All terminal selectors (i.e., those with no other selectors nested inside) produce values in the output; a value can be given a name with as "valuename".
See below for a list of selectors and sources currently supported and the future ones.
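As a small illustration of the relationship between the two forms, the sketch below wraps a bare selector string (what you would pass together with --page) into a full from ... { ... } query for use from Python. The full_query helper is hypothetical and not part of wikipedia_ql; how double quotes inside page titles should be escaped is not covered by this README, so the sketch simply refuses such titles.

```python
def full_query(page_title, selectors):
    """Wrap bare selectors into a full 'from "<page>" { ... }' query string.

    Hypothetical helper for illustration; not part of wikipedia_ql.
    """
    if '"' in page_title:
        # Assumption: escaping rules for quotes in titles are unknown here.
        raise ValueError("page titles containing double quotes are not handled by this sketch")
    return 'from "{}" {{\n{}\n}}'.format(page_title, selectors.strip())


print(full_query("Pink Floyd", "page@title"))
```

The resulting string has the same shape as the queries passed to wikipedia.query(...) elsewhere in this README.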
Simple query for some info from the page:
$ wikipedia_ql --page "Pink Floyd" \
'section[heading="Discography"] >> li >> {
a as "title";
text:matches("\((.+)\)") >> text-group[group=1] as "year";
}'
- title: The Piper at the Gates of Dawn
year: '1967'
- title: A Saucerful of Secrets
year: '1968'
- title: More
year: '1969'
- title: Ummagumma
year: '1969'
# ...and so on...
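Note that the years in the output above are strings. Until typecasting is available in the query language, the conversion can be done on the client side. A minimal sketch, assuming the result is a list of dicts shaped like the output above (the with_int_years helper is hypothetical):

```python
def with_int_years(rows):
    """Return copies of the rows with 'year' converted to int where possible."""
    converted = []
    for row in rows:
        row = dict(row)  # copy, so the caller's data is not mutated
        year = row.get("year")
        if isinstance(year, str) and year.isdigit():
            row["year"] = int(year)
        converted.append(row)
    return converted


# Sample rows copied from the output above.
rows = [
    {"title": "The Piper at the Gates of Dawn", "year": "1967"},
    {"title": "A Saucerful of Secrets", "year": "1968"},
]
print(with_int_years(rows))
```

Values that are missing or not purely numeric are left untouched, which is a deliberately cautious choice for semi-structured data.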
Multi-page query from pages of some category (only from inside Python):
query = r'''
from category:"2020s American time travel television series" {
page@title as "title";
section[heading="External links"] >> {
li >> text:matches("^(.+?) at IMDb") >> text-group[group=1] >> a@href as "imdb"
}
}
'''
# iquery returns generator, fetching pages as you go
for row in wikipedia.iquery(query):
print(row)
# {'title': 'Agents of S.H.I.E.L.D.', 'imdb': 'https://www.imdb.com/title/tt2364582/'}
# {'title': 'The Flash (2014 TV series)', 'imdb': 'https://www.imdb.com/title/tt3107288/'}
# {'title': 'Legends of Tomorrow', 'imdb': 'https://www.imdb.com/title/tt4532368/'}
# ....
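Returning a generator means a consumer can stop early without paying for the remaining pages. The sketch below shows that general pattern with a fake fetch function; lazy_rows and fake_fetch_and_query are made-up names, and nothing here is taken from the actual iquery implementation.

```python
from itertools import islice


def lazy_rows(page_titles, fetch_and_query):
    """Yield one result row per page, fetching each page only when requested."""
    for title in page_titles:
        yield fetch_and_query(title)


fetched = []  # records which pages were actually "fetched"


def fake_fetch_and_query(title):
    # Stand-in for a network fetch plus query execution; purely illustrative.
    fetched.append(title)
    return {"title": title}


titles = ["Agents of S.H.I.E.L.D.", "The Flash (2014 TV series)", "Legends of Tomorrow"]

# Take only the first two rows; the third page is never requested.
first_two = list(islice(lazy_rows(titles, fake_fetch_and_query), 2))
print(first_two)
print(fetched)
```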
Navigating through pages in one query (note the ->, which means "perform subquery in the page by link"):
$ wikipedia_ql --page Björk \
'section[heading="Discography"] >> li >> a -> {
page@title as "title";
.infobox-image >> img >> @src as "cover"
}'
- cover: https://upload.wikimedia.org/wikipedia/en/thumb/7/77/Bj%C3%B6rk-Debut-1993.png/220px-Bj%C3%B6rk-Debut-1993.png
title: Debut (Björk album)
- cover: https://upload.wikimedia.org/wikipedia/en/thumb/3/3f/Bjork_Post.png/220px-Bjork_Post.png
title: Post (Björk album)
- cover: https://upload.wikimedia.org/wikipedia/en/thumb/a/af/Bj%C3%B6rk_-_Homogenic.png/220px-Bj%C3%B6rk_-_Homogenic.png
title: Homogenic
...
As the page source has to be fetched from Wikipedia every time, which can be a major slowdown when experimenting, wikipedia_ql implements super-naive caching:
wikipedia = media_wiki.Wikipedia(cache_folder='some/folder')
wikipedia.query('from "Pink Floyd" { page@title }') # fetches page from Wikipedia, then runs query
wikipedia.query('from "Pink Floyd" { page@title }') # gets the contents cached by the previous request from some/folder
(Caution! As stated above, for now the cache is super-naive: it just stores page contents in the specified folder forever. You can delete a page from the cache manually, though: there are just PageName.meta.json and PageName.json files.)
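Manual cleanup of a single page can be scripted. The following is a minimal sketch that assumes the cache files are named exactly as the page title plus the .json and .meta.json suffixes; the real cache may encode titles differently, so check the folder contents first. forget_page is a hypothetical helper, not part of wikipedia_ql.

```python
import tempfile
from pathlib import Path


def forget_page(cache_folder, page_name):
    """Delete the cached files for one page; return the file names removed."""
    removed = []
    for suffix in (".json", ".meta.json"):
        # Assumption: file name is the page title plus the suffix.
        path = Path(cache_folder) / (page_name + suffix)
        if path.exists():
            path.unlink()
            removed.append(path.name)
    return removed


# Demo against a throwaway folder populated with fake cache files.
with tempfile.TemporaryDirectory() as folder:
    for name in ("Pink Floyd.json", "Pink Floyd.meta.json", "Other Page.json"):
        (Path(folder) / name).write_text("{}", encoding="utf-8")
    removed = forget_page(folder, "Pink Floyd")
    remaining = sorted(p.name for p in Path(folder).iterdir())

print(removed)
print(remaining)
```

After removal, the next query for that page would have to fetch it from Wikipedia again.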
from <source> {
<selectors>
}
Source can be:
"Page title"
category:"Category title"
"Page title","Other page title" (several pages at once)
geo:"<lat>, <lng>"
prefix:"Page title prefix"
search:"Search string"
Selectors are CSS-like: type.class[attr="value"][otherattr="othervalue"]. Note that, unlike CSS, nesting (any child inside parent) is performed not with spaces (parent child), but with parent >> child.
Plain CSS selectors such as a or table.wikitable
CSS-style nesting with a space (selector nested_selector) is not supported, so you need to write li >> a to say "all links inside the list items"
section
section[heading="Section heading"]: fetch everything inside the section with the specified heading (the full heading text must match);
section:first (useful to fetch the article intro)
section: all sections;
section[level=3]: all sections of a particular level
heading value patterns would be supported (probably in a CSS-like manner: heading^="Starts from" and so on)
text
text:matches("pattern"): the part of the document matching the pattern (Python's regexp); the document's structure would be preserved, so you can nest CSS and other WikipediaQL selectors inside: li >> text:matches("^(.+?) as") >> a@href as "link"
text: without a pattern specification, just selects the entire text of the parent element;
text:imatches("pattern") (case-insensitive)
alt attribute as text
text-group: should be directly nested in a text pattern, refers to a capture group of the regexp; see the first example in the README;
text-group[group=1]: group by number
text-group[group="name"]: named groups
sentence
sentence:contains("pattern"): find the sentence where the pattern matches (the whole sentence is selected)
sentence: all sentences in the scope
sentence:first
page: refers (from any scope) to the entire current page; useful for re-nesting fetched data in a logical way and for including metadata attributes in the output (see below)
<selector>@<attribute>:
<css_selector>@<tag_attribute>
page@<page_attribute>
title
@title on the top level to fetch "current page's title" instead of page@title, and so on
table-data for data tables: from "Kharkiv" { section[heading="Climate"] >> table >> table-data >> tr[title^="Average high"] >> td[column="Jan"] }; see docs
table-data quasi-selector, see showcase
hatnote to fetch and process Hatnotes (special links at the top of the page/section, saying "for more information, see [here]", "this page is about X, if you want Y, go [there]" and so on)
:has: like the CSS :has pseudo-class, but supporting all the WikipediaQL selectors, so one might say from category:"Marvel Cinematic Universe films" { :has(@category*="films") { ...work with pages... } } to drop from the result set the pages of the category which aren't movies.
:primary or something like that (maybe :largest), to select the most important thing in the scope (for example, section[heading="Discography"] >> ul:primary will probably fetch the list of albums, while the section might have other, smaller lists, like the enumeration of studios where recordings were done)
parent >> child
{ selector1; selector2; selector3 }: fetch all the selectors in the result set
parent > child; maybe (given good reasons) other CSS relations like sibling1 + sibling2
section["Discography"] >> li >> a -> { selectors working inside the fetched page }, to allow expressing page navigation in a single query
as "variablename": every terminal selector with an associated name puts the extracted value as {"name": value}; there is still some uncertainty on how it all should be structured, but mostly the right thing is done
as :type and as "name":type for typecasting values
as "year":int maybe wouldn't be that necessary, as the conversion can be easily done in the client
as :html (as opposed to the current "content text only" extraction) might be useful in many cases
infobox as :hash or wikitable as :dataframe will change the usability of data extraction significantly
// text, as # can be the start of a valid CSS selector
lxml instead of BeautifulSoup, and a simpler sentence tokenizer
infobox and wikitable would be irrelevant though, and may need a replacement with site-specific ones
Wikidata is a massive effort to represent Wikipedia in a computable form. But currently, it contains much less data than Wikipedia itself, and is much less accessible for investigatory data extraction (TODO: good examples!). While it gets improved constantly, I wanted to tackle the problem from a different angle and see how accessible we can make Wikipedia itself, with all of its semi-structuredness.
Some similar projects (say, wtf_wikipedia) work by fetching the page source in Wikitext form and parsing it for data extraction. This road looks pretty tempting (and for several years, I followed it myself with the previous iteration: the infoboxer Ruby project). At first sight, Wikitext is better structured: large chunks of data are represented by templates like {{Infobox; field=value, ...}}, so it really seems like a better source for data extraction. There are two huge problems with this approach, though:
{{Infobox city in a pretty similar form, the eleventh will have {{Geobox capital with all the different fields and conventions, but in HTML they would render to the similar-looking <table class="infobox". Or, some TV series will represent a list of episodes with just plain table markup, while another will use a sophisticated {{Episode list template. And it all might change with time (some spring cleanup replacing all the template names or converting some regular text to a template). The HTML version is much more stable and predictable.
This project is the N-th iteration of the ideas about providing "common knowledge" in a computable form. Most of the previous work was done in Ruby and centered around the Reality project; it included, amongst other things, Infoboxer, the Wikipedia parser/high-level client, and MediaWiktory, the idiomatic low-level MediaWiki client. Some of that work is still to be incorporated into WikipediaQL and sister projects.
That project was once inspired by the "integrated knowledge" feature of Wolfram Language; I've talked about it (and other topics leading to this project) in a Twitter thread (yes).
The WikipediaQL syntax seems to be subconsciously inspired by the qsx selectors language. (By subconsciously I mean I don't remember thinking "Oh, I should do something similar", but the day I published WikipediaQL, the past.codes service reminded me that I had starred qsx in December 2020. I started to think about WikipediaQL syntax in June 2021, but there are striking similarities, so there must have been some indirect inspiration from that project.)
MIT