Python scripts to import StackExchange data dump into Postgres DB.
This is a quick script to move the Stackoverflow data from the StackExchange data dump (Sept '14) to a Postgres SQL database.
Schema hints are taken from a post on Meta.StackExchange and from StackExchange Data Explorer.
Install requirements, create a new database (e.g. beerSO
below), and use load_into_pg.py
script:
$ pip install -r requirements.txt
...
Successfully installed argparse-1.2.1 libarchive-c-2.9 lxml-4.5.2 psycopg2-binary-2.8.4 six-1.10.0
$ createdb beerSO
$ python load_into_pg.py -s beer -d beerSO
This will download compressed files from archive.org and load all the tables at once.
You can use a custom database name as well. Make sure to explicitly give it while executing the script later.
Each table data is archived in an XML file. Available tables varies accross
history. load_into_pg.py
knows how to handle the following tables:
Badges
.Posts
.Tags
(not present in earliest dumps).Users
.Votes
.PostLinks
.PostHistory
.Comments
.You can download manually the files to the folder from where the program is
executed: Badges.xml
, Votes.xml
, Posts.xml
, Users.xml
, Tags.xml
. In
some old dumps, the cases in the filenames are different.
Then load each file with e.g. python load_into_pg.py -t Badges
.
After all the initial tables have been created:
$ psql beerSO < ./sql/final_post.sql
For some additional indexes and tables, you can also execute the the following;
$ psql beerSO < ./sql/optional_post.sql
If you give a schema name using the -n
switch, all the tables will be moved
to the given schema. This schema will be created in the script.
The paths are not changed in the final scripts sql/final_post.sql
and
sql/optional_post.sql
. To run them, first set the search_path to your
schema name: SET search_path TO <myschema>;
Body
field in Posts
table is NOT populated by default. You have to use --with-post-body
argument to include it.EmailHash
field in Users
table is NOT populated.tags.xml
is missing from the data dump. Hence, the PostTag
and UserTagQA
tables will be empty after final_post.sql
.ViewCount
in Posts
is sometimes equal to an empty
value. It is replaced by NULL
in those cases.