Stack Exchange (e.g., Stack Overflow) data dump converter from XML to CSV format.
A CLI tool that converts Stack Exchange data dumps from XML to CSV or JSON, formats that are more suitable for importing into different databases.
Here you can find the examples of the schema for the different databases:
Before you start, ensure that you have:
- The Go compiler installed. To check, run the go version command. It should display the current version of the compiler.
- An archiver that can extract .7z files. A possible candidate is 7z.
Choose and download the database dump that you are going to convert.
Important: the Stack Overflow dump is stored in 8 separate 7z archives.
Extract the contents of the archive(s) into the directory from which you will convert the XML files.
Example with the academia.stackexchange.com.7z dump:
$ mkdir xml csv
$ 7z e academia.stackexchange.com.7z -oxml
$ ls xml/
Badges.xml Comments.xml PostHistory.xml PostLinks.xml Posts.xml Tags.xml Users.xml Votes.xml
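Before converting, it can help to see what the extracted files look like. In the public dumps each file is a flat list of row elements whose attributes hold the column values, and a row may omit attributes that have no value. The sketch below is not part of the converter: the sample XML is hand-written and the helper names (columnNames, sampleColumns) are made up for illustration. It collects the set of attribute names, which is roughly the set of columns you can expect in the converted output.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"sort"
	"strings"
)

// sampleBadges is a hand-written stand-in for a dump file such as Badges.xml.
const sampleBadges = `<?xml version="1.0" encoding="utf-8"?>
<badges>
  <row Id="1" UserId="2" Name="Autobiographer" Date="2012-02-14T20:25:35.193" />
  <row Id="2" UserId="3" Name="Teacher" />
</badges>`

// columnNames streams through the XML and returns the sorted union of
// attribute names found on <row> elements.
func columnNames(r io.Reader) ([]string, error) {
	dec := xml.NewDecoder(r)
	seen := map[string]bool{}
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		start, ok := tok.(xml.StartElement)
		if !ok || start.Name.Local != "row" {
			continue
		}
		for _, attr := range start.Attr {
			seen[attr.Name.Local] = true
		}
	}
	names := make([]string, 0, len(seen))
	for name := range seen {
		names = append(names, name)
	}
	sort.Strings(names)
	return names, nil
}

// sampleColumns runs columnNames on the hand-written sample above.
func sampleColumns() ([]string, error) {
	return columnNames(strings.NewReader(sampleBadges))
}

func main() {
	names, err := sampleColumns()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(strings.Join(names, ",")) // Date,Id,Name,UserId
}
```

To inspect a real file, pass an opened xml/Badges.xml (via os.Open) to columnNames instead of the sample string.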
Clone and build stackexchange-xml-converter:
$ git clone https://github.com/SkobelevIgor/stackexchange-xml-converter
$ cd stackexchange-xml-converter/
$ go build
Now you have the stackexchange-xml-converter executable file. Let’s convert the XML files to CSV format:
$ ./stackexchange-xml-converter -result-format=csv -source-path=../xml -store-to-dir=../csv
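Conceptually, this kind of conversion turns every row element into one CSV record. The following is a minimal sketch of that idea using only the Go standard library. It is an illustration, not the converter's actual implementation: the function names (rowsToCSV, convertSample), the fixed header, and the sample data are all assumptions.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// rowsToCSV streams <row> elements from r and writes one CSV record per row
// to w. header fixes the column order; attributes missing on a row become
// empty cells.
func rowsToCSV(r io.Reader, w io.Writer, header []string) error {
	dec := xml.NewDecoder(r)
	out := csv.NewWriter(w)
	if err := out.Write(header); err != nil {
		return err
	}
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}
		start, ok := tok.(xml.StartElement)
		if !ok || start.Name.Local != "row" {
			continue
		}
		values := make(map[string]string, len(start.Attr))
		for _, attr := range start.Attr {
			values[attr.Name.Local] = attr.Value
		}
		record := make([]string, len(header))
		for i, name := range header {
			record[i] = values[name]
		}
		if err := out.Write(record); err != nil {
			return err
		}
	}
	out.Flush()
	return out.Error()
}

// convertSample runs rowsToCSV on a hand-written sample and returns the CSV text.
func convertSample() (string, error) {
	const sample = `<tags>
  <row Id="1" TagName="go" Count="10" />
  <row Id="2" TagName="xml, csv" />
</tags>`
	var buf bytes.Buffer
	err := rowsToCSV(strings.NewReader(sample), &buf, []string{"Id", "TagName", "Count"})
	return buf.String(), err
}

func main() {
	text, err := convertSample()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(text)
}
```

Streaming token by token keeps memory use flat, which matters for large files such as Posts.xml. The csv package quotes the second TagName value because it contains a comma.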
result-format: (Required) Result format (csv or json).
source-path: (Required) Absolute or relative path to a directory containing XML files, or to a single XML file.
store-to-dir: (Optional) Absolute or relative path to the directory in which to store the result files.
skip-html-decoding: (Optional) Some of the files (e.g., Posts.xml) contain escaped HTML. By default, the converter decodes it. Use this flag to disable this behavior.
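To make the effect of skip-html-decoding concrete, here is a small sketch of what decoding escaped HTML means, using html.UnescapeString from the Go standard library. Whether the converter uses this exact call is an assumption, and decodeBody with its skip parameter is a hypothetical stand-in for the flag.

```go
package main

import (
	"fmt"
	"html"
)

// decodeBody mirrors the documented default: escaped HTML is decoded unless
// skip is true. skip is a hypothetical stand-in for the skip-html-decoding flag.
func decodeBody(body string, skip bool) string {
	if skip {
		return body
	}
	return html.UnescapeString(body)
}

func main() {
	raw := "&lt;p&gt;Tom &amp; Jerry&lt;/p&gt;"
	fmt.Println(decodeBody(raw, false)) // <p>Tom & Jerry</p>
	fmt.Println(decodeBody(raw, true))  // &lt;p&gt;Tom &amp; Jerry&lt;/p&gt;
}
```

Keeping the text escaped can be useful when a later step in your import pipeline expects entity-encoded input.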