Download subreddit comments
Download all the text comments from a subreddit
Use the script subreddit_downloader.py
multiple times to download the data,
then run the script dataset_builder.py to build a single
dataset.
🖱 More info on website and medium.
Basic usage for downloading submissions and their comments from the subreddits AskReddit and News:
```bash
# Use python 3.8.5
# Install the dependencies
pip install -r requirements.txt

# Download the comments of the last 30 AskReddit submissions
python src/subreddit_downloader.py AskReddit --batch-size 10 --laps 3 --reddit-id <reddit_id> --reddit-secret <reddit_secret> --reddit-username <reddit_username>

# Download the News comments created after 1 January 2021
python src/subreddit_downloader.py News --batch-size 512 --laps 3 --reddit-id <reddit_id> --reddit-secret <reddit_secret> --reddit-username <reddit_username> --utc-after 1609459201

# Build the dataset; the results will be under the `./dataset/` path
python src/dataset_builder.py
```
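The `--utc-after`/`--utc-before` values are Unix timestamps; a quick way to compute them with the Python standard library (the helper name below is just for illustration):

```python
from datetime import datetime, timezone

def to_utc_timestamp(year: int, month: int, day: int) -> int:
    """Convert a calendar date (at midnight UTC) to the Unix timestamp
    expected by the --utc-after/--utc-before parameters."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())

# Midnight UTC on 1 January 2021 -> 1609459200
print(to_utc_timestamp(2021, 1, 1))
```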
<...>
Details of the parameters used in the previous script:

Parameter name | Description | How to get it | Example value |
---|---|---|---|
`reddit_id` | The Client ID generated from the apps page | Official guide | 40oK80pF8ac3Cn |
`reddit_secret` | The secret generated from the apps page | Copy the value as shown here | 9KEUOE7pi8dsjs9507asdeurowGCcg |
`reddit_username` | The Reddit account name | The name you use to log in | pistoSniffer |
dataset_builder.py creates a new folder containing two CSV files. The script has some notable features: rows are deduplicated by the `id` field, and the `caching_size` parameter avoids keeping the whole dataset in RAM.
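The memory-friendly idea behind `caching_size` can be sketched with a streaming read that never loads the whole CSV at once; the helper below is illustrative, not the script's actual code:

```python
import csv

def count_unique_ids(csv_path: str) -> int:
    """Stream the CSV row by row so the whole dataset never sits in RAM,
    keeping only the set of `id` values seen so far."""
    seen = set()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            seen.add(row["id"])
    return len(seen)
```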
The two CSV files have the following structure.

Submissions: each row is a submission of a specific subreddit, and the `id`
field is unique across the dataset (primary key).
Column name | Description | Example |
---|---|---|
subreddit | Name of the subreddit | MTB |
id | Unique identifier of the submission | lhr2bo |
created_utc | UTC timestamp when the submission was created | 1613068060 |
title | Title of the submission | Must ride So... |
selftext | Text of the submission | What are the best trails to ride in... |
full_link | Reddit unique link to the submission | https://www.reddit.com/r/MTB/comments/lhr2bo/must_ride_so_cali_trails/ |
Comments: each row is a comment under a submission of a specific subreddit, and the `id`
field is unique across the dataset (primary key).
Column name | Description | Example |
---|---|---|
subreddit | Name of the subreddit | News |
id | Unique identifier of the comment | gmz45xo |
submission_id | Id of the submission the comment belongs to | lhr2bo |
body | Text of the comment | We're past the point... |
created_utc | UTC timestamp when the comment was created | 1613072734 |
parent_id | Id of the parent in a tree structure | t3_lhssi4 |
permalink | Reddit unique link to the comment | /r/news/comments/lhssi4/air_force_wants_to_know_if_key_pacific_airfield/gmz45xo/ |
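Since `submission_id` in the comments file references `id` in the submissions file, the two CSVs can be joined; a minimal sketch assuming pandas is available, with toy rows built from the example values above:

```python
import pandas as pd

# Toy rows mirroring the submissions and comments tables above
submissions = pd.DataFrame({
    "id": ["lhr2bo"],
    "subreddit": ["MTB"],
    "title": ["Must ride So..."],
})
comments = pd.DataFrame({
    "id": ["gmz45xo"],
    "submission_id": ["lhr2bo"],
    "body": ["We're past the point..."],
})

# Attach each comment to its parent submission via the submission_id foreign key
merged = comments.merge(
    submissions, left_on="submission_id", right_on="id",
    suffixes=("_comment", "_submission"),
)
print(merged[["id_comment", "title", "body"]])
```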
- subreddit: a section of the Reddit website focused on a particular topic
- submission: a post that appears in a subreddit; when you open a subreddit page, the posts you see are submissions. Each submission has a tree of _comments_
- comment: text written by a Reddit user under a submission inside a subreddit
Notes about `subreddit_downloader.py`: the script stores the downloaded data under

```
/data/<subreddit>/<timestamp>/comments/xxx.csv
/data/<subreddit>/<timestamp>/submissions/xxx.csv
```

If the program freezes, run it with the `--debug` flag to see in which submission it is stuck, look at the equivalent file (same `xxx.csv` name) under the submissions folder, and open the submission link. To limit the comments downloaded, use the `--comments-cap` parameter: when provided, the script requests new comments from the praw API at most `comments_cap` times instead of downloading all of them.

More details about the script are available under the `--help` command:
```
python src/subreddit_downloader.py --help
Usage: subreddit_downloader.py [OPTIONS] SUBREDDIT

  Download all the submissions and relative comments from a subreddit.

Arguments:
  SUBREDDIT  The subreddit name  [required]

Options:
  --output-dir TEXT       Optional output directory  [default: ./data/]
  --batch-size INTEGER    Request `batch_size` submissions per time  [default: 10]
  --laps INTEGER          How many times to request `batch_size` reddit
                          submissions  [default: 3]
  --reddit-id TEXT        Reddit client_id, visit
                          https://github.com/reddit-archive/reddit/wiki/OAuth2
                          [required]
  --reddit-secret TEXT    Reddit client_secret, visit
                          https://github.com/reddit-archive/reddit/wiki/OAuth2
                          [required]
  --reddit-username TEXT  Reddit username, used to build the `user_agent`
                          string, visit
                          https://github.com/reddit-archive/reddit/wiki/API
                          [required]
  --utc-after TEXT        Fetch the submissions after this UTC date
  --utc-before TEXT       Fetch the submissions before this UTC date
  --comments-cap INTEGER  Some submissions have more than 10k nested comments
                          and stall the praw API call. If provided, the system
                          requests new comments `comments_cap` times from the
                          praw API. Under the hood `comments_cap` is passed
                          directly to the `replace_more` function as the
                          `limit` parameter. For more info see the README and
                          visit https://asyncpraw.readthedocs.io/en/latest/code_overview/other/commentforest.html#asyncpraw.models.comment_forest.CommentForest.replace_more.
  --debug / --no-debug    Enable debug logging  [default: False]
  --install-completion    Install completion for the current shell.
  --show-completion       Show completion for the current shell, to copy it or
                          customize the installation.
  --help                  Show this message and exit.
```
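The capped behavior of `--comments-cap` can be illustrated with a toy tree walk that stops resolving "more comments" placeholders after `comments_cap` expansions, mirroring what passing `limit` to `replace_more` does in praw; the dict-based tree below is purely illustrative, not praw's API:

```python
def collect_comments(tree, comments_cap=None):
    """Walk a nested comment tree, but resolve at most `comments_cap`
    'more comments' placeholders (a None cap resolves them all)."""
    collected, expansions = [], 0
    stack = list(tree)
    while stack:
        node = stack.pop()
        if node.get("more"):  # placeholder that would need another API call
            if comments_cap is not None and expansions >= comments_cap:
                continue      # cap reached: skip the remaining placeholders
            expansions += 1
            stack.extend(node["more"])
        else:
            collected.append(node["body"])
    return collected

tree = [
    {"body": "top-level comment"},
    {"more": [
        {"body": "reply"},
        {"more": [{"body": "deep reply"}]},
    ]},
]
print(collect_comments(tree))                  # all comments resolved
print(collect_comments(tree, comments_cap=1))  # only one expansion allowed
```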
TODO:

- `subreddit_downloader.py`: read the script parameters from a local config file
- `dataset_builder.py:_rows_parser`: find a more efficient approach to check `id`
  duplicates
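For the `_rows_parser` duplicate-check TODO, one common approach is a set-based membership test, which costs O(1) per row instead of rescanning previous rows; a hypothetical sketch, not the script's actual code:

```python
def drop_duplicate_ids(rows):
    """Keep the first occurrence of each `id`, using a set for
    constant-time duplicate lookups per row."""
    seen, unique = set(), []
    for row in rows:
        if row["id"] in seen:
            continue  # duplicate id: skip the row
        seen.add(row["id"])
        unique.append(row)
    return unique
```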