Universal Reddit Scraper - a comprehensive Reddit scraping/archival command-line tool.

This release contains code cleanup: the project structure has been upgraded to a Poetry project, and a compute-heavy bottleneck within the program has been refactored in Rust, drastically improving performance.
- `taisun` - A Python module written in Rust that contains the depth-first search algorithm and associated data structures for structured comments scraping. This library will eventually contain additional code that handles compute-heavy tasks.
- `rust.yml` - Format and lint Rust code.
- `python.yml` - Format and test Python code.
- `manual.yml` - Build and deploy the mdBook manual to GitHub Pages.
- mdBook
- `.urs/`
- code (`YYYY-MM-DD HH:MM:SS`).
- `STYLE_GUIDE.md` - The style is dictated by Black and isort for Python code, and rustfmt for Rust.
- `README.md` - Black and isort.
- N/A
This release fixes an open issue.

PRAW v7.3.0 changed the `Redditor` object's `subreddit` attribute, which broke the Redditor scraper. It would be nice if all the tools worked as advertised.
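The fix amounts to normalizing whatever PRAW hands back for that attribute. Here is a minimal sketch of the idea, assuming a `UserSubreddit`-like object and an illustrative (not exhaustive) field list - this is not URS's actual implementation:

```python
from types import SimpleNamespace

def extract_user_subreddit(subreddit):
    """Normalize a Redditor's subreddit attribute into a plain dictionary.

    PRAW v7.3.0 replaced the old dict with a UserSubreddit object, so this
    sketch accepts both shapes. The field list is illustrative only.
    """
    fields = ["display_name", "name", "subscribers"]
    if isinstance(subreddit, dict):
        return {field: subreddit.get(field) for field in fields}
    return {field: getattr(subreddit, field, None) for field in fields}

# Either shape produces the same dictionary.
new_style = SimpleNamespace(display_name="u_example", name="t5_abc", subscribers=10)
old_style = {"display_name": "u_example", "name": "t5_abc", "subscribers": 10}
```

Downstream code then only ever sees a plain dictionary, regardless of the installed PRAW version.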
- `Redditor.py`:
  - `GetInteractions._get_user_subreddit()` - extracts `subreddit` data from the `UserSubreddit` object into a dictionary.
- `test_Redditor.py`:
  - `TestGetUserSubredditMethod().test_get_user_subreddit()` to test the new method.
- `Redditor.py`:
  - `GetInteractions._get_user_info()` calls the new `GetInteractions._get_user_subreddit()` method to set the Redditor's `subreddit` data within the main Redditor information dictionary.
- `Version.py`
- `README`
  - `-t`, which will display a visual tree of the current day's scrape directory by default. Optionally, include a different date to display that day's scrape directory.
- `-t`/`--tree` - display the directory structure of the current date directory, or optionally include a date to display that day's scrape directory.
- `Utilities.py` to the `urs/utils` module.
- `DateTree`, which contains methods to find and build a visual tree for the target date's directory.
- `README`: the `-t`/`--tree` and `--check` utility flags.
- `test_Utilities.py` under the `test_utils` module.
- `analytics` module:
  - `GetPath.get_scrape_type()`
  - `GetPath.name_file()`
  - `FinalizeWordcloud().save_wordcloud()`
  - Refactored to use `pathlib`'s `Path()` method to get the path.
- String formatting (the `%` operator) has been converted to the superior f-string.
- `pytest.yml` - Tests now run on all operating systems (`ubuntu-latest`, `macOS-latest`, and `windows-latest`), and test coverage is sent to Codecov after testing completes on `ubuntu-latest`.
- `README`
- Converted string formatting (the `%` operator) to the superior f-string in the following modules:
  - `test_utils/test_Export.py`
  - `test_praw_scrapers/test_live_scrapers/test_Livestream.py`
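The conversion pattern is mechanical; the strings below are illustrative, not actual URS messages:

```python
count = 3
subreddit = "askreddit"

# Old style: the "%" operator interpolates values into a format string.
old_message = "Scraping %d submissions from r/%s..." % (count, subreddit)

# New style: f-strings embed the expressions directly and read more clearly.
new_message = f"Scraping {count} submissions from r/{subreddit}..."
```

Both produce identical output; the f-string simply keeps each value next to where it appears.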
- `test_Export.py`:
  - `TestExportWriteCSVAndWriteJSON().test_write_csv()`
  - `TestExportExportMethod().test_export_write_csv()`
- `PULL_REQUEST_TEMPLATE.md`.
- `.travis.yml` - URS no longer uses Travis CI as its CI provider.
- `-lr` - livestream a Subreddit
- `-lu` - livestream a Redditor
- `--stream-submissions`
- `-v`/`--version` to display the version number.
- `live_scrapers` within `praw_scrapers` for livestream functionality:
  - `Livestream.py`
  - `utils/DisplayStream.py`
  - `utils/StreamGenerator.py`
- `Version.py` to single-source the package version.
- `gallery_data` and `media_metadata` check in `Comments.py`, which includes the above fields if the submission contains a gallery.
- `README`
- Tests for the `live_scrapers` module. These tests are located in `tests/test_praw_scrapers/test_live_scrapers`:
  - `tests/test_praw_scrapers/test_live_scrapers/test_Livestream.py`
  - `tests/test_praw_scrapers/test_live_scrapers/test_utils/test_DisplayStream.py`
  - `tests/test_praw_scrapers/test_live_scrapers/test_utils/test_StreamGenerator.py`
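Generator functions are a natural fit for livestreaming, since PRAW's streams are themselves lazy iterables. A hypothetical sketch of the pattern (names are illustrative; see `StreamGenerator.py` for the real implementation):

```python
from types import SimpleNamespace

def stream_generator(stream, attributes):
    """Lazily yield a trimmed dictionary for each object a stream emits."""
    for item in stream:
        yield {attribute: getattr(item, attribute, None) for attribute in attributes}

# Any iterable of attribute-bearing objects works, so a stub stream suffices.
fake_stream = [
    SimpleNamespace(id="abc", title="First post"),
    SimpleNamespace(id="def", title="Second post"),
]
```

In real use, `stream` would be a live PRAW stream such as `subreddit.stream.submissions()`, and the consumer would display each trimmed dictionary as it arrives.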
- `The Forest.md`
- `praw_scrapers` module:
  - `static_scrapers` sub-module:
    - `Basic.py`
    - `Comments.py`
    - `Redditor.py`
    - `Subreddit.py`
- Moved `confirm_options()`, previously located in `Subreddit.py`, to `Global.py`.
- Moved the `PrepRedditor.prep_redditor()` algorithm to its own class method, `PrepMutts.prep_mutts()`.
- Fixed the `KeyError` exception mentioned in the Issue Fix or Enhancement Request section.
- Removed the `init()` method from many modules - it only needs to be called once and is now located in `Urs.py`.
- `requirements.txt`
- `README`
- `DirInit.py`, since the `make_directory()` and `make_type_directory()` methods have been deprecated.
- `InitializeDirectory` class in `DirInit.py`:
  - `LogMissingDir.log()`
  - `create()`
  - `make_directory()`
  - `make_type_directory()`
  - `make_analytics_directory()`
- `create_dirs()` method.
- `--raw` flag.
- `Credentials.py` has been deprecated in favor of `.env` to avoid hard-coding API credentials.
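Storing credentials in the environment rather than in a tracked Python module looks roughly like this; the variable names are hypothetical (consult URS's docs for the exact keys its `.env` file expects):

```python
import os

REQUIRED_KEYS = ("CLIENT_ID", "CLIENT_SECRET", "USER_AGENT")

def load_credentials():
    """Read API credentials from the environment (e.g. populated from a
    .env file by a loader such as python-dotenv) instead of hard-coding them."""
    missing = [key for key in REQUIRED_KEYS if not os.getenv(key)]
    if missing:
        raise RuntimeError(f"Missing credentials: {', '.join(missing)}")
    return {key: os.environ[key] for key in REQUIRED_KEYS}
```

The `.env` file itself stays out of version control, so credentials never end up in the repository.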
- `Forest` and accompanying `CommentNode`.
  - `Forest` contains methods for inserting `CommentNode`s, including a depth-first search algorithm to do so.
- `Subreddit.py` has been refactored and submission metadata has been added to scrape files:
"author"
"created_utc"
"distinguished"
"edited"
"id"
"is_original_content"
"is_self"
"link_flair_text"
"locked"
"name"
"num_comments"
"nsfw"
"permalink"
"score"
"selftext"
"spoiler"
"stickied"
"title"
"upvote_ratio"
"url"
- `Comments.py` has been refactored and submission comments now include the following metadata:
  - `"author"`
  - `"body"`
  - `"body_html"`
  - `"created_utc"`
  - `"distinguished"`
  - `"edited"`
  - `"id"`
  - `"is_submitter"`
  - `"link_id"`
  - `"parent_id"`
  - `"score"`
  - `"stickied"`
- `Redditor.py` has been refactored on top of adding additional metadata:
  - `"has_verified_email"`
  - `"icon_img"`
  - `"subreddit"`
  - `"trophies"`
- `subreddit` objects are nested within `comment` and `submission` objects and contain the following metadata:
  - `"can_assign_link_flair"`
  - `"can_assign_user_flair"`
  - `"created_utc"`
  - `"description"`
  - `"description_html"`
  - `"display_name"`
  - `"id"`
  - `"name"`
  - `"nsfw"`
  - `"public_description"`
  - `"spoilers_enabled"`
  - `"subscribers"`
  - `"user_is_banned"`
  - `"user_is_moderator"`
  - `"user_is_subscriber"`
- `comment` objects will contain the following metadata:
  - `"type"`
  - `"body"`
  - `"body_html"`
  - `"created_utc"`
  - `"distinguished"`
  - `"edited"`
  - `"id"`
  - `"is_submitter"`
  - `"link_id"`
  - `"parent_id"`
  - `"score"`
  - `"stickied"`
  - `"submission"` - contains additional metadata
  - `"subreddit_id"`
- `submission` objects will contain the following metadata:
  - `"type"`
  - `"author"`
  - `"created_utc"`
  - `"distinguished"`
  - `"edited"`
  - `"id"`
  - `"is_original_content"`
  - `"is_self"`
  - `"link_flair_text"`
  - `"locked"`
  - `"name"`
  - `"num_comments"`
  - `"nsfw"`
  - `"permalink"`
  - `"score"`
  - `"selftext"`
  - `"spoiler"`
  - `"stickied"`
  - `"subreddit"` - contains additional metadata
  - `"title"`
  - `"upvote_ratio"`
  - `"url"`
- `multireddit` objects will contain the following metadata:
  - `"can_edit"`
  - `"copied_from"`
  - `"created_utc"`
  - `"description_html"`
  - `"description_md"`
  - `"display_name"`
  - `"name"`
  - `"nsfw"`
  - `"subreddits"`
  - `"visibility"`
- `interactions` are now sorted in alphabetical order.
- `--raw` - Export comments in raw format instead (structured format is the default).
- `.env` file to store API credentials.
- `README`
- `Status` class in `Global.py`.
- `Forest`.
- `--raw` flag to export to raw format.
- `submission_metadata` dictionary. `"data"` is now a dictionary that contains the submission metadata dictionary and scraped comments list. Comments are now stored in the `"comments"` field within `"data"`.
- The `--csv` flag is ignored if it is present while trying to use either scraper.
- The `created_utc` field for each Subreddit rule is now converted to readable time.
- `requirements.txt` has been updated.
  - `numpy` has dropped support for Python 3.6, which means Python 3.7+ is now required for URS.
- `.travis.yml` has been modified to exclude Python 3.6. Added Python 3.9 to the test configuration.
- `Validation.py`.
- `Urs.py` no longer pulls API credentials from `Credentials.py`, as it is now deprecated. Credentials are now pulled from the `.env` file.
- `Validation.py`, to ensure an extra Halo line is not rendered on failed credential validation.
- `README`
- `How to Get PRAW Credentials.md` to reflect new changes.
- `c_fname()` test, because submission comments scrapes now follow a different naming convention.
- Scraping `0` comments no longer exports all comments to raw format; it now defaults to structured format.
- `Global.py`:
  - `eo`
  - `options`
  - `s_t`
  - `analytical_tools`
- `Credentials.py` has been replaced with the `.env` file.
- The `LogError.log_login` decorator has been deprecated due to the refactor within `Validation.py`.
- The `--json` flag is deprecated.
- `-e` - Display additional example usage.
- `--check` - Runs a quick check for PRAW credentials and displays the rate limit table after validation.
- `--rules` - Include the Subreddit's rules in the scrape data (for JSON only). This data is included in the `subreddit_rules` field.
- `-f` - Word frequencies generator.
- `-wc` - Wordcloud generator.
- `--nosave` - Only display the wordcloud; do not save it to file.
- Fixed a bug where a missing `scrapes` directory would cause the new `make_analytics_directory()` method in `DirInit.py` to fail.
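The analytical tools reduce scraped text to simple aggregates. A word-frequency pass of the kind the `-f` generator performs might look like this (the tokenization rules here are illustrative, not URS's exact behavior):

```python
import re
from collections import Counter

def word_frequencies(lines):
    """Count word occurrences across scraped text, case-insensitively."""
    counter = Counter()
    for line in lines:
        counter.update(re.findall(r"[a-z0-9']+", line.lower()))
    return dict(counter.most_common())
```

Feeding it the `selftext` and `body` fields from a scrape file yields the `word: frequency` mapping described in the export structure below.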
- `README`
- The `--csv` flag is required to export to CSV instead.
- `scrape_details` field:
  - `subreddit`, `category`, `n_results_or_keywords`, and `time_filter`.
  - `redditor` and `n_results`.
  - `submission_title`, `n_results`, and `submission_url`.
- `data` field.
- `data` is a list containing submission objects.
- `data` is an object containing additional nested dictionaries:
  - `information` - a dictionary denoting Redditor metadata.
  - `interactions` - a dictionary denoting Redditor interactions (submissions and/or comments). Each interaction follows the Subreddit scrapes structure.
- `data` is a list containing additional nested dictionaries of `comment_id: SUBMISSION_METADATA`.
  - The `replies` field in the submission metadata holds a list of additional nested dictionaries of `comment_id: SUBMISSION_METADATA`. This pattern repeats down to third-level replies.
- `raw_file` field.
- `data` is a dictionary containing `word: frequency`.
- `scrapes.log` is now named `urs.log`.
- `Basic.py`, further streamlining conditionals in `Subreddit.py` and `Export.py`.
- `LogPRAWScraper` class in `Logger.py`.
- `not_found` list for submission comments scraping.
- `README`
- `PULL_REQUEST_TEMPLATE`:
- `STYLE_GUIDE`:
- `Releases`: moved from the `README` to a separate document.
- `GetPRAWScrapeSettings.get_settings()` to circumvent this issue.
- The `all` time filter would be applied to categories that do not support time filter use, resulting in errors while scraping.
- The `all` time filter or `None` is now set accordingly.
- `--json` flag, since it is now the default export option.
- `subreddits`, `redditors`, or `comments` directories.
  - The `redditors` directory will not be created if you never run the Redditor scraper.
- `README`: `tree` command.
- `STYLE_GUIDE` to reflect new changes, and made a minor change to the PRAW API walkthrough.
- Time filters for the `Controversial`, `Search`, and `Top` categories:
  - `all`
  - `day`
  - `hour`
  - `month`
  - `week`
  - `year`
- `.github/` directory: `STYLE_GUIDE`, and `PULL_REQUEST_TEMPLATE`.
- `README` to reflect new changes.
- Scrapes are now saved to the `scrapes/` directory within a subdirectory corresponding to the date of the scrape. These directories are automatically created for you when you run URS.
- `scrapes.log`. The log is stored in the same subdirectory corresponding to the date of the scrape.
- Moved `BUG_REPORT`, `CONTRIBUTING`, `FEATURE_REQUEST`, `PULL_REQUEST_TEMPLATE`, and `STYLE_GUIDE` to `docs/`.