Learn how to use Kinesis Firehose, AWS Glue, S3, and Amazon Athena by streaming and analyzing Reddit comments in real time. This is a 100-200 level tutorial.
AWS provides several key services that make it easy to quickly deploy and manage data streaming in the cloud. Reddit is a popular social news aggregation, web content rating, and discussion website. At peak times, Reddit can see over 300,000 comments and 35,000 submissions an hour. The Reddit API offers developers a simple way to collect all of this data, which makes it a perfect use case for learning how to use Kinesis Firehose, S3, Glue, and Athena.
In this tutorial, you will play the role of a data architect looking to modernize a company’s streaming pipeline. You will create a Kinesis Firehose delivery stream from an EC2 server to an S3 data lake. With the help of AWS Glue and Amazon Athena, you’ll be able to develop insights on the data as it accumulates in your data lake.
This tutorial requires an AWS account and a Reddit account, which you will create next.
Follow the prompts to create a new Reddit account.
Once your account is created, go to the Reddit developer console.
Select “are you a developer? Create an app...”
Give it a name.
Select script. This is important!
For the about URL and redirect URI, use http://127.0.0.1
You will now get a client_id (shown under the app name) and a secret
Keep track of your Reddit account username, password, app client_id, and app secret. These will be used in tutorial Step 11.
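For context, these credentials are what the tutorial's Python application passes to PRAW (the Python Reddit API Wrapper) to authenticate and stream comments. A minimal sketch of how PRAW consumes them (all values are placeholders; the tutorial's own comment-stream.py may wire them up differently):

import praw

# Placeholders below: substitute the credentials you just collected.
reddit = praw.Reddit(client_id="<client_id>",
                     client_secret="<secret>",
                     username="<reddit username>",
                     password="<reddit password>",
                     user_agent="reddit-sentiment-tutorial")

# Stream new comments from all of Reddit as they are posted.
for comment in reddit.subreddit("all").stream.comments():
    print(comment.body)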
Open the Amazon S3 console
Choose Create bucket
In the Bucket name field, type a unique DNS-compliant name for your new bucket. Create your own bucket name using the following naming guidelines:
The name must be unique across all existing bucket names in Amazon S3
Example: reddit-analytics-bucket-<add random number here>
After you create the bucket, you cannot change the name, so choose wisely
Choose a bucket name that reflects the objects in the bucket, because the bucket name is visible in the URL that points to the objects you store there
For information about naming buckets, see Rules for Bucket Naming in the Amazon Simple Storage Service Developer Guide
For Region, choose US East (N. Virginia) as the region where you want the bucket to reside
Keep defaults and continue clicking Next
Choose Create
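If you prefer to script bucket creation instead of using the console, here is a minimal boto3 sketch (the bucket name is a placeholder and must be globally unique):

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# In us-east-1, create_bucket takes no CreateBucketConfiguration.
# Replace the bucket name with your own unique name.
s3.create_bucket(Bucket="reddit-analytics-bucket-12345")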
Now that you’ve created a bucket, let’s set up a delivery stream for your data.
In this step we will be using a tool called CloudFormation. Instead of going through the AWS console and creating Glue databases and Glue tables click by click, we can use CloudFormation to deploy the infrastructure quickly and easily.
We will use CloudFormation YAML templates located in this GitHub repository.
Go to the glue.yml file located here
Right-click anywhere and select Save as…
Rename the file from glue.txt to glue.yml
Select All Files as the file format and select Save
Open the AWS CloudFormation console
If this is a new AWS CloudFormation account, click Create New Stack. Otherwise, click Create Stack
In the Template section, select Upload a template file
Select Choose File and upload the newly downloaded glue.yml template
Decide on your stack name
Under pBucketName, set the bucket name from the previous step
Continue until the last step and click Create stack
Click on the Events tab. Wait until the stack status is CREATE_COMPLETE
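The same stack can be deployed programmatically if you prefer; a boto3 sketch, with placeholder stack and bucket names:

import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

with open("glue.yml") as f:
    template_body = f.read()

# Stack name and bucket name below are placeholders.
cfn.create_stack(
    StackName="reddit-glue-stack",
    TemplateBody=template_body,
    Parameters=[{"ParameterKey": "pBucketName",
                 "ParameterValue": "reddit-analytics-bucket-12345"}],
)

# Block until the stack reaches CREATE_COMPLETE.
cfn.get_waiter("stack_create_complete").wait(StackName="reddit-glue-stack")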
Open the Kinesis Data Firehose console or select Kinesis in the Services dropdown
Choose Create Delivery Stream
Delivery stream name – Type a name for the delivery stream
Example: raw-reddit-comment-delivery-stream
Keep the default settings in Step 1; you will be using Direct PUT as the source. Scroll down and click Next
In Step 2, enable record format conversion by using the following settings:
Click Next
On the Destination page, choose the following options
Destination – Choose Amazon S3
S3 bucket – Choose the existing bucket created in tutorial Step 6
S3 prefix – add "raw_reddit_comments/" as prefix
S3 error prefix - add "raw_reddit_comments_error/" as prefix
Choose Next
On the Configuration page, change the Buffer time to 60 seconds
For IAM Role, click Create new or choose
For the IAM Role summary, use the following settings:
Choose Allow
You should be returned to the delivery stream set-up steps in the Kinesis Data Firehose console
Choose Next
On the Review page, review your settings, and then choose Create Delivery Stream
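For context, the streaming application writes to this stream using a direct PUT. A minimal boto3 sketch of such a PUT (the record fields are assumptions that mirror the column names used in the Athena queries later in this tutorial):

import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# Example record; field names mirror the table columns queried later.
record = {
    "subreddit": "aws",
    "author_name": "example_user",
    "comment_body": "Kinesis Firehose makes streaming simple",
    "comment_tb_sentiment": 0.5,
}

firehose.put_record(
    DeliveryStreamName="raw-reddit-comment-delivery-stream",
    Record={"Data": json.dumps(record) + "\n"},
)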
Open the Amazon EC2 console or select EC2 under Services dropdown
In the navigation pane, under NETWORK & SECURITY, choose Key Pairs
Note: The navigation pane is on the left side of the Amazon EC2 console. If you do not see the pane, it might be minimized; choose the arrow to expand the pane
Choose Create Key Pair
For Key pair name, enter a name for the new key pair (ex: RedditBotKey), and then choose Create
The private key file is automatically downloaded by your browser. The base file name is the name you specified as the name of your key pair, and the file name extension is .pem. Save the private key file in a safe place
Important: This is the only chance for you to save the private key file. You'll need to provide the name of your key pair when you launch an instance and the corresponding private key each time you connect to the instance
A key pair will allow you to securely access a server. In the next steps, you will deploy the server.
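If you prefer to create the key pair programmatically, here is a boto3 sketch using the example key name from above (note the private key material is only returned once):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create the key pair and save the private key locally; this is the only
# time the key material is available, so store it safely.
resp = ec2.create_key_pair(KeyName="RedditBotKey")
with open("RedditBotKey.pem", "w") as f:
    f.write(resp["KeyMaterial"])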
In this step you will be using a tool called CloudFormation. Instead of going through the AWS console and creating an EC2 instance click by click, you can use CloudFormation to deploy the infrastructure quickly. This CloudFormation template includes EC2 user data that sets up the machine: it installs the Python streaming application, writes your Reddit credentials into its praw.ini, points it at your delivery stream, and starts streaming comments.
We will use CloudFormation YAML templates located in this GitHub repository.
Go to the ec2.yml file located here.
Right-click anywhere and select Save as…
Rename the file from ec2.txt to ec2.yml
Select All Files as the file format and select Save
Open the AWS CloudFormation console or select CloudFormation under the Services dropdown
Click Create New Stack / Create Stack
In the Template section, select Upload a template file
Select Choose File and upload the newly downloaded ec2.yml template
Click Next
Provide a stack name (ex: reddit-stream-server)
For pKeyName, provide the key name that you created in tutorial Step 9
Use your Reddit app info and Reddit account credentials for the parameters pRedditAppSecret, pRedditClientID, pRedditUsername, and pRedditPassword
You can choose to leave the rest of the parameters as their default values.
Continue to click Next
On the last step, acknowledge IAM resource creation and click Create Stack
Wait for your EC2 instance to be created.
Make a note of the Public IP and Public DNS Name given to the newly created instance. You can find these in the CloudFormation Outputs tab.
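The same outputs can be read programmatically; a boto3 sketch, assuming the example stack name from above:

import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

# Print the stack outputs (e.g. the instance's public IP and DNS name).
stack = cfn.describe_stacks(StackName="reddit-stream-server")["Stacks"][0]
for output in stack.get("Outputs", []):
    print(output["OutputKey"], "=", output["OutputValue"])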
Open the Amazon EC2 console or select EC2 under Services dropdown
Select INSTANCES in the navigation pane
Ensure that an EC2 instance has been created and is running. (This can take several minutes to deploy)
Open the Amazon Kinesis Firehose console or select Kinesis in the Services dropdown
Select the delivery stream created in Step 8.
Select the Monitoring tab
Click the refresh button over the next 3 minutes. You should start to see records coming in
If you are still not seeing data after 3-5 minutes, go to Appendix I for troubleshooting.
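You can also verify ingestion from CloudWatch, where Firehose publishes an IncomingRecords metric; a boto3 sketch using the example delivery stream name:

import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client("cloudwatch", region_name="us-east-1")

# Sum of records received by the stream over the last 15 minutes.
stats = cw.get_metric_statistics(
    Namespace="AWS/Firehose",
    MetricName="IncomingRecords",
    Dimensions=[{"Name": "DeliveryStreamName",
                 "Value": "raw-reddit-comment-delivery-stream"}],
    StartTime=datetime.now(timezone.utc) - timedelta(minutes=15),
    EndTime=datetime.now(timezone.utc),
    Period=60,
    Statistics=["Sum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])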
Now let’s check the S3 bucket
Open the Amazon S3 console or select S3 in the Services dropdown
Click the name of the bucket that you created in step 2
Verify that records are being PUT into your S3 bucket
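The same check can be scripted; a boto3 sketch (the bucket name is a placeholder; the prefix is the one you configured on the delivery stream):

import boto3

s3 = boto3.client("s3")

# List the objects Firehose has delivered so far.
resp = s3.list_objects_v2(Bucket="reddit-analytics-bucket-12345",
                          Prefix="raw_reddit_comments/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])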
Now that data is streaming into S3, let’s build a data catalog so that you can query your S3 files
Open the Amazon Athena console or select Athena in the Services dropdown
Choose the Glue database (reddit_glue_db) shown in the left panel
Select the table (raw_reddit_comments) to view the table schema
You should now be able to use SQL to query the table (S3 data)
Here are some example queries to begin exploring the data streaming into S3:
-- total number of comments
select count(*)
from raw_reddit_comments;
-- general sentiment of Reddit today (replace the date with today's date)
select round(avg(comment_tb_sentiment), 4) as avg_comment_tb_sentiment
from raw_reddit_comments
where comment_date like '%2019-08-22%';
-- total comments collected per subreddit
select count(*) as num_comments, subreddit
from raw_reddit_comments
group by subreddit
order by num_comments DESC;
-- average sentiment per subreddit
select round(avg(comment_tb_sentiment), 4) as avg_comment_tb_sentiment, subreddit
from raw_reddit_comments
group by subreddit
order by avg_comment_tb_sentiment DESC;
-- list all subreddits
select distinct subreddit
from raw_reddit_comments;
-- top 10 most positive comments for a given subreddit (substitute a name for ${subreddit})
select subreddit, comment_body
from raw_reddit_comments
where subreddit = '${subreddit}'
order by comment_tb_sentiment DESC
limit 10;
-- most active subreddits and their sentiment
select subreddit, count(*) as num_comments, round(avg(comment_tb_sentiment), 4) as avg_comment_tb_sentiment
from raw_reddit_comments
group by subreddit
order by num_comments DESC;
-- search term frequency by subreddit (only subreddits with more than 5 matching comments)
select subreddit, count(*) as comment_occurrences
from raw_reddit_comments
where LOWER(comment_body) like '%puppy%'
group by subreddit
having count(*) > 5
order by comment_occurrences desc;
-- search term sentiment by subreddit
select subreddit, round(avg(comment_tb_sentiment), 4) as avg_comment_tb_sentiment
from raw_reddit_comments
where LOWER(comment_body) like '%puppy%'
group by subreddit
having count(*) > 5
order by avg_comment_tb_sentiment desc;
-- top 25 most positive comments about a search term
select subreddit, author_name, comment_body, comment_tb_sentiment
from raw_reddit_comments
where LOWER(comment_body) like '%puppy%'
order by comment_tb_sentiment desc
limit 25;
-- total sentiment for search term
select round(avg(comment_tb_sentiment), 4) as avg_comment_tb_sentiment
from (
    select subreddit, author_name, comment_body, comment_tb_sentiment
    from raw_reddit_comments
    where LOWER(comment_body) like '%puppy%');
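These queries can also be run outside the console. A minimal boto3 Athena sketch (the S3 output location for query results is a placeholder you must supply):

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Start a query against the Glue catalog; results land in the S3 location given.
qid = athena.start_query_execution(
    QueryString="select count(*) from raw_reddit_comments",
    QueryExecutionContext={"Database": "reddit_glue_db"},
    ResultConfiguration={"OutputLocation": "s3://<your-bucket>/athena-results/"},
)["QueryExecutionId"]

# Poll until the query finishes, then print the rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])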
EC2 – Our EC2 instance was created from a CloudFormation template, so we’ll delete the stack and the key pair
Kinesis – Delete the delivery stream in the Kinesis Data Firehose console
Glue – Delete the Glue CloudFormation stack, which removes the Glue database and table
S3 – Empty and then delete the bucket
In this tutorial, you have walked through the process of deploying a sample Python application that uses the Reddit API and AWS SDK for Python to stream Reddit data into Amazon Kinesis Firehose. You learned basic operations to deploy a real-time data streaming pipeline and data lake. Finally, you developed insights on the data using Amazon Athena’s ad-hoc SQL querying.
Find the Public IP address that you noted down in Step 10 and the key pair you downloaded in Step 9.
Open up a Terminal
Go to the directory that your key pair was downloaded to.
Ensure the key has the correct permissions:
chmod 400 <key pair name>.pem
SSH into the machine with the following command:
ssh -i <insert your key pair name here> ec2-user@<insert public IP address here>
Confirm that the correct credentials have been added to your application with the following command:
sudo cat /reddit/analyzing-reddit-sentiment-with-aws/python-app/praw.ini
Confirm that the correct delivery stream name was added to your application with the following command. Look for DeliveryStreamName='<your delivery stream name>'
sudo cat /reddit/analyzing-reddit-sentiment-with-aws/python-app/comment-stream.py
If errors are found, delete the CloudFormation stack that didn’t work properly, then go back and retry Step 10. If there are no errors, you can check the logs:
sudo tail /tmp/reddit-stream.log
Some common errors include:
DEBUG:prawcore:Response: 503 (Reddit servers are down)
DEBUG:prawcore:Response: 502 (Reddit server request error)
DEBUG:prawcore:Response: 403 (Your Reddit username/password is incorrect)
This sample code is made available under the MIT-0 license. See the LICENSE file.