Twitter analytics with Amazon EMR and DynamoDB. Part 1

A step-by-step guide to Twitter analytics tasks using Elastic MapReduce, DynamoDB and Amazon Data Pipeline.

In this post I will use the Flume agent configured in the previous post to deliver raw JSON data to S3 storage. By “Twitter analytics” I mean aggregations such as “Top 100 users mentioned per day” and “Top 100 URLs mentioned per day”.

A similar overall process is described in the Amazon blog, but here I will cover a few more details related to the Twitter data.

Example of EMR cluster configuration:

EMR_TestSetup

Later in the post I will use the AWS CLI tool to create and deploy jobs to the cluster, so it is worth setting it up as well. For the initial AWS CLI configuration run the following command:
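
A minimal sketch, assuming the AWS CLI is already installed; it prompts for your own credentials, region and output format:

    aws configure
    # the command prompts for:
    #   AWS Access Key ID
    #   AWS Secret Access Key
    #   Default region name (e.g. us-east-1)
    #   Default output format (e.g. json)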

You can also follow the instructions provided here if you have any issues with the CLI configuration.

Once you are done with the test cluster and the AWS CLI configuration, we can proceed with the actual data processing. In this post I will create the following:

  • Hive metadata for an S3 external table and a DynamoDB external table
  • A Hive script to load aggregated data from S3 into DynamoDB on a daily basis
  • CMD and bash scripts to deploy and run everything on an EMR cluster with bootstrap actions and custom cluster steps

So, let’s prepare everything step by step.

Hive tables

Our Twitter agent downloads the data in JSON format. To read the data from a Hive table we need a JSON SerDe library. Here is the most commonly used one: https://github.com/rcongiu/Hive-JSON-Serde. It is not shipped with Hive by default, so we need to compile it and put it into the deployment folder.

Here I created a table with all possible fields from Twitter. In fact this is not necessary; you can create a table with only the fields needed for your processing.
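
A trimmed-down sketch of such a table definition; the table name, S3 bucket, partition column and jar path are placeholders, and the full version simply lists more fields from the tweet JSON:

    -- make the compiled JSON SerDe available to the session;
    -- the jar is copied to the cluster by the bootstrap action shown later
    ADD JAR /home/hadoop/json-serde-1.3-jar-with-dependencies.jar;

    CREATE EXTERNAL TABLE IF NOT EXISTS tweets_raw (
      id          BIGINT,
      created_at  STRING,
      text        STRING,
      `user`      STRUCT<screen_name:STRING, followers_count:INT>,
      entities    STRUCT<
                    urls:ARRAY<STRUCT<expanded_url:STRING>>,
                    user_mentions:ARRAY<STRUCT<screen_name:STRING>>>
    )
    PARTITIONED BY (dt STRING)
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION 's3://my-twitter-bucket/raw/';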

The second table should be mapped to DynamoDB:
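
A sketch of the mapping; the DynamoDB table name and attribute names below are placeholders and must match your own table:

    -- Hive table backed by DynamoDB via the EMR storage handler
    CREATE EXTERNAL TABLE IF NOT EXISTS tweets_aggregation_ddb (
      aggregation_type STRING,
      mention_date     STRING,
      name             STRING,
      cnt              BIGINT
    )
    STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
    TBLPROPERTIES (
      "dynamodb.table.name" = "twitter_aggregation",
      "dynamodb.column.mapping" = "aggregation_type:aggregation_type,mention_date:mention_date,name:name,cnt:cnt"
    );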

The DynamoDB table itself looks like this:

DynamoDB twitter table properties

Aggregation script

Add the existing S3 partitions to the Hive metadata:
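
A sketch, assuming the Flume sink writes into dt=YYYYMMDD prefixes under the table location (layout and variable usage are placeholders):

    -- register the partition for the day being processed
    ALTER TABLE tweets_raw ADD IF NOT EXISTS
      PARTITION (dt = '${LAST_SUCCESS_DAY}')
      LOCATION 's3://my-twitter-bucket/raw/dt=${LAST_SUCCESS_DAY}/';

    -- if the S3 layout follows the Hive partition naming convention,
    -- all partitions can be discovered at once instead:
    -- MSCK REPAIR TABLE tweets_raw;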

Load data from S3 into DynamoDB (Top 100 users mentioned per day):
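
A sketch of the query, using the placeholder table and column names from above; the day to process comes in through the LAST_SUCCESS_DAY variable:

    -- top 100 mentioned users for the given day, written straight to DynamoDB
    INSERT OVERWRITE TABLE tweets_aggregation_ddb
    SELECT
      'TOP100_USER_MENTIONS'  AS aggregation_type,
      '${LAST_SUCCESS_DAY}'   AS mention_date,
      mention.screen_name     AS name,
      COUNT(*)                AS cnt
    FROM tweets_raw
    LATERAL VIEW EXPLODE(entities.user_mentions) m AS mention
    WHERE dt = '${LAST_SUCCESS_DAY}'
    GROUP BY mention.screen_name
    ORDER BY cnt DESC
    LIMIT 100;

The “Top 100 URLs mentioned per day” query has the same shape, with entities.urls exploded instead.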

Cluster deployment scripts

Let’s create a bootstrap script for the EMR cluster:
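
A sketch of the script (bucket, prefix and jar name are placeholders); it simply pulls the compiled SerDe from the deployment folder in S3 onto each node:

    #!/bin/bash
    # bootstrap.sh: copy the JSON SerDe jar to the local path
    # referenced by ADD JAR in the Hive scripts
    set -e
    aws s3 cp s3://my-twitter-bucket/deploy/json-serde-1.3-jar-with-dependencies.jar /home/hadoop/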

It is required so that we can create the Hive tables and execute queries over the raw JSON data in S3 storage.

After all that, my local deployment folder looks like this:
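
An illustrative layout; the file names are placeholders matching the snippets above:

    deploy/
      bootstrap.sh
      create_tables.hql
      top_mentions_aggregation.hql
      json-serde-1.3-jar-with-dependencies.jar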

Let’s put it into S3:
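
For example, with the placeholder bucket used throughout this post:

    aws s3 sync ./deploy s3://my-twitter-bucket/deploy/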

Create cluster and submit job

Almost there. To submit a job to the cluster we need to prepare steps (units of work for EMR).

Prepare a step to create the Hive tables:
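
A sketch using the --steps shorthand syntax of the AWS CLI; the script location is the placeholder deployment prefix from above:

    # step that runs the table-creation Hive script
    CREATE_TABLES_STEP="Type=HIVE,Name=CreateHiveTables,ActionOnFailure=TERMINATE_CLUSTER,Args=[-f,s3://my-twitter-bucket/deploy/create_tables.hql]"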

Create the aggregation step:
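
Same shape, but the Hive script receives the day to process through -d:

    # step that runs the aggregation; 20160101 is a test value (see below)
    AGGREGATION_STEP="Type=HIVE,Name=TopMentionsAggregation,ActionOnFailure=TERMINATE_CLUSTER,Args=[-f,s3://my-twitter-bucket/deploy/top_mentions_aggregation.hql,-d,LAST_SUCCESS_DAY=20160101]"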

Pay attention to the substitution value for the LAST_SUCCESS_DAY variable. It is set to 20160101 just for cluster testing; it will be replaced with real values in Amazon Data Pipeline.

Finally, the command to set up and run the EMR cluster:
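
A sketch of the call; the release label, instance type and count, key pair and S3 paths are placeholders to adjust for your account:

    aws emr create-cluster \
      --name "twitter-analytics" \
      --release-label emr-4.2.0 \
      --applications Name=Hive \
      --use-default-roles \
      --ec2-attributes KeyName=my-key-pair \
      --instance-type m3.xlarge \
      --instance-count 3 \
      --log-uri s3://my-twitter-bucket/emr-logs/ \
      --bootstrap-actions Path=s3://my-twitter-bucket/deploy/bootstrap.sh,Name=CopyJsonSerDe \
      --steps "$CREATE_TABLES_STEP" "$AGGREGATION_STEP" \
      --auto-terminate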

It is important to specify --auto-terminate, since it is turned off by default for clusters created from the CLI. If you omit this flag, your cluster will keep running after all steps have executed. I’ve written this command for Linux; put it all on a single line if you want to run it on Windows.

This is how my DynamoDB table looks after cluster termination:

DynamoDB Data Example

In Part 2 of this post I will cover the Data Pipeline configuration to schedule this processing on a daily basis.