Twitter analytics with Amazon EMR and DynamoDB. Part 2 – Amazon Data Pipeline

This post covers AWS Data Pipeline configuration to load Twitter data from S3 into DynamoDB using EMR on a daily basis.

In Part 1 I described how to set up and deploy an EMR cluster for our ETL process. Now it is time to automate it with AWS Data Pipeline.

AWS Data Pipeline is a service that lets you schedule the processing and movement of your data between different Amazon services such as S3, EMR, DynamoDB, Redshift, etc.

In my case the pipeline looks like this:

[Pipeline diagram]

For me it was easier to design the pipeline in the web browser, but here I will provide the full JSON listing:

I have only highlighted the lines that need to be changed (the name of the S3 bucket).

In this pipeline we create an EMR cluster (EmrClusterObj) and an EMR activity (EmrActivityObj) that runs two Hive scripts on that cluster.
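For orientation, below is a minimal sketch of what such a definition can look like. The bucket name (my-twitter-bucket), script paths, instance types, and IAM roles are placeholders rather than the values from my setup; the bootstrap actions and parameters described in Part 1 are omitted here; and the exact Hive step syntax depends on the EMR release you choose:

{
  "objects": [
    {
      "id": "Default",
      "name": "Default",
      "scheduleType": "cron",
      "schedule": { "ref": "DailySchedule" },
      "failureAndRerunMode": "CASCADE",
      "pipelineLogUri": "s3://my-twitter-bucket/logs/",
      "role": "DataPipelineDefaultRole",
      "resourceRole": "DataPipelineDefaultResourceRole"
    },
    {
      "id": "DailySchedule",
      "name": "DailySchedule",
      "type": "Schedule",
      "period": "1 Day",
      "startAt": "FIRST_ACTIVATION_DATE_TIME"
    },
    {
      "id": "EmrClusterObj",
      "name": "EmrClusterObj",
      "type": "EmrCluster",
      "releaseLabel": "emr-4.7.0",
      "applications": ["Hive"],
      "masterInstanceType": "m3.xlarge",
      "coreInstanceType": "m3.xlarge",
      "coreInstanceCount": "1",
      "terminateAfter": "2 Hours"
    },
    {
      "id": "EmrActivityObj",
      "name": "EmrActivityObj",
      "type": "EmrActivity",
      "runsOn": { "ref": "EmrClusterObj" },
      "step": [
        "command-runner.jar,hive-script,--run-hive-script,--args,-f,s3://my-twitter-bucket/hive/create-tables.q",
        "command-runner.jar,hive-script,--run-hive-script,--args,-f,s3://my-twitter-bucket/hive/load-to-dynamodb.q"
      ]
    }
  ],
  "parameters": []
}

Once saved to a file, a definition along these lines can be registered and activated with the AWS CLI (aws datapipeline put-pipeline-definition followed by aws datapipeline activate-pipeline), or recreated in the architect view of the console.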

Parameters and bootstrap actions were described in Part 1 of this series.

After all that, we can analyze our data in the DynamoDB table. In my case this is done using CData's ODBC driver and Microsoft Power BI:

[Mentions Analysis in Power BI]

Now, the interesting part: the overall estimated price of such a solution is around $20.

Sure, it is a draft process, but it shows the overall idea, and it also shows that big data projects on AWS are not so expensive, at least at the beginning. :)