Twitter analytics with Amazon EMR and DynamoDB. Part 2 – Amazon Data Pipeline

This post covers configuring AWS Data Pipeline to load Twitter data from S3 into DynamoDB using EMR on a daily basis.

In Part 1 I described how to set up and deploy an EMR cluster for our ETL process. Now it is time to automate it with AWS Data Pipeline.

Oozie – Capture output from Hive query

Capturing the output of Hive queries in Oozie is essential if you’re going to implement any ETL-like solution using Hive. The most commonly used approach is a shell action; however, it requires the Hive CLI to be installed on each node, and it does not work for remote clusters. Here I want to share a more generic approach using a custom Java action.

Twitter analytics with Amazon EMR and DynamoDB. Part 1

A step-by-step guide to Twitter analytics using Elastic MapReduce, DynamoDB and AWS Data Pipeline.

In this post I will use the Flume agent configured in the previous post to deliver raw JSON data to S3 storage. By Twitter analytics I mean aggregations such as “Top 100 users mentioned per day” and “Top 100 URLs mentioned per day”.