This post covers Amazon Data Pipeline configuration to load Twitter data from S3 into DynamoDB using EMR on a daily basis.
How to capture output from Hive queries in Oozie is an essential question if you’re going to implement any ETL-like solution using Hive. The most commonly used approach is a shell action; however, it requires the Hive CLI to be installed on each node, and it doesn’t work for remote clusters. Here I want to share a more generic approach using a custom Java action. Continue reading
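As a quick illustration of the capture-output side of this approach: an Oozie Java action can hand values back to the workflow by writing a properties file to the path Oozie supplies in the `oozie.action.output.properties` system property. A minimal sketch follows; the `row_count` value is hard-coded here as a stand-in for what would really come from the Hive query (e.g. over Hive JDBC), and the class name is my own.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.util.Properties;

public class CaptureOutputDemo {
    public static void main(String[] args) throws Exception {
        // In the real action this value would be the result of a Hive query
        // (fetched e.g. via Hive JDBC); hard-coded here for illustration.
        String rowCount = "42";

        Properties props = new Properties();
        props.setProperty("row_count", rowCount);

        // Oozie tells a Java action where to write captured output
        // through this system property; the workflow can then read the
        // value via ${wf:actionData('action-name')['row_count']}.
        String outPath = System.getProperty("oozie.action.output.properties");
        try (FileOutputStream os = new FileOutputStream(new File(outPath))) {
            props.store(os, null);
        }
    }
}
```

The key point is that no Hive CLI is needed on the worker node: the action talks to the cluster itself and only the small properties file crosses back into the Oozie workflow.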
A step-by-step guide to Twitter analytics tasks using Elastic MapReduce, DynamoDB and Amazon Data Pipeline.
In this post I will use the Flume agent configured in the previous post to deliver raw JSON data to S3 storage. Also, by Twitter analytics I mean aggregations like “Top 100 users mentioned per day” and “Top 100 URLs mentioned per day”. Continue reading
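To make the kind of aggregation concrete, a “Top 100 users mentioned per day” query in Hive could look roughly like the sketch below; the table name `tweets` and the columns `mentioned_user` and `dt` (date partition) are assumptions for illustration, not the actual schema from the post.

```sql
-- Sketch only: table and column names are hypothetical.
SELECT mentioned_user, count(*) AS mentions
FROM tweets
WHERE dt = '2014-01-01'
GROUP BY mentioned_user
ORDER BY mentions DESC
LIMIT 100;
```

The “Top 100 URLs per day” aggregation is the same shape, grouping on a URL column instead.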
Here I will explain how to run a Flume agent on a Windows machine and save the data to HDFS on a remote cluster.
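The core of such a setup is the Flume agent configuration: a source feeding a channel, and an HDFS sink whose path points at the remote cluster’s NameNode. A minimal sketch, in which the agent name, the tailed file path and the `remote-nn` host are placeholders of my own:

```
# Agent "a1": exec source -> memory channel -> HDFS sink on a remote cluster
a1.sources = src1
a1.channels = ch1
a1.sinks = hdfs1

a1.sources.src1.type = exec
a1.sources.src1.command = tail -F C:\flume\input.log
a1.sources.src1.channels = ch1

a1.channels.ch1.type = memory

a1.sinks.hdfs1.type = hdfs
a1.sinks.hdfs1.channel = ch1
# remote-nn is a placeholder for the remote NameNode host
a1.sinks.hdfs1.hdfs.path = hdfs://remote-nn:8020/flume/events/%Y-%m-%d
a1.sinks.hdfs1.hdfs.fileType = DataStream
a1.sinks.hdfs1.hdfs.useLocalTimeStamp = true
```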
This is the first post in a series about Twitter data processing using Amazon EC2.
Here I will describe, step by step, how to set up Apache Flume monitoring with Ganglia on a single EC2 instance.
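For reference, Flume ships with built-in Ganglia reporting that is enabled with JVM properties when starting the agent; a sketch of the launch command, where the agent name, config file and Ganglia endpoint are placeholders:

```shell
# Start a Flume agent with Ganglia monitoring enabled.
# a1, flume.conf and localhost:8649 (gmond's default port) are example values.
flume-ng agent --conf ./conf --conf-file flume.conf --name a1 \
  -Dflume.monitoring.type=ganglia \
  -Dflume.monitoring.hosts=localhost:8649
```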