How to capture output from Hive queries in Oozie is an essential question if you’re going to implement any ETL-like solution using Hive. The most commonly used approach is a shell action; however, it requires the Hive CLI to be installed on each node, and it doesn’t work for remote clusters. Here I want to share a more generic approach using a custom Java action.
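As a rough illustration of the mechanism a custom Java action can rely on: when a Java action declares `<capture-output/>`, Oozie passes the path of an output file in the `oozie.action.output.properties` system property, and the action is expected to write its results there in Java properties format. The sketch below shows only that output step. The class name `CaptureOutputSketch`, the `row_count` key, and the use of a command-line argument in place of a real Hive query result are assumptions made for this example, not the approach from the full post.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Properties;

public class CaptureOutputSketch {
    // System property Oozie sets for a Java action that declares <capture-output/>.
    static final String OOZIE_OUTPUT_PROP = "oozie.action.output.properties";

    /** Writes the given key/value pairs to the target file in Java properties format. */
    public static void writeOutput(Properties values, File target) throws IOException {
        try (OutputStream out = new FileOutputStream(target)) {
            values.store(out, null);
        }
    }

    public static void main(String[] args) throws IOException {
        String path = System.getProperty(OOZIE_OUTPUT_PROP);
        if (path == null) {
            throw new IllegalStateException(
                OOZIE_OUTPUT_PROP + " is not set; is <capture-output/> declared?");
        }
        Properties result = new Properties();
        // Placeholder (assumption): a real action would obtain this value
        // from the Hive query result, for example over JDBC.
        result.setProperty("row_count", args.length > 0 ? args[0] : "0");
        writeOutput(result, new File(path));
    }
}
```

Later workflow nodes could then read the value with an EL expression such as `${wf:actionData('hive-java-action')['row_count']}`, where `hive-java-action` is a hypothetical action name.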
A step-by-step guide to Twitter analytics tasks using Elastic MapReduce, DynamoDB and Amazon Data Pipeline.
In this post I will use the Flume agent configured in the previous post to deliver raw JSON data to S3 storage. By Twitter analytics I mean aggregations such as “Top 100 users mentioned per day” and “Top 100 URLs mentioned per day”.
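To make the "top N users mentioned per day" aggregation concrete, here is a minimal in-memory sketch in plain Java. It is not the Elastic MapReduce pipeline the post describes. It assumes the tweet texts for a single day have already been extracted from the raw JSON, and the class name `TopMentionsSketch` and the simple `@(\w+)` mention pattern are illustrative choices.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class TopMentionsSketch {
    // Simplified mention pattern (assumption): '@' followed by word characters.
    // It will also match the part after '@' in an e-mail address.
    private static final Pattern MENTION = Pattern.compile("@(\\w+)");

    /** Returns the n most-mentioned users, ties broken alphabetically. */
    public static List<Map.Entry<String, Long>> topMentions(List<String> tweetTexts, int n) {
        Map<String, Long> counts = new HashMap<>();
        for (String text : tweetTexts) {
            Matcher m = MENTION.matcher(text);
            while (m.find()) {
                // Lower-case so that @Alice and @alice count as the same user.
                counts.merge(m.group(1).toLowerCase(), 1L, Long::sum);
            }
        }
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Long>comparingByKey()))
            .limit(n)
            .collect(Collectors.toList());
    }
}
```

Calling `topMentions(texts, 100)` on one day's tweets would give the "Top 100 users mentioned" list for that day; the same shape applies to URLs with a different pattern.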