Custom Twitter source for Apache Flume

This post is about an advanced custom Twitter source for Apache Flume.

In the previous post I described Flume installation and configuration. I will use the same EC2 node in this article, but everything I describe here will work for any other Apache Flume installation.

I often see people trying to use the built-in Twitter source or the Cloudera Twitter source, and I want to point out some drawbacks of using those sources.

Let’s check the built-in Twitter source first. The Flume documentation describes it as follows:

So, the first issue here is that it supports only the sample stream, and the second (for me) is the limited set of fields in the Avro schema.

Cloudera’s example looks much better: it allows filtering the stream by keywords, and it saves the data in raw JSON format with all available fields included.

Now let’s talk about what we can improve here.

  • If we check the code of both sources, it looks like:
Those events are an important part of interacting with the Twitter Streaming API, so it is not good to ignore them entirely.

  • Add support for Flume counters, since we want to see detailed metrics about the source status in our monitoring tool.
You can find the full version of the code on the GitHub page.
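To make the two improvements above more concrete, here is a minimal sketch of how stream events could be tallied instead of ignored. It is not the code from the GitHub repository: `StreamEventCounters` and its method names are hypothetical, and the wiring into Flume's counter/monitoring classes is left out. The idea is that each twitter4j `StatusListener` callback (`onStatus`, `onDeletionNotice`, `onTrackLimitationNotice`, `onStallWarning`, `onException`) would call the matching method here.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper (not part of Flume or twitter4j): thread-safe tallies of
// streaming events that a custom source could expose as metrics.
final class StreamEventCounters {
    private final AtomicLong statuses = new AtomicLong();
    private final AtomicLong deletionNotices = new AtomicLong();
    private final AtomicLong limitNotices = new AtomicLong();
    private final AtomicLong lastLimitedCount = new AtomicLong();
    private final AtomicLong stallWarnings = new AtomicLong();
    private final AtomicLong exceptions = new AtomicLong();

    // Intended to be called from the listener's onStatus callback.
    void statusReceived() { statuses.incrementAndGet(); }

    // Intended to be called from onDeletionNotice.
    void deletionNoticeReceived() { deletionNotices.incrementAndGet(); }

    // Intended to be called from onTrackLimitationNotice(int).
    // Keeps both the number of notices and the most recently reported value.
    void trackLimitationReceived(long numberOfLimitedStatuses) {
        limitNotices.incrementAndGet();
        lastLimitedCount.set(numberOfLimitedStatuses);
    }

    // Intended to be called from onStallWarning.
    void stallWarningReceived() { stallWarnings.incrementAndGet(); }

    // Intended to be called from onException.
    void exceptionReceived() { exceptions.incrementAndGet(); }

    long getStatuses() { return statuses.get(); }
    long getDeletionNotices() { return deletionNotices.get(); }
    long getLimitNotices() { return limitNotices.get(); }
    long getLastLimitedCount() { return lastLimitedCount.get(); }
    long getStallWarnings() { return stallWarnings.get(); }
    long getExceptions() { return exceptions.get(); }
}
```

Atomic counters are used because the stream callbacks and the monitoring reporter typically run on different threads.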

Now let’s test our custom flume source. I will simply adjust config from previous post.

Note that the Twitter Streaming API combines the keywords, follow and locations parameters with a logical “OR”.

If you need to apply more than one filter (with a logical “AND”), you need to implement the filtering manually. For example, filter by locations first and then manually filter out tweets by keywords, or vice versa, depending on the cardinality of each filter.
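A rough sketch of such manual “AND” filtering is shown below. It assumes the stream is requested with only the locations filter, and the keyword check is then applied on the client side. `SimpleTweet`, `BoundingBox` and `AndFilter` are hypothetical names used for illustration only, and the case-insensitive substring match is a simplification of how Twitter itself matches keywords.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical minimal tweet model: only the fields the filter needs.
final class SimpleTweet {
    final String text;
    final double longitude;
    final double latitude;

    SimpleTweet(String text, double longitude, double latitude) {
        this.text = text;
        this.longitude = longitude;
        this.latitude = latitude;
    }
}

// Bounding box given as south-west and north-east corners (longitude, latitude).
final class BoundingBox {
    final double swLon, swLat, neLon, neLat;

    BoundingBox(double swLon, double swLat, double neLon, double neLat) {
        this.swLon = swLon;
        this.swLat = swLat;
        this.neLon = neLon;
        this.neLat = neLat;
    }

    boolean contains(double lon, double lat) {
        return lon >= swLon && lon <= neLon && lat >= swLat && lat <= neLat;
    }
}

final class AndFilter {
    // True if the text contains at least one keyword (case-insensitive substring match).
    static boolean matchesKeywords(String text, List<String> keywords) {
        String lower = text.toLowerCase(Locale.ROOT);
        for (String keyword : keywords) {
            if (lower.contains(keyword.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    // Location AND keywords: both conditions must hold.
    static boolean accept(SimpleTweet tweet, BoundingBox box, List<String> keywords) {
        return box.contains(tweet.longitude, tweet.latitude)
                && matchesKeywords(tweet.text, keywords);
    }
}
```

For instance, `AndFilter.accept(tweet, new BoundingBox(-122.75, 36.8, -121.75, 37.8), Arrays.asList("flume"))` keeps only tweets that are inside the box and mention the keyword. Whichever filter is expected to match fewer tweets is the better candidate for the server-side request, so less data has to be discarded locally.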

Now let’s put our jar, along with the twitter4j dependencies, in a location separate from the main Flume lib folder, so that we don't pollute the main installation classpath in case other agents run on it.

ls $FLUME_HOME/aux_lib

flume-twitter-source-0.0.1-SNAPSHOT.jar twitter4j-stream-4.0.4.jar
twitter4j-core-4.0.4.jar

Run the agent:

And check the Ganglia Web UI:

[Screenshots: Ganglia_twitter_1, Ganglia_twitter_2]

So, we’re done with the Flume agent configuration. In the next post I will cover the EMR cluster configuration to process this data.
