Apache Flume Ganglia monitoring on AWS EC2

This is the first post in a series about Twitter data processing using Amazon EC2.

Here I will describe, step by step, how to set up Apache Flume with Ganglia monitoring on a single EC2 instance.

1. EC2 configuration

t2.micro + Amazon Linux AMI

Flume is a very lightweight tool and doesn't require much processing power to stream data in single-process mode, which is why I chose a t2.micro instance.
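One thing worth doing at this stage: the Ganglia web UI will be served by Apache over plain HTTP, so port 80 has to be open in the instance's security group. You can do this from the console, or with the AWS CLI; a minimal sketch, where the group ID and the CIDR are placeholders for your own values:

# open HTTP (for the Ganglia UI) only to your own IP
aws ec2 authorize-security-group-ingress \
--group-id sg-xxxxxxxx \
--protocol tcp \
--port 80 \
--cidr 203.0.113.10/32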

2. Ganglia installation

sudo yum install ganglia ganglia-gmond ganglia-gmetad ganglia-web
sudo service gmetad start
sudo service gmond start
sudo service httpd start

If you want to start those services automatically during EC2 startup,
run the following commands:

sudo chkconfig gmond on
sudo chkconfig gmetad on
sudo chkconfig httpd on
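
To double-check that all three daemons actually started, you can query their status:

sudo service gmond status
sudo service gmetad status
sudo service httpd status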

After that we should be able to access the Ganglia UI at a URL like:

http://your-instance-public-dns-name/ganglia

and see some basic metrics about your EC2 instance.

If you get the error

403 Forbidden. You don’t have permission to access /ganglia on this server.

you can check the answer provided here: http://stackoverflow.com/questions/23515934/unable-to-view-ganglia-dashboard
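
In short, the fix usually comes down to relaxing the access restriction in the Apache config that the ganglia-web package installs. A sketch of what that file might look like after the change, assuming Amazon Linux with Apache 2.2 and the config living at /etc/httpd/conf.d/ganglia.conf (use your own public IP rather than allowing everyone):

# /etc/httpd/conf.d/ganglia.conf
Alias /ganglia /usr/share/ganglia
<Location /ganglia>
 Order deny,allow
 Deny from all
 Allow from 127.0.0.1
 # add your own public IP here
 Allow from 203.0.113.10
</Location>

Restart Apache afterwards with sudo service httpd restart.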

3. Ganglia configuration

Next, we need to configure the Ganglia daemons.

By default they come with a configuration that suits a generic Linux server, but not the EC2 case.

I will follow a simple approach here: I have a configured Amazon EMR cluster, which already includes Ganglia with proper configs, so I will use those as templates.

Here are the changes that need to be made. Gmetad:

data_source "my cluster" 127.0.0.1:8649

and Gmond:

globals {
 daemonize = yes
 setuid = yes
 user = ganglia
 debug_level = 0
 max_udp_msg_len = 1472
 mute = no
 deaf = no
 allow_extra_data = yes
 host_dmax = 86400 /*secs. Expires (removes from web interface) hosts in 1 day */
 host_tmax = 20 /*secs */
 cleanup_threshold = 300 /*secs */
 gexec = no
 # By default gmond will use reverse DNS resolution when displaying your hostname
 # Uncommenting the following value will override that behavior.
 # override_hostname = "mywebserver.domain.com"
 # If you are not using multicast this value should be set to something other than 0.
 # Otherwise if you restart aggregator gmond you will get empty graphs. 60 seconds is reasonable
 send_metadata_interval = 60 /*secs */
}

cluster {
 name = "ec2Flume"
 owner = "unspecified"
 latlong = "unspecified"
 url = "unspecified"
}
......
udp_send_channel {
 #bind_hostname = yes # Highly recommended, soon to be default.
 # This option tells gmond to use a source address
 # that resolves to the machine's hostname. Without
 # this, the metrics may appear to come from any
 # interface and the DNS names associated with
 # those IPs will be used to create the RRDs.
 host = your_internal_hostname.ec2.internal
 port = 8649
 ttl = 1
}

udp_recv_channel {
 port = 8649
 retry_bind = true
 # Size of the UDP buffer. If you are handling lots of metrics you really
 # should bump it up to e.g. 10MB or even higher.
 # buffer = 10485760
}

tcp_accept_channel {
 port = 8649
 # If you want to gzip XML output
 gzip_output = no
}
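
Once both files are edited (on Amazon Linux they typically live at /etc/ganglia/gmetad.conf and /etc/ganglia/gmond.conf), restart the daemons so the new configuration takes effect:

sudo service gmond restart
sudo service gmetad restart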

To get your internal hostname just run this command:

hostname
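
This prints the instance's private name, something like ip-172-31-5-17 (the exact form depends on the AMI and region); if the .ec2.internal suffix is not already part of the output, append it as shown in the gmond config above.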

Another useful link about custom Ganglia installation and configuration:

http://blog.kenweiner.com/2010/10/monitor-hbase-hadoop-with-ganglia-on.html

and the Ganglia wiki, of course.

4. Flume installation

Flume is not shipped in the standard repositories, so we have to install it manually.

wget http://ftp.piotrkosoft.net/pub/mirrors/ftp.apache.org/flume/1.6.0/apache-flume-1.6.0-bin.tar.gz
tar -xzvf apache-flume-1.6.0-bin.tar.gz
export FLUME_HOME=/home/ec2-user/apache-flume-1.6.0-bin
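
The export above only lasts for the current shell session; to make it permanent and to quickly verify that the unpacked distribution works, something along these lines will do (a sketch, assuming the default ec2-user home directory):

echo 'export FLUME_HOME=/home/ec2-user/apache-flume-1.6.0-bin' >> ~/.bashrc
$FLUME_HOME/bin/flume-ng version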

5. Flume configuration

To test the Flume installation we will use the built-in sequence generator source.
First, prepare the flume-conf.properties file in $FLUME_HOME/conf:

agent.sources = seqGenSrc
agent.channels = MemCh
agent.sinks = s3

agent.sources.seqGenSrc.type = seq
agent.sources.seqGenSrc.channels = MemCh

agent.sinks.s3.channel = MemCh
agent.sinks.s3.type = hdfs
agent.sinks.s3.hdfs.path = s3n://AWS_ACCESS_KEY_ID:AWS_ACCESS_KEY_SECRET@bucket/sampleseq/
agent.sinks.s3.hdfs.fileType = DataStream
agent.sinks.s3.hdfs.filePrefix = SampleSequence
agent.sinks.s3.hdfs.writeFormat = Text
agent.sinks.s3.hdfs.inUsePrefix = _
agent.sinks.s3.hdfs.maxOpenFiles = 10
agent.sinks.s3.hdfs.batchSize = 100
agent.sinks.s3.hdfs.rollSize = 0
agent.sinks.s3.hdfs.rollCount = 100
agent.sinks.s3.hdfs.rollInterval = 0

agent.channels.MemCh.type = memory
agent.channels.MemCh.capacity = 10000
agent.channels.MemCh.transactionCapacity = 1000
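
A side note on the credentials embedded in hdfs.path above: if you would rather keep the access key and secret out of the properties file, the standard Hadoop s3n settings can go into a core-site.xml instead. A sketch, assuming the file is placed in $FLUME_HOME/conf (which ends up on the agent classpath via the -c option used below), with the path then shortened to s3n://bucket/sampleseq/:

<?xml version="1.0"?>
<configuration>
 <property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>AWS_ACCESS_KEY_ID</value>
 </property>
 <property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>AWS_ACCESS_KEY_SECRET</value>
 </property>
</configuration>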

Flume won't run successfully without quite a few libraries needed to access S3 storage, so before moving to the next step we need to copy them to $FLUME_HOME/lib.

To get the proper versions of the libs, it is easiest to follow the same logic as we did for Ganglia:

Go to a configured EMR cluster and get the libs from there (a copy sketch follows after the list).

In my case, here is the list of required libs:

aws-java-sdk-1.10.56.jar
aws-java-sdk-config-1.10.56.jar
aws-java-sdk-core-1.10.56.jar
aws-java-sdk-ec2-1.10.56.jar
aws-java-sdk-efs-1.10.56.jar
aws-java-sdk-emr-1.10.56.jar
aws-java-sdk-events-1.10.56.jar
aws-java-sdk-iot-1.10.56.jar
aws-java-sdk-s3-1.10.56.jar

hadoop-auth-2.7.2-amzn-0.jar
hadoop-aws-2.7.2-amzn-0.jar
hadoop-common-2.7.2.jar
hadoop-distcp-2.7.2-amzn-0.jar
hadoop-hdfs-2.7.2-amzn-0.jar
htrace-core-3.1.0-incubating.jar
httpclient-4.3.4.jar
httpcore-4.3.2.jar
jets3t-0.9.0.jar
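
A sketch of how the copy itself might look, assuming SSH access to the EMR master node and that the jars sit under /usr/lib/hadoop and /usr/share/aws/aws-java-sdk there (the exact locations vary between EMR releases, so check on your cluster first):

# run from the Flume instance; key, host and source paths are placeholders
scp -i my-key.pem hadoop@your-emr-master-dns:/usr/lib/hadoop/hadoop-aws-2.7.2-amzn-0.jar $FLUME_HOME/lib/
scp -i my-key.pem hadoop@your-emr-master-dns:/usr/share/aws/aws-java-sdk/aws-java-sdk-s3-1.10.56.jar $FLUME_HOME/lib/
# ...and so on for the rest of the jars listed above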

6. Run the agent

Finally, to run the Flume agent I will use the following command:

$FLUME_HOME/bin/flume-ng agent -n agent \
-f $FLUME_HOME/conf/flume-conf.properties \
-c $FLUME_HOME/conf \
-Dflume.root.logger=INFO,LOGFILE,console \
-Dflume.log.file=FlumeAgentSeq.log \
-Dflume.monitoring.type=ganglia \
-Dflume.monitoring.hosts=$(hostname).ec2.internal:8649

You might see the following warning message in the log, which looks pretty much like an exception :)

[WARN – org.apache.flume.sink.hdfs.BucketWriter.getRefIsClosed(BucketWriter.java:183)] isFileClosed is not available in the version of HDFS being used. Flume will not attempt to close files if the close fails on the first attempt
java.lang.NoSuchMethodException: org.apache.hadoop.fs.s3native.NativeS3FileSystem.isFileClosed(org.apache.hadoop.fs.Path)

In fact, this is by design of the AWS libs and you shouldn't worry about it.

Some more details about this warning can be found in the FLUME-2427 JIRA ticket.

If you don't see any other exceptions, Flume is most probably running successfully, and you can finally check your Ganglia web UI. In my case it looks like this:

[Screenshot: Flume with Ganglia monitoring example]

That's it. In the next post I will cover the custom Twitter source build and configuration process based on this setup.

UPD. If you need to start the agent automatically, do the following:

Setup flume agent auto startup
