Flume on Windows: Save Data to Remote HDFS

Here I will explain how to run a Flume agent on a Windows machine and save the data to HDFS on a remote cluster.

First things first, we need the Flume binaries.

We can download apache-flume-1.6.0-bin.tar.gz from the Apache archive: http://archive.apache.org/dist/flume/1.6.0/
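For example, you can fetch and unpack it from the command line (a sketch, assuming PowerShell 3+ and 7-Zip on the PATH for the .tar.gz extraction; adjust paths to taste):

powershell -Command "Invoke-WebRequest http://archive.apache.org/dist/flume/1.6.0/apache-flume-1.6.0-bin.tar.gz -OutFile apache-flume-1.6.0-bin.tar.gz"
:: 7-Zip unpacks .tar.gz in two steps: first the gzip layer, then the tar
7z x apache-flume-1.6.0-bin.tar.gz
7z x apache-flume-1.6.0-bin.tar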

Then we need to create a Flume properties file, e.g. flume-conf.properties. It can be placed in the apache-flume-1.6.0-bin/conf folder.

# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'agent'

agent.sources = seqGenSrc
agent.channels = memoryChannel
agent.sinks = h1

# For each one of the sources, the type is defined
# The seq source generates a sequence of numbers. It is very useful for testing purposes
agent.sources.seqGenSrc.type = seq

# The channel can be defined as follows.
agent.sources.seqGenSrc.channels = memoryChannel

# Each sink's type must be defined
agent.sinks.h1.type = hdfs
agent.sinks.h1.hdfs.path = hdfs://YOUR_NAME_NODE:8020/user/flume/FlumeOnWindows/
agent.sinks.h1.hdfs.fileType = DataStream
agent.sinks.h1.hdfs.writeFormat = Text
agent.sinks.h1.hdfs.batchSize = 100
agent.sinks.h1.hdfs.rollSize = 0
agent.sinks.h1.hdfs.rollCount = 0

#Specify the channel the sink should use
agent.sinks.h1.channel = memoryChannel

# Each channel's type is defined.
agent.channels.memoryChannel.type = memory

# Other config values specific to each type of channel (sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
agent.channels.memoryChannel.capacity = 1000000
agent.channels.memoryChannel.transactionCapacity = 1000
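Note that with rollSize and rollCount both set to 0, size- and count-based file rolling is disabled, so the sink rolls files only on the time-based default (rollInterval is 30 seconds). If you want fewer, larger files, you could add, for example:

# roll a new file every 10 minutes instead of the 30-second default
agent.sinks.h1.hdfs.rollInterval = 600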

Next, we need to download winutils.exe. Here is the official repo: https://github.com/steveloughran/winutils

It can also be downloaded from Hortonworks: http://s3.amazonaws.com/public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe

Let’s place it in “C:/hadoop/winutils.exe”. In this case “C:/hadoop” is going to be our HADOOP_HOME.
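To make HADOOP_HOME persist across sessions, you can set it once from a command prompt. Note that some Hadoop builds look for the binary under %HADOOP_HOME%\bin, so keeping a copy in C:\hadoop\bin as well does no harm (setx only takes effect in newly opened consoles):

mkdir C:\hadoop\bin
copy C:\hadoop\winutils.exe C:\hadoop\bin\
setx HADOOP_HOME "C:\hadoop"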

Let’s try to run it and see what happens…

bin/flume-ng agent -name agent -f conf/flume-conf.properties -property "flume.root.logger=DEBUG,LOGFILE,console;hadoop.home.dir=C:/hadoop"

Here is the error I got:

[ERROR - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:145)] Failed to start agent because dependencies were not found in classpath. Error follows.
java.lang.NoClassDefFoundError: org/apache/hadoop/io/SequenceFile$CompressionType

This means our Flume installation is missing some Hadoop JARs.

In my case I needed to add the following JARs to the apache-flume-1.6.0-bin/lib folder:

  • commons-configuration-1.6.jar
  • commons-io-2.4.jar
  • hadoop-auth-2.7.1.2.3.2.0-2950.jar
  • hadoop-hdfs-2.7.1.2.3.2.0-2950.jar
  • htrace-core-3.1.0-incubating.jar

The required JARs can be found on any node of your Hadoop cluster; for Hortonworks it is the /usr/hdp/current/hadoop-hdfs folder.

In your case it might be a different set of JARs, depending on your Hadoop distribution.
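One way to pull a JAR over from a cluster node (a sketch, assuming PuTTY's pscp is on the PATH; the user and host names are just placeholders):

pscp user@YOUR_NAME_NODE:/usr/hdp/current/hadoop-hdfs/hadoop-hdfs-2.7.1.2.3.2.0-2950.jar apache-flume-1.6.0-bin\lib\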

Also, you might need to set up HDFS permissions correctly for your Windows user, or set dfs.permissions.enabled = false in hdfs-site.xml (fine for testing, but don't do this on a production cluster).
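The relevant hdfs-site.xml snippet looks like this:

<property>
  <name>dfs.permissions.enabled</name>
  <value>false</value>
</property>

Alternatively, you can keep permissions on and pre-create the target directory from a cluster node, handing it over to your user (the user name below is just an example; the chown has to run as the HDFS superuser):

hadoop fs -mkdir -p /user/flume/FlumeOnWindows
hadoop fs -chown -R my_windows_user /user/flume/FlumeOnWindows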

So, now let’s run the agent once again and see the result…
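To check what landed in HDFS, you can list and peek at the files from any cluster node (FlumeData is the HDFS sink's default file prefix, since our config doesn't override it):

hdfs dfs -ls /user/flume/FlumeOnWindows
hdfs dfs -cat "/user/flume/FlumeOnWindows/FlumeData.*" | head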

Flume output files list:

[screenshot: file listing of the HDFS target directory]

And file content:

0
1
2
3
4
5
6
7
8
9
10
...

That seems to be it. Data ingestion is a very simple process with Flume. Next time I will cover some custom Flume sources for processing Twitter data.