Oozie – Capture output from Hive query

How to capture output from Hive queries in Oozie is an essential question if you’re going to implement any ETL-like solution using Hive. Most commonly used approach is a shell-action, however it requires Hive CLI to be installed on each node, also it doesnt works for remote clusters. Here i wanted to share more generic approach using custom Java action.

The idea of the code is to write single row of Hive output to some variable. As per Oozie doc:

To query Hive i will use Hive2 server and JDBC driver. This approach allows to run this code from any environment, even if you’re trying to connect to remote cluster. Off course atleast Hive JDBC driver should be in your classpath, usually is already included to Oozie shared libs.

The essential code to do this looks like:

You can find full version of code on github.

Oozie workflow configuration:

After execution it will put the result to “max_timestamp” variable, which can be used this way:
Here it is. This small piece of code can be very helpful for ETL-like processing with Hive and Oozie.