Hive UDFs in Ruby (and other languages)

Apache Hive is a very powerful tool for processing data stored in Apache Hadoop. Structured and unstructured data can be accessed, processed, and manipulated using a SQL-like query language. This architecture allows anyone with reasonable SQL knowledge to write complex jobs with little to no knowledge of Hadoop, HDFS, or MapReduce.

-- Create a summary of purchases by product and day
SELECT day, product, count(*) as purchases, sum(revenue) as revenue, avg(revenue) as average_revenue_by_purchase
FROM purchases
WHERE day BETWEEN '2014-01-01' AND '2014-01-31'
GROUP BY day, product
ORDER BY product, day;

With only SQL knowledge, very powerful data extractions can be performed, but SQL can be limiting.

Enter UDFs

Hive supports User-Defined Functions (UDFs) as a way to add more complex capabilities than are available in SQL. Hive itself ships with a good list of built-in UDFs. Here at LivingSocial we have a few of our own in our HiveSwarm project, and there are a few other bundles we like: Klout's Brickhouse and stewi2's UDFs. These greatly expand the capabilities of Hive with features not available in standard SQL.
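
For example, the built-in UDFs already cover a lot of common string and URL work. A quick sketch (the referrer_url column here is hypothetical):

-- Pull the referring host out of a raw URL with the built-in parse_url UDF
SELECT parse_url(referrer_url, 'HOST') as referrer_host, count(*) as purchases
FROM purchases
GROUP BY parse_url(referrer_url, 'HOST');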

Developing new UDFs requires knowledge of the Hive API and Java. This is fine for Hadoop developers, but is a significant barrier for non-Java developers.
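
Before the scripted approach below, the workflow looks something like this: write the function in Java, compile it, package it into a jar, and only then register and call it. A rough sketch with a placeholder jar path and class name:

-- The traditional route: build a jar, then register the compiled class
add jar /path/to/my-udfs.jar;
create temporary function my_function as 'com.example.hive.MyFunction';
select my_function(product_id) from purchases limit 10;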

Enter Scripted UDFs

Java supports running code in other languages through the javax.script API. Our HiveSwarm project has a new scripted UDF that makes use of this so UDFs can be written in languages other than Java. As a bonus, this allows scripts to live directly in the Hive SQL instead of having to be compiled and jarred before they can be run.

The example below shows how it can be used to compute data that would normally be difficult to generate via SQL.

create temporary function scriptedUDF as 'com.livingsocial.hive.udf.ScriptedUDF';
-- Gather complex data combining groups and individual rows without joins
 select person_id, purchase_data['time'], purchase_data['diff'],
   purchase_data['product'], purchase_data['purchase_count'] as pc,
   purchase_data['id_json']
 from (
   select person_id, scriptedUDF('
 require "json"
 def evaluate(data)
   # This gathers all the data about purchases by person in one place so 
   # complex information can be gathered while avoiding complex joins
   # Note:  In order for this to work all the data passed into 
   # scriptedUDF for a row needs to fit into memory
   tmp = data.to_a  # convert the incoming Java list over to a Ruby array
   tmp.sort_by! { |a| a["time"].to_s.to_i } # numeric sort for the time differences
   last = 0
   tmp.map{ |row|
     # Compute the time difference between purchases and add the total 
     # purchase count per person
     t = row["time"]
     # The parts that would be much more difficult to generate with SQL
     row["diff"] = t - last
     row["purchase_count"] = tmp.length
     row["first_purchase"] = tmp[0]["time"]
     row["last_purchase"] = tmp[-1]["time"]
     # This shows that built-in libraries are available
     row["id_json"] = JSON.generate({"id" => row["id"]})
     last = t
     row
   }
 end', 'ruby', 'array<map<string,string>>',
        -- gather all the data about purchases by people so it can all be 
        -- passed into the evaluate function
        bh_collect(map(   -- Note, bh_collect is from Klout's Brickhouse
                          -- and allows collecting any type, 
                          -- see https://github.com/klout/brickhouse/
           'id', person_id,    -- included so the id_json example above has an id
           'time', purchase_time,
           'product', product_id)) ) as all_data
     from purchases
    group by person_id
 ) foo
 -- explode the data back out so it is available in flattened form
 lateral view explode(all_data) bar as purchase_data
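
The same function handles much simpler cases too. A minimal sketch, assuming the same argument order as above (the script, the language name, the Hive return type, then the columns passed to evaluate):

create temporary function scriptedUDF as 'com.livingsocial.hive.udf.ScriptedUDF';
-- Run a tiny bit of per-row Ruby without compiling anything
select scriptedUDF('
def evaluate(product)
  product.to_s.reverse
end', 'ruby', 'string', product_id) as reversed_product
from purchases;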

Celebrating Heroes 2012 Triathlon

What a great race! Thanks to the Columbia Triathlon Association for keeping everything running smoothly.

Everything race-wise went well. The only minor things that went wrong were forgetting to start my watch on time, forgetting to take off my gloves in T2, and some minor cramping on the run. All of those are things I can fix. I think I'm ready for the big race in a few weeks.

Zombies, Run!

I'm not a big fan of running, but for the last few weeks I've been running with Zombies, Run!, and it has made running much more enjoyable.

The story in the game is a great distraction on a short or long run. I run with the zombie chases enabled which adds an extra level of excitement that keeps you on your toes listening for the "Zombies Detected" warning.

For now it is iPhone only, but it is coming to Android soon.

The New iPad

The retina display got me. I could ignore the faster processor and the nifty cover of the iPad 2 and stick with my first-generation one. I did the same with the iPhone 3G and 3GS. I wasn't sure how much of a difference the retina display would make, but the iPhone 4 got me hooked. I'm looking forward to not being able to see the pixels on the new iPad.

Social networking for work

I've been trying out groups on LinkedIn, but for Hadoop and Java the posts are dominated by recruiters and minor flame wars. The Hadoop and related technology mailing lists are far more informative and useful. It looks like Twitter is another good place to see what's going on in Hadoop land. I need to spend more time there.

Whenever I'm looking for a new job I'll head back to LinkedIn. Until then I need something better.

A new start for the site

I'm moving everything over from my old host to Squarespace. I'm trying to get in the habit of posting more and this move will make it a lot easier.

I had a lot of old pictures up on my old site and I haven't figured out what to do with them. Most are not that good, but there are a few I really want to keep. At some point I'll either move them in here or switch to Flickr or Picasa to host them.

For now just posting something is a good start.