Fun w/ Conditional & Aggregation Functions
One problem that came up pretty early one is how to run aggregation functions conditionally. The actual use case came up when I was trying to count the number of certain events that a particular user triggered when visiting our website.
The data can be pictured as a table of events (login, sign in, etc) that had a user id associated with them. The idea is to produce counts and averages (aggregations) of the events. One can think of it as turning the original table over sideways so that what were the rows are now the columns (sort of...)
So, imagine the data as like:
TABLE 1
user id, event
01, login
01, sign_in
01, event_02
02, sign_in
02, event_03
...
And so on. The goal was to make it look like
TABLE 2
user id, login, sign_in, sign_out, event_01, event_02 ... etc
01,1,1,0,0,1, ...
In this example we simply want to count, in TABLE 1, the number of events associated with the event type column in TABLE 2.
How to do this? A quick look HIVE UDF docs shows we've got aggregation functions as well as condition functions. The relevant ones are:
count(expr) - Returns the number of rows for which the supplied expression is non-NULL.
and
CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END
Combining them would give us something like
SELECT
user_id,
count(event),
count(CASE event WHEN "login" THEN 1 ELSE NULL END),
count(CASE value WHEN "sign_in" THEN 1 ELSE NULL END),
count(CASE value WHEN "sign_out" THEN 1 ELSE NULL END),
count(CASE value WHEN "event_02" THEN 1 ELSE NULL END).
... etc
FROM
TABLE 1
GROUP BY
user_id













