Using Window Functions in SQL
Window functions allow us to perform calculations on a subset of rows in a table, rather than the entire table. A window function performs a calculation across a set of rows that are related to the current row, based on a specified window of observations. They are used to calculate running totals, ranks, percentiles and other aggregate calculations that can help identify patterns or trends within groups. Let’s look at how to use window functions in SQL. The code below shows a moving average calculation that uses a window function. It applies the average function over a window with five observations that is partitioned by a stocks symbol and ordered by time. This article is written with code and documentation from Kinetica, but the functions and concepts are applicable to any ANSI SQL compliant analytical tool that supports window functions. Scroll down to the bottom for more information on how to run the sample code using a free version of Kinetica.
AVG(price_close) OVER
(
PARTITION BY symbol
ORDER BY time
ROWS 5 PRECEDING
) AS avg_price
Apply Function OVER a Window
At the most abstract level, we can think of a window function as being applied on a “window” of rows that are related to the current row. The specified window slides with the “current row” so that the set of values that a window function is applied to updates with each row (see animation below). The window function generates one result value per row thereby allowing us to preserve the original granularity of the data. The window function above calculates the rolling average price for the past five observations using a window for the AAPL (Apple) stock price.
ORDER BY
The animation below illustrates the `ORDER BY` clause. The table is initially not ordered by the timestamp column. As a result, the average is not being computed on the previous five observations. We can fix this by simply ordering the data on the time column. Note that the window function does not update the actual ordering of the table itself, it simply reorders the rows for the sake of performing the calculation without mutating the table itself. This is important since it doesn’t inadvertently affect other queries that might require the table to be ordered on a column other than time.
PARTITION BY
We can address this by partitioning the data on the stock symbol column.
Framing a Window
So far, we have positioned the window above the current row with an arbitrary number of rows as its size. But what if we wanted to position the window elsewhere (above the current row, including the current row, from start, etc.)? The frame clause inside a window specification offers a great deal more flexibility and control over the sizing of a window. There are two types of frame clauses, rows and range. A range clause considers the values of the records in the ordering column. While a row clause considers the ordering of the records. This can be a bit hard to understand when written down but the distinction between the two crystallizes when we encounter duplicate (or “peer”) values in the ordering column. The tables in the illustration below have duplicate values in the ordering column time. Let’s ignore the partitioning for now. A row clause that specifies a frame with one row preceding and following the row will result in a frame with three observations in it — the row preceding the current row, the current row and the row following the current row. The range clause on the other hand yields a window that has nine rows in it — all the records with value of time (the ordering column) that fall within a range one less than or one more than the time of the current row. This includes all records with a time in the range 9:42 to 9:44.
There are a few options when setting the start and end points for a frame. These are listed below along with their description.
Frame start keywords |
UNBOUNDED PRECEDING | The first row of the partition |
| <number> PRECEDING | Either n rows before the current row (for rows-based frames) or n values less than the current row’s value (for range-based frames) | |
| CURRENT ROW | Either the current row (for rows-based frames) or the current row and its peer rows (for range-based frames) | |
| <number> FOLLOWING | Either n rows after the current row (for rows-based frames) or n values greater than the current row’s value (for range-based frames) | |
Frame end keywords |
UNBOUNDED FOLLOWING | The last row of the partition |
| <number> FOLLOWING | Either n rows after the current row (for rows-based frames) or n values greater than the current row’s value (for range-based frames) | |
| CURRENT ROW | Either the current row (for rows-based frames) or the current row and its peer rows (for range-based frames) | |
| <number> PRECEDING | Either n rows before the current row (for rows-based frames) or n values less than the current row’s value (for range-based frames) |
A Moving Average with Row-Based Frames
The query below is an illustration of a simple moving average. The partition by clause separates the rows based on the stock symbol, the ordering clause arranges the rows within each stock partition in descending order of time, and the row clause specifies a window frame with five observations before the current row.
SELECT
time,
symbol,
AVG(price_close) OVER
(
PARTITION BY symbol
ORDER BY time
ROWS 5 PRECEDING
) AS mov_avg_price
FROM trades
A Moving Average with Range-Based Frames
The query below is similar to the one above but with one important difference. We are using a range-based frame instead of a row. The range clause specifies a frame with all observations that have a time value within five minutes (300,000 milliseconds) of the current row. This includes observations before and after the current row.
SELECT
time,
symbol,
AVG(price_close) OVER
(
PARTITION BY symbol
ORDER BY time
RANGE BETWEEN 300000 PRECEDING AND 300000 FOLLOWING
) AS mov_avg_price
FROM trades
Ranking
So far we have seen aggregate functions that summarize the observations in a particular window frame. Ranking functions comprise another type of window functions. They are used to compare data points within a given series or between different series and determine their relative importance. In the query below, I am using a ranking function to rank the closing prices of different stocks. Notice how the ordering column is now the closing price and not time. This is because the rank is determined based on the ordering column.
SELECT
symbol,
time
price_close,
RANK() OVER
(
PARTITION BY symbol
ORDER BY price_close DESC
) AS high_price_rank
FROM trades
ORDER BY high_price_rank