array_agg
by Lak Lakshmanan
Exploring a powerful SQL pattern: ARRAY_AGG, STRUCT and UNNEST
It can be extremely cost-effective (both in terms of storage and in terms of query time) to use nested fields rather than flatten out all your data. Nested, repeated fields are very powerful, but the SQL required to query them looks a bit unfamiliar. So, it’s worth spending a little time with STRUCT, UNNEST and ARRAY_AGG. Using these three in combination also makes some kinds of queries much, much easier to write.
Task
Let’s take a BigQuery table of tropical cyclones. Here’s a preview of the table:
The task is to find the maximum usa_sshs (better known as “category”) reached by each North Atlantic hurricane (basin=NA) of the 2010 season and the time at which the category was first reached. I want to be able to say something like “Hurricane Danielle reached Category 4 at 18:00 UTC on 2010-08-27 when it was at (27.1, -60.1)”.
Here’s the solution query. In this article, I will build it piece-by-piece.
Where’s the hurricane?
My first step was to create a history of hurricane locations. Essentially, I want to get to:
We can filter by basin and season:
#standardsql
WITH hurricanes AS (
  SELECT NAME, iso_time, latitude, longitude, usa_sshs
  FROM `bigquery-public-data.noaa_hurricanes.hurricanes`
  WHERE season = '2010' AND basin = 'NA'
)
SELECT * FROM hurricanes LIMIT 5
But this gives us a jumble of rows that meet the necessary criteria. What we need is to get an ordered list of locations for each hurricane. Just adding a GROUP BY to the above query won’t work. (Why not? Try it out!)
This query, however, works:
#standardsql
WITH hurricanes AS (
  SELECT
    MIN(NAME) AS name,
    ARRAY_AGG(STRUCT(iso_time, latitude, longitude, usa_sshs) ORDER BY iso_time ASC) AS track
  FROM `bigquery-public-data.noaa_hurricanes.hurricanes`
  WHERE season = '2010' AND basin = 'NA'
  GROUP BY sid
)
SELECT * FROM hurricanes LIMIT 5
Let’s tease the query apart:
- We group by storm id, but when we group, we get a bunch of rows. Often what we’d do is an aggregation such as SUM() or AVG() of the rows in the group to come down to just one value per row of the result set.
- To retain all the rows in the group, use ARRAY_AGG(). In this array, we don’t want just one field, we want four. So, I make the four fields (time, lat, lon, hurricane strength) a struct. The struct allows me to retain the element-by-element relationship between these four columns.
- Order the array by time.
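To see the grouping behaviour in isolation, here is a minimal sketch on made-up data. The readings table, its storm ids, names and values are hypothetical and do not come from the NOAA dataset; only ARRAY_AGG, STRUCT and GROUP BY are used as in the query above.

```sql
#standardsql
-- Hypothetical toy data: three readings for two storms, deliberately out of order.
WITH readings AS (
  SELECT 's1' AS sid, 'ALPHA' AS name, 2 AS hour, 1 AS strength UNION ALL
  SELECT 's1', 'ALPHA', 1, 0 UNION ALL
  SELECT 's2', 'BETA', 1, 3
)
SELECT
  MIN(name) AS name,
  -- One array per storm; each element keeps hour and strength together.
  ARRAY_AGG(STRUCT(hour, strength) ORDER BY hour ASC) AS track
FROM readings
GROUP BY sid
```

Each sid should come back as a single row whose track array holds that storm's readings in hour order, so ALPHA's two readings end up in one array rather than two result rows.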
Maximum category
Now that we have each hurricane’s history, let’s find out the maximum category reached by the hurricane. What we want is:
Here’s the additional WITH:
WITH hurricanes AS (
  ...
),
cat_hurricane AS (
  SELECT
    name,
    track,
    (SELECT MAX(usa_sshs) FROM UNNEST(track)) AS category
  FROM hurricanes
  ORDER BY category DESC
)
SELECT * FROM cat_hurricane
Selecting the name from the hurricanes table is quite obvious. It’s just a column. But what does selecting track do? Because track is an array, you get the whole array.
To get at the individual rows of the track array, we need to go through UNNEST(). When you call UNNEST(track), it makes a table, which is why UNNEST() appears in the FROM clause of the subquery. Once you understand that UNNEST(track) makes a table with four columns (the four fields of the STRUCT), you see that MAX(usa_sshs) simply computes the maximum strength reached by each hurricane.
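A small sketch may make the "UNNEST makes a table" idea more concrete. The single hurricane row below is invented for illustration (the name ALPHA and the hour and strength fields are assumptions, not columns of the NOAA table); the subquery has the same shape as the one in cat_hurricane.

```sql
#standardsql
-- Hypothetical one-row table with a track array of (hour, strength) structs.
WITH hurricanes AS (
  SELECT
    'ALPHA' AS name,
    ARRAY<STRUCT<hour INT64, strength INT64>>[(1, 0), (2, 1), (3, 1)] AS track
)
SELECT
  name,
  -- UNNEST(track) behaves like a three-row table with columns hour and strength,
  -- so MAX() aggregates over the elements of this row's array only.
  (SELECT MAX(strength) FROM UNNEST(track)) AS category
FROM hurricanes
```

Because the subquery is evaluated once per outer row, each hurricane gets the maximum over its own array, not over the whole table.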
Time at which maximum category is reached
How do we find the time at which the maximum category is reached? Essentially, find all the rows in the UNNEST(track) table for which the usa_sshs column is the maximum category and limit it to 1, to get the first row at which category is met:
SELECT
  name,
  category,
  (SELECT AS STRUCT iso_time, latitude, longitude
   FROM UNNEST(track)
   WHERE usa_sshs = category
   ORDER BY iso_time
   LIMIT 1).*
FROM cat_hurricane
ORDER BY category DESC, name ASC
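To experiment with this step without scanning the public table, a self-contained sketch along these lines should work. The cat_hurricane row here is made up (the name, the string timestamps and the coordinates are all hypothetical, and iso_time is a STRING only to keep the literal short); the SELECT itself is unchanged from the query above.

```sql
#standardsql
-- Hypothetical stand-in for cat_hurricane: one storm whose maximum category (2)
-- is reached at 't2' and held at 't3'.
WITH cat_hurricane AS (
  SELECT
    'ALPHA' AS name,
    2 AS category,
    ARRAY<STRUCT<iso_time STRING, latitude FLOAT64, longitude FLOAT64, usa_sshs INT64>>[
      ('t1', 20.0, -50.0, 1),
      ('t2', 21.0, -51.0, 2),
      ('t3', 22.0, -52.0, 2)
    ] AS track
)
SELECT
  name,
  category,
  -- Keep only the elements at the maximum category, earliest first, and take one.
  (SELECT AS STRUCT iso_time, latitude, longitude
   FROM UNNEST(track)
   WHERE usa_sshs = category
   ORDER BY iso_time
   LIMIT 1).*
FROM cat_hurricane
```

With this toy data, the 't2' reading is the one that should be returned, since it is the earliest element whose usa_sshs equals the category.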
Here’s the full query. Do play around with some variants to understand what is happening:
- Why do I have the .*? Play around with the query to see what happens if I don’t include the .* (Hint: it has to do with the name of the column).
- What happens if I don’t do the AS STRUCT above?
- What happens if I don’t do the LIMIT 1?