The Big Data Benchmark from AMPLab, UC Berkeley

This content is licensed under CC 4.0 BY-SA.

Original link: https://www.systutorials.com/big-data-benchmark-from-amplab-of-uc-berkeley/

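The Big Data Benchmark published by UC Berkeley's AMPLab compares five systems (Redshift, Hive, Shark, Impala, and Stinger/Tez) on the response time of relational queries covering scans, aggregations, joins, and UDFs. Its goal is to provide quantitative and qualitative comparisons of big data analytics systems.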

Benchmarks are important for understanding the performance of different systems and for comparing them quantitatively and qualitatively. Many analytic frameworks, such as Hive, Impala, and Shark, have been designed and implemented in recent years and have become fundamental software for processing big data. How to benchmark these big data analytic systems is an interesting problem.

The Big Data Benchmark

At the time of writing, the Big Data Benchmark from AMPLab, UC Berkeley provides quantitative and qualitative comparisons of five systems: Redshift, a hosted MPP database offered by Amazon.com and based on the ParAccel data warehouse; Hive, a Hadoop-based data warehousing system; Shark, a Hive-compatible SQL engine that runs on top of the Spark computing framework; Impala, a Hive-compatible SQL engine with its own MPP-like execution engine; and Stinger/Tez, where Tez is a next-generation Hadoop execution engine still in development.

What is being evaluated

As stated on the benchmark website:

This benchmark measures response time on a handful of relational queries: scans, aggregations, joins, and UDFs, across different data sizes. Keep in mind that these systems have very different sets of capabilities. MapReduce-like systems (Shark/Hive) target flexible and large-scale computation, supporting complex User Defined Functions (UDFs), tolerating failures, and scaling to thousands of nodes. Traditional MPP databases are strictly SQL compliant and heavily optimized for relational queries. The workload here is simply one set of queries that most of these systems can complete.
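To make the four query categories concrete, here is a minimal sketch of timing a scan, an aggregation, a join, and a UDF query. It is not the benchmark's actual harness or query set: SQLite (from Python's standard library) stands in for the systems under test, and the table names (`rankings`, `uservisits`), columns, data, and the `url_length` UDF are all hypothetical, chosen only for illustration.

```python
import sqlite3
import statistics
import time


def build_demo_db(n_pages=1000, n_visits=5000):
    """Create a tiny in-memory database with synthetic (made-up) data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER)")
    conn.execute(
        "CREATE TABLE uservisits (sourceIP TEXT, destURL TEXT, adRevenue REAL)"
    )
    conn.executemany(
        "INSERT INTO rankings VALUES (?, ?)",
        [(f"url{i}", i % 100) for i in range(n_pages)],
    )
    conn.executemany(
        "INSERT INTO uservisits VALUES (?, ?, ?)",
        [
            (f"10.0.{i % 50}.{i % 7}", f"url{i % n_pages}", (i % 10) * 0.5)
            for i in range(n_visits)
        ],
    )
    # Hypothetical user-defined function: length of a URL string.
    conn.create_function("url_length", 1, len)
    conn.commit()
    return conn


# One illustrative query per category named in the quoted passage.
QUERIES = {
    "scan": "SELECT pageURL, pageRank FROM rankings WHERE pageRank > 90",
    "aggregation": (
        "SELECT substr(sourceIP, 1, 6), SUM(adRevenue) "
        "FROM uservisits GROUP BY substr(sourceIP, 1, 6)"
    ),
    "join": (
        "SELECT uv.sourceIP, SUM(uv.adRevenue), AVG(r.pageRank) "
        "FROM rankings r JOIN uservisits uv ON r.pageURL = uv.destURL "
        "GROUP BY uv.sourceIP"
    ),
    "udf": "SELECT SUM(url_length(pageURL)) FROM rankings",
}


def time_query(conn, sql, trials=3):
    """Run a query several times; return (median seconds, row count)."""
    samples = []
    rows = []
    for _ in range(trials):
        start = time.perf_counter()
        rows = conn.execute(sql).fetchall()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), len(rows)


def run_benchmark(conn):
    """Time every query in QUERIES against the given connection."""
    return {name: time_query(conn, sql) for name, sql in QUERIES.items()}


if __name__ == "__main__":
    connection = build_demo_db()
    for name, (seconds, n_rows) in run_benchmark(connection).items():
        print(f"{name:12s} {seconds * 1000:8.3f} ms  {n_rows} rows")
```

Taking the median over a few trials is one simple way to damp run-to-run noise; a real comparison across systems would also need to control for cluster size, data format, and caching, which this toy setup ignores.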

Datasets

The dataset is an important part of a benchmark because others need it to reproduce or verify the results. The Big Data Benchmark hosts its datasets on S3. The largest dataset is around 270 GB and is intended for the 5-node tests. The datasets were generated using Intel's Hadoop Benchmark Suite (HiBench) together with data sampled from the Common Crawl document corpus.
