CMU 15-445 Fall 2022 Walkthrough: Project 1 - Buffer Pool


Project guide

Project #1 - Buffer Pool | CMU 15-445/645 :: Intro to Database Systems (Fall 2022)

Task #1: Extendible Hash Table

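Start by getting familiar with the concept of an extendible hash table. The following article is a good reference:

Extendible Hashing (Dynamic approach to DBMS) - GeeksforGeeks

This blog post is also worth reading:

做个数据库：2022 CMU15-445 Project1 Buffer Pool Manager - 知乎 (zhihu.com)

That article uses integer keys to walk through, step by step, how an extendible hash table grows.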

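The main characteristics are these (the figure from the original post is not preserved):

The hashing works through a level of indirection: a directory of entries points to the buckets that hold the actual data. Each directory entry has a unique id. The hash function maps a key to one of these ids (the id a key maps to may change when the directory grows), and the id selects the bucket the data goes into.

Number of directory entries = 2 ^ Global Depth. Global Depth can also be read as the number of bits in each directory id. Local Depth belongs to an individual bucket: it is the number of low-order hash bits shared by every key in that bucket, so 2 ^ (Global Depth - Local Depth) directory entries point to it. How many key-value pairs a bucket can hold is a separate setting, bucket_size.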

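In the code BusTub provides, a Bucket stores its data in a list whose elements are pairs. A good first step is to implement the three functions of the Bucket class, Find, Insert and Remove, which only need to operate on that list.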
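The three bucket operations can be sketched as follows. This is a minimal illustration, not BusTub's actual class: `SimpleBucket`, `capacity_` and `IsFull` are names assumed here, and the real header may lay things out differently.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <utility>

// Simplified stand-in for BusTub's nested Bucket class (illustrative names).
template <typename K, typename V>
class SimpleBucket {
 public:
  explicit SimpleBucket(std::size_t capacity, int depth = 0) : capacity_(capacity), depth_(depth) {
    assert(capacity > 0);
  }

  auto GetDepth() const -> int { return depth_; }
  auto IsFull() const -> bool { return list_.size() >= capacity_; }

  // Linear scan; copies the value out on a hit.
  auto Find(const K &key, V &value) const -> bool {
    for (const auto &kv : list_) {
      if (kv.first == key) {
        value = kv.second;
        return true;
      }
    }
    return false;
  }

  // Erase the entry with a matching key, if there is one.
  auto Remove(const K &key) -> bool {
    for (auto it = list_.begin(); it != list_.end(); ++it) {
      if (it->first == key) {
        list_.erase(it);
        return true;
      }
    }
    return false;
  }

  // Update in place if the key exists; otherwise append unless the bucket is full.
  auto Insert(const K &key, const V &value) -> bool {
    for (auto &kv : list_) {
      if (kv.first == key) {
        kv.second = value;
        return true;
      }
    }
    if (IsFull()) {
      return false;
    }
    list_.emplace_back(key, value);
    return true;
  }

 private:
  std::size_t capacity_;
  int depth_;
  std::list<std::pair<K, V>> list_;
};
```

Note that updating an existing key succeeds even when the bucket is full, because the update does not add an element.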

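The functions to implement were shown as a screenshot in the original post (image not preserved).

Constructor of the extendible hash table:

The class has the following member variables: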

```
// TODO(student): You may add additional private members and helper functions and remove the ones
// you don't need.

int global_depth_;    // The global depth of the directory
size_t bucket_size_;  // The size of a bucket
int num_buckets_;     // The number of buckets in the hash table
mutable std::mutex latch_;
std::vector<std::shared_ptr<Bucket>> dir_;  // The directory of the hash table
```

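The default values show that the table starts with global_depth = 0 and num_buckets = 1. The constructor therefore needs to create a single Bucket with capacity bucket_size and an initial local_depth = global_depth = 0, and make the directory point to it.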
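A minimal sketch of that initial state is shown below. `DemoTable` and `DemoBucket` are hypothetical stand-ins introduced only for illustration; the accessors exist just so the state can be inspected.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Illustrative types only; BusTub's real classes are ExtendibleHashTable and its nested Bucket.
struct DemoBucket {
  std::size_t capacity;  // maximum number of key-value pairs
  int depth;             // local depth
};

class DemoTable {
 public:
  explicit DemoTable(std::size_t bucket_size) : bucket_size_(bucket_size) {
    assert(bucket_size > 0);
    // 2^0 = 1 directory entry, pointing at one empty bucket of local depth 0.
    dir_.push_back(std::make_shared<DemoBucket>(DemoBucket{bucket_size_, 0}));
  }

  auto GlobalDepth() const -> int { return global_depth_; }
  auto NumBuckets() const -> int { return num_buckets_; }
  auto DirSize() const -> std::size_t { return dir_.size(); }
  auto LocalDepth(std::size_t i) const -> int { return dir_[i]->depth; }

 private:
  int global_depth_{0};
  std::size_t bucket_size_;
  int num_buckets_{1};
  std::vector<std::shared_ptr<DemoBucket>> dir_;
};
```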

Find function of the extendible hash table:

```
/**
 *
 * TODO(P1): Add implementation
 *
 * @brief Find the value associated with the given key in the bucket.
 * @param key The key to be searched.
 * @param[out] value The value associated with the key.
 * @return True if the key is found, false otherwise.
 */
auto Find(const K &key, V &value) -> bool;
```

Compute the directory index from the key, then simply call Find on the bucket at that index.

Remove function of the extendible hash table:

```
/**
 *
 * TODO(P1): Add implementation
 *
 * @brief Given the key, remove the corresponding key-value pair in the bucket.
 * @param key The key to be deleted.
 * @return True if the key exists, false otherwise.
 */
auto Remove(const K &key) -> bool;
```

Compute the directory index from the key, then simply call Remove on the bucket at that index.
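Both table-level functions follow the same pattern: take the latch, compute the index, delegate to the bucket. The sketch below assumes a simplified `LookupDemo` class with a fixed-size directory and a helper `Put` that never splits; all of these names are illustrative and not part of BusTub.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

template <typename K, typename V>
class LookupDemo {
 public:
  using Bucket = std::list<std::pair<K, V>>;

  // Fixed directory with 2^global_depth distinct buckets (no splitting in this sketch).
  explicit LookupDemo(int global_depth) : global_depth_(global_depth) {
    assert(global_depth >= 0);
    for (std::size_t i = 0; i < (std::size_t{1} << global_depth); ++i) {
      dir_.push_back(std::make_shared<Bucket>());
    }
  }

  // Helper for the demo only: appends without checking capacity or duplicates.
  void Put(const K &key, const V &value) {
    std::lock_guard<std::mutex> lock(latch_);
    dir_[IndexOf(key)]->emplace_back(key, value);
  }

  auto Find(const K &key, V &value) -> bool {
    std::lock_guard<std::mutex> lock(latch_);
    for (const auto &kv : *dir_[IndexOf(key)]) {
      if (kv.first == key) {
        value = kv.second;
        return true;
      }
    }
    return false;
  }

  auto Remove(const K &key) -> bool {
    std::lock_guard<std::mutex> lock(latch_);
    Bucket &bucket = *dir_[IndexOf(key)];
    for (auto it = bucket.begin(); it != bucket.end(); ++it) {
      if (it->first == key) {
        bucket.erase(it);
        return true;
      }
    }
    return false;
  }

 private:
  // Low global_depth_ bits of the hash select the directory entry.
  auto IndexOf(const K &key) const -> std::size_t {
    return std::hash<K>()(key) & ((std::size_t{1} << global_depth_) - 1);
  }

  int global_depth_;
  std::mutex latch_;
  std::vector<std::shared_ptr<Bucket>> dir_;
};
```

In the real class the bucket's own Find and Remove would do the list scan; it is inlined here only to keep the sketch self-contained.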

Insert function of the extendible hash table:
```
/**
 *
 * TODO(P1): Add implementation
 *
 * @brief Insert the given key-value pair into the bucket.
 *      1. If a key already exists, the value should be updated.
 *      2. If the bucket is full, do nothing and return false.
 * @param key The key to be inserted.
 * @param value The value to be inserted.
 * @return True if the key-value pair is inserted, false otherwise.
 */
auto Insert(const K &key, const V &value) -> bool;
```
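This is probably the most complex function. According to the instructions, the cases to handle are:
- The bucket is not full:
  - the key already exists: update its value;
  - the key does not exist: insert it directly.
- The bucket is full, so carry out the following steps:
  - If the bucket's local_depth equals global_depth, increment global_depth and double the directory (that is, the number of directory entries doubles). If local_depth < global_depth, skip straight to the next two steps.
  - Increment the local_depth of the bucket.
  - Split the bucket, then redistribute the directory pointers and the key-value pairs.

Keep in mind that Insert is inherently recursive (or, equivalently, a retry loop): after a split, the pair still has to be inserted into one of the resulting buckets, and that bucket may be full again, so the process repeats until the insertion succeeds.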
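The steps above can be sketched as a retry loop. This is a simplified, single-threaded illustration under stated assumptions, not the author's or BusTub's implementation: `MiniExtendibleHash`, `Split` and `split_bit` are names made up here, the latch is omitted, and the split keeps the original bucket and creates one sibling.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <memory>
#include <utility>
#include <vector>

// Illustrative stand-in for ExtendibleHashTable; not BusTub's actual class.
template <typename K, typename V>
class MiniExtendibleHash {
  struct Bucket {
    int depth{0};                      // local depth
    std::list<std::pair<K, V>> items;  // key-value pairs
  };

 public:
  explicit MiniExtendibleHash(std::size_t bucket_size) : bucket_size_(bucket_size) {
    assert(bucket_size > 0);
    dir_.push_back(std::make_shared<Bucket>());
  }

  void Insert(const K &key, const V &value) {
    while (true) {
      // Copy the shared_ptr: Split may resize dir_ and invalidate references into it.
      std::shared_ptr<Bucket> bucket = dir_[IndexOf(key)];
      for (auto &kv : bucket->items) {
        if (kv.first == key) {  // key exists: update in place
          kv.second = value;
          return;
        }
      }
      if (bucket->items.size() < bucket_size_) {  // room left: plain insert
        bucket->items.emplace_back(key, value);
        return;
      }
      Split(bucket);  // full: split, then retry against the new directory state
    }
  }

  auto Find(const K &key, V &value) const -> bool {
    for (const auto &kv : dir_[IndexOf(key)]->items) {
      if (kv.first == key) {
        value = kv.second;
        return true;
      }
    }
    return false;
  }

  auto GlobalDepth() const -> int { return global_depth_; }
  auto NumBuckets() const -> int { return num_buckets_; }

 private:
  auto IndexOf(const K &key) const -> std::size_t {
    return std::hash<K>()(key) & ((std::size_t{1} << global_depth_) - 1);
  }

  void Split(const std::shared_ptr<Bucket> &bucket) {
    if (bucket->depth == global_depth_) {  // step 1: double the directory
      const std::size_t old_size = dir_.size();
      dir_.resize(old_size * 2);
      for (std::size_t i = 0; i < old_size; ++i) {
        dir_[i + old_size] = dir_[i];  // mirrored entry shares the same bucket
      }
      ++global_depth_;
    }
    // Step 2: raise the local depth; this hash bit now separates the two halves.
    const std::size_t split_bit = std::size_t{1} << bucket->depth;
    ++bucket->depth;
    auto sibling = std::make_shared<Bucket>();
    sibling->depth = bucket->depth;
    ++num_buckets_;
    // Step 3a: directory entries with the split bit set now point to the sibling.
    for (std::size_t i = 0; i < dir_.size(); ++i) {
      if (dir_[i] == bucket && (i & split_bit) != 0) {
        dir_[i] = sibling;
      }
    }
    // Step 3b: pairs whose hash has the split bit set move to the sibling as well.
    for (auto it = bucket->items.begin(); it != bucket->items.end();) {
      if ((std::hash<K>()(it->first) & split_bit) != 0) {
        sibling->items.push_back(std::move(*it));
        it = bucket->items.erase(it);
      } else {
        ++it;
      }
    }
  }

  std::size_t bucket_size_;
  int global_depth_{0};
  int num_buckets_{1};
  std::vector<std::shared_ptr<Bucket>> dir_;
};
```

A loop is used instead of literal recursion because a split can leave all pairs on one side, in which case the target bucket is still full and has to be split again.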

Because the extendible hash table has to derive a directory index from the key, here is the code that computes it:
```
template <typename K, typename V>
auto ExtendibleHashTable<K, V>::IndexOf(const K &key) -> size_t {
  int mask = (1 << global_depth_) - 1;
  return std::hash<K>()(key) & mask;
}
```
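As the code shows, the mask has its low global_depth bits set to 1 and every higher bit set to 0. For example, global_depth starts at 0, which gives mask = 0(b). After the first split caused by a full bucket, global_depth = 1 and mask = 01(b).

& mask: the bitwise AND of the hash value and the mask keeps the low global_depth_ bits of the hash and discards the higher bits. The result is the directory index.

If K = int, the hash function returns the integer itself in common standard library implementations such as libstdc++ (the C++ standard does not guarantee this; look up other types as needed). Suppose key = 3 = 011(b): then index = 011 & 001 = 1. For key = 2 = 010(b), index = 010 & 001 = 0. This matches the example in the article mentioned earlier: the low global_depth bits are what distinguish one key from another.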
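The arithmetic can be checked with a small helper that takes the hash value directly, so it does not depend on how std::hash behaves for a given type. `LowBits` is a hypothetical name used only for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// Keep only the low `global_depth` bits of a hash value (the same masking IndexOf performs).
constexpr auto LowBits(std::size_t hash, int global_depth) -> std::size_t {
  return hash & ((std::size_t{1} << global_depth) - 1);
}

static_assert(LowBits(3, 0) == 0, "depth 0: mask is 0, every key maps to entry 0");
static_assert(LowBits(3, 1) == 1, "011b & 001b = 1");
static_assert(LowBits(2, 1) == 0, "010b & 001b = 0");
static_assert(LowBits(6, 2) == 2, "110b & 011b = 10b");
```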

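The other question is how to grow dir. After the directory is doubled, every existing key must still resolve to the bucket it was in before. The course illustrates this with an example (the figure is not preserved here).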
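One common way to meet this requirement is to copy the old half of the directory into the new half, so that entry i + old_size shares its bucket with entry i. The sketch below assumes a free function `DoubleDirectory`, a name introduced here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Double the directory in place. The indices i and i + old_size differ only in the
// newly added top bit, so both must keep pointing at the bucket that entry i had.
template <typename BucketPtr>
void DoubleDirectory(std::vector<BucketPtr> &dir) {
  const std::size_t old_size = dir.size();
  assert(old_size > 0);
  dir.resize(old_size * 2);
  for (std::size_t i = 0; i < old_size; ++i) {
    dir[i + old_size] = dir[i];
  }
}
```

After doubling, a key that used to map to entry i maps to either i or i + old_size, depending on its next hash bit. Both entries point at the same bucket, so lookups keep working until that bucket is actually split.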

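RedistributeBucket function of the extendible hash table:

According to the documentation, this function is optional; whether you use it depends on how you implement the table.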

```
/**
 * @brief Redistribute the kv pairs in a full bucket.
 * @param bucket The bucket to be redistributed.
 */
auto RedistributeBucket(std::shared_ptr<Bucket> bucket) -> void;
```

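There are two ways to handle this. The first keeps the original bucket and creates one new bucket. When redistributing the elements of the original bucket, note that any element whose recomputed index no longer points to the original bucket has to be removed from it.

The second creates two new buckets and distributes the pairs between them, so there is no need to worry about duplicated key-value pairs.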
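The second approach can be sketched as a pure partition of the pairs into two fresh lists. This is an illustration under assumptions, not the author's code: `SplitIntoTwo` and its `old_depth` parameter are hypothetical, and the hasher is a template parameter so the behaviour can be tested with any hash.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <utility>

// Distribute the pairs of a full bucket into two fresh lists, based on the hash bit
// at position `old_depth` (the bucket's local depth before the split).
template <typename K, typename V, typename Hash = std::hash<K>>
auto SplitIntoTwo(const std::list<std::pair<K, V>> &items, int old_depth, Hash hasher = Hash())
    -> std::pair<std::list<std::pair<K, V>>, std::list<std::pair<K, V>>> {
  assert(old_depth >= 0);
  const std::size_t split_bit = std::size_t{1} << old_depth;
  std::pair<std::list<std::pair<K, V>>, std::list<std::pair<K, V>>> result;
  for (const auto &kv : items) {
    if ((hasher(kv.first) & split_bit) == 0) {
      result.first.push_back(kv);   // bit clear: stays with the "0" bucket
    } else {
      result.second.push_back(kv);  // bit set: goes to the "1" bucket
    }
  }
  return result;
}
```

The directory entries that pointed to the old bucket then have to be re-pointed to the two new buckets by the same bit; the old bucket is released once nothing references it.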

First version of the split function:

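```
// Split. The directory has already been doubled before this function is called.
// Because buckets are held through shared_ptr, dir currently contains two identical sets of bucket
// pointers (only the two entries that share the full bucket are needed to carry out the split).
// Divide the original bucket into two and put its [key, value] pairs in the right place.
template <typename K, typename V>
auto ExtendibleHashTable<K, V>::RedistributeBucket(std::shared_ptr<Bucket> bucket) -> void {
  auto new_bucket = std::make_shared<Bucket>(bucket_size_, bucket->GetDepth());
  int mask =
```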
