Reading Notes: "Bigtable: A Distributed Storage System for Structured Data"

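Bigtable is a distributed storage system for managing structured data, used widely across Google's products. Its data model is a sparse, distributed, persistent, multidimensional sorted map indexed by row key, column key, and timestamp. The system builds on the GFS distributed file system and the Chubby distributed lock service. The implementation consists of a client library, one master server, and many tablet servers; each tablet holds a contiguous range of rows. Bigtable offers atomic reads and writes within a single row, efficient tablet location, and several compaction strategies, in pursuit of high availability and high performance.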


Takeaways

  1. Bigtable's data model is a sparse, distributed, persistent multidimensional sorted map, indexed by a row key, a column key, and a timestamp.
    (Data model: a map keyed by row key, column key, and timestamp.)
  2. Bigtable uses the distributed Google File System (GFS) to store log and data files.
    (Storage: GFS holds Bigtable's logs and data.)
  3. Bigtable relies on a highly-available and persistent distributed lock service called Chubby.
    (Coordination: Chubby is the highly available, persistent lock service behind Bigtable.)
  4. The Bigtable implementation has three major components: a library that is linked into every client, one master server, and many tablet servers.
    (Components: a library linked into every client, one master, and many tablet servers.)
  5. We use a three-level hierarchy analogous to that of a B+ tree to store tablet location information.
    (Tablet location: a three-level, B+-tree-like structure stores where each tablet lives.)
  6. Using Chubby, the master keeps track of the set of live tablet servers, and it also tracks the current assignment of tablets to tablet servers, including which tablets are unassigned.
    (Tablet assignment: the master tracks live tablet servers through Chubby, plus which tablets are assigned to which server and which are unassigned.)
  7. Tablet serving covers three things: tablet recovery, write operations, and read operations.
    (Tablet serving: recovery, writes, reads.)
  8. Compaction comes in three kinds: minor, merging, and major.
    (Compaction: minor, merging, major.)

Introduction

  1. Bigtable is a distributed storage system for managing structured data at Google.
    (A distributed storage system for structured data.)
  2. Bigtable has achieved several goals: wide applicability, scalability, high performance, and high availability.
    (It achieves wide applicability, scalability, high performance, and high availability.)
  3. These products use Bigtable for a variety of demanding workloads, which range from throughput-oriented batch-processing jobs to latency-sensitive serving of data to end users.
    (Workloads range from throughput-oriented batch jobs to latency-sensitive serving.)
  4. Bigtable provides clients with a simple data model that supports dynamic control over data layout and format, and allows clients to reason about the locality properties of the data represented in the underlying storage.
    (A simple data model that gives clients dynamic control over data layout and format, and lets them reason about data locality.)
  5. Bigtable schema parameters let clients dynamically control whether to serve data out of memory or from disk.
    (Schema parameters let a client choose whether data is served from memory or from disk.)

Data Model

Map
  1. A Bigtable is a sparse, distributed, persistent multidimensional sorted map.
    (A sparse, distributed, persistent, multidimensional sorted map.)
  2. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.
    (Indexed by row key, column key, and timestamp; each value is a byte array that Bigtable does not interpret.)

(row:string, column:string, time:int64) → string
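
The signature above can be read as a sorted map with a three-part key. The following is a minimal in-memory sketch of that idea, not Bigtable's actual implementation: `CellKey`, `CellKeyLess`, and `ToyTable` are hypothetical names, and ordering timestamps newest-first is an assumption made here so that the latest version of a cell is found first.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <map>
#include <string>

// Hypothetical names throughout; this is not Bigtable's real API.
struct CellKey {
  std::string row;
  std::string column;
  std::int64_t timestamp;
};

// Order: row ascending, column ascending, timestamp descending,
// so the newest version of a cell is encountered first.
struct CellKeyLess {
  bool operator()(const CellKey& a, const CellKey& b) const {
    if (a.row != b.row) return a.row < b.row;
    if (a.column != b.column) return a.column < b.column;
    return a.timestamp > b.timestamp;  // newer versions sort first
  }
};

class ToyTable {
 public:
  void Put(const std::string& row, const std::string& column,
           std::int64_t timestamp, const std::string& value) {
    cells_[CellKey{row, column, timestamp}] = value;
  }

  // Returns the most recent version of the cell, or nullptr if none exists.
  const std::string* GetLatest(const std::string& row,
                               const std::string& column) const {
    // The largest possible timestamp sorts before every stored version.
    const CellKey probe{row, column,
                        std::numeric_limits<std::int64_t>::max()};
    auto it = cells_.lower_bound(probe);
    if (it == cells_.end() || it->first.row != row ||
        it->first.column != column) {
      return nullptr;
    }
    return &it->second;
  }

 private:
  std::map<CellKey, std::string, CellKeyLess> cells_;
};
```

The map is sparse in the sense that only cells that were actually written occupy an entry; absent (row, column) pairs cost nothing.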

Webtable: a large collection of web pages and related information

[Figure: a slice of the example Webtable]

  1. In Webtable, the row name is a reversed URL.
    (The row key is the URL with its hostname components reversed.)
  2. The contents column family contains the page contents.
    (The contents column family holds the page contents.)
  3. The anchor column family contains the text of any anchors that reference the page. CNN's home page is referenced by both the Sports Illustrated and the MY-look home pages, so the row contains columns named anchor:cnnsi.com and anchor:my.look.ca.
    (The anchor column family holds the anchor text of links that point at the page, here from Sports Illustrated and MY-look.)
  4. Each anchor cell has one version; the contents column has three versions, at timestamps t3, t5, and t6.
    (Each anchor cell has a single version, while the contents column has three.)

Rows
  1. Every read or write of data under a single row key is atomic (regardless of the number of different columns being read or written in the row), a design decision that makes it easier for clients to reason about the system's behavior in the presence of concurrent updates to the same row.
    (Reads and writes under one row key are atomic no matter how many columns they touch.)
    (This makes concurrent updates to the same row easier for clients to reason about.)
  2. The row range of a table is dynamically partitioned. Each row range is called a tablet, which is the unit of distribution and load balancing.
    (The rows in one row range form a tablet, the unit of distribution and load balancing.)
  3. Bigtable maintains data in lexicographic order by row key. As a result, reads of short row ranges are efficient and typically require communication with only a small number of machines.
    (Because row keys are kept in lexicographic order, reading a short row range is efficient and involves few machines.)
  4. For example, in Webtable, pages in the same domain are grouped together into contiguous rows by reversing the hostname components of the URLs: we store data for maps.google.com/index.html under the key com.google.maps/index.html.
    (In Webtable, reversing the hostname components of each URL places pages from the same domain in contiguous rows.)
    (For example, maps.google.com/index.html is stored under com.google.maps/index.html.)
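
The row-key convention in item 4 can be sketched as a small helper. `ReverseHostKey` is a hypothetical name, and the sketch assumes the input URL has no scheme prefix such as `http://`, as in the example above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: turns "maps.google.com/index.html" into
// "com.google.maps/index.html". Assumes the URL has no scheme prefix.
std::string ReverseHostKey(const std::string& url) {
  const std::string::size_type slash = url.find('/');
  const std::string host = url.substr(0, slash);
  const std::string path =
      (slash == std::string::npos) ? std::string() : url.substr(slash);

  // Split the hostname on '.'.
  std::vector<std::string> parts;
  std::string::size_type start = 0;
  while (true) {
    const std::string::size_type dot = host.find('.', start);
    if (dot == std::string::npos) {
      parts.push_back(host.substr(start));
      break;
    }
    parts.push_back(host.substr(start, dot - start));
    start = dot + 1;
  }

  // Join the components in reverse order and re-attach the path.
  std::string key;
  for (auto it = parts.rbegin(); it != parts.rend(); ++it) {
    if (!key.empty()) key += '.';
    key += *it;
  }
  return key + path;
}
```

Because keys are sorted lexicographically, every key produced for hosts under google.com shares the prefix `com.google.`, which is what keeps those rows adjacent.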

Column Families
  1. Column keys are grouped into sets called column families, which form the basic unit of access control.
    (A column family is a group of related column keys and is the unit of access control.)
  2. A column family must be created before data can be stored under any column key in that family; after a family has been created, any column key within the family can be used.
    (A column key can be used only after its column family has been created.)
  3. It is our intent that the number of distinct column families in a table be small (in the hundreds at most), and that families rarely change during operation. In contrast, a table may have an unbounded number of columns.
    (A table should have at most a few hundred column families, and they rarely change during operation; the number of columns is unbounded.)
  4. A column key is named using the following syntax: family:qualifier.
    (For example, anchor:cnnsi.com.)
  5. Access control and both disk and memory accounting are performed at the column-family level.
    (Access control, and the accounting of disk and memory usage, are done per column family.)
    (In Webtable, for instance, this allows some applications to add new base data, others to read)
    (base data and create derived column families, and others only to view existing data.)

Timestamps
  1. Each cell in a Bigtable can contain multiple versions of the same data; these versions are indexed by timestamp.
    (A cell may hold several versions of the same data, indexed by timestamp.)
  2. Bigtable timestamps are 64-bit integers. When assigned by Bigtable they represent "real time" in microseconds; client applications may also assign them explicitly, and applications that need to avoid collisions must generate unique timestamps themselves.
    (Timestamps are 64-bit integers, in microseconds when Bigtable assigns them; uniqueness is the application's job if it needs to avoid collisions.)
  3. The client can specify either that only the last n versions of a cell be kept, or that only new-enough versions be kept.
    (Two garbage-collection settings: keep the newest n versions, or keep only versions that are new enough.)
  4. In our Webtable example, we set the timestamps of the crawled pages stored in the contents: column to the times at which these page versions were actually crawled. The garbage-collection mechanism described above lets us keep only the most recent three versions of every page.
    (In Webtable, the timestamps in contents: are the crawl times, and only the three most recent versions are kept.)
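
The two garbage-collection settings in item 3 can be illustrated on the versions of a single cell. This is only a sketch: `CellVersions`, `KeepLastN`, and `KeepNewEnough` are hypothetical names, and the cell is modelled as an in-memory map ordered newest-first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>

// Versions of one cell, ordered newest first (hypothetical representation).
using CellVersions =
    std::map<std::int64_t, std::string, std::greater<std::int64_t>>;

// Policy 1: keep only the newest `n` versions.
void KeepLastN(CellVersions& versions, std::size_t n) {
  auto it = versions.begin();
  for (std::size_t kept = 0; kept < n && it != versions.end(); ++kept) ++it;
  versions.erase(it, versions.end());
}

// Policy 2: keep only versions whose timestamp is >= `oldest_allowed`.
void KeepNewEnough(CellVersions& versions, std::int64_t oldest_allowed) {
  // With descending order, upper_bound returns the first timestamp that is
  // strictly older than `oldest_allowed`.
  versions.erase(versions.upper_bound(oldest_allowed), versions.end());
}
```

With `n = 3`, `KeepLastN` corresponds to the Webtable setting described in item 4.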

API

RowMutation
// Open the table
Table *T = OpenOrDie("/bigtable/web/webtable");
// Single-row mutation: r1 targets row "com.cnn.www" of table T
RowMutation r1(T, "com.cnn.www");
// Add a new anchor to the anchor column family
r1.Set("anchor:www.c-span.org", "CNN");
// Delete an anchor from the anchor column family
r1.Delete("anchor:www.abc.com");
Operation op;
// Apply the mutation r1
Apply(&op, &r1);

Figure 2: Writing to Bigtable.

Figure 2 shows C++ code that uses a RowMutation abstraction to perform a series of updates. The call to Apply performs an atomic mutation to the Webtable.
(In Figure 2, a RowMutation collects a series of updates to one row, and Apply performs them atomically.)

Scanner
// Define a scanner bound to table T
Scanner scanner(T);
ScanStream *stream;
// Restrict the scan to the anchor column family
stream = scanner.FetchColumnFamily("anchor");
stream->SetReturnAllVersions();
// Restrict the scan to the row "com.cnn.www"
scanner.Lookup("com.cnn.www");
for (; !stream->Done(); stream->Next()) {
  printf("%s %s %lld %s\n",
         scanner.RowName(),
         stream->ColumnName(),
         stream->MicroTimestamp(),
         stream->Value());
}

Figure 3: Reading from Bigtable.

Figure 3 shows C++ code that uses a Scanner abstraction to iterate over all anchors in a particular row. Clients can iterate over multiple column families, and there are several mechanisms for limiting the rows, columns, and timestamps produced by a scan.
(In Figure 3, a Scanner iterates over the anchor column family, restricted to one row.)
(In other cases the restriction can be on columns or timestamps instead.)

Other API
  1. Bigtable supports single-row transactions, which can be used to perform atomic read-modify-write sequences on data stored under a single row key.
    (Single-row transactions: atomic read-modify-write sequences on the data under one row key.)

  2. Bigtable allows cells to be used as integer counters.
    (Cells can serve as integer counters.)

  3. Bigtable supports the execution of client-supplied scripts based on Sawzall in the address spaces of the servers. The Sawzall-based API does not allow client scripts to write back into Bigtable, but it does allow various forms of data transformation.
    (Client-supplied Sawzall scripts run in the servers' address spaces.)
    (Such scripts cannot write back into Bigtable, but they can transform data in various ways.)

  4. We have written a set of wrappers that allow a Bigtable to be used both as an input source and as an output target for MapReduce jobs.
    (A set of wrappers lets a Bigtable act as the input source or the output target of MapReduce jobs.)

Building Blocks

Bigtable's distributed environment
  1. Bigtable processes often share the same machines with processes from other applications.
    (Bigtable processes share a pool of machines with processes from other distributed applications.)
  2. Bigtable depends on a cluster management system for scheduling jobs, managing resources on shared machines, dealing with machine failures, and monitoring machine status.
    (A cluster management system schedules jobs, manages resources on shared machines, handles machine failures, and monitors machine status.)

GFS: the underlying distributed file system
  1. Bigtable uses the distributed Google File System (GFS) to store log and data files.
    (Bigtable stores its logs and data files in GFS.)

  2. The Google SSTable file format is used internally to store Bigtable data.
    (Bigtable data is stored internally in the SSTable file format.)

    (1) Map: An SSTable provides a persistent, ordered immutable map from keys to values.
    (An SSTable is a persistent, ordered, immutable key-to-value map.)
    (2) Operations: Operations are provided to look up the value associated with a specified key, and to iterate over all key/value pairs in a specified key range.
    (It supports looking up the value for a given key and iterating over all key/value pairs in a given key range.)
    (3) Layout: Each SSTable contains a sequence of blocks (typically each block is 64KB in size, but this is configurable). A block index is used to locate blocks; the index is loaded into memory when the SSTable is opened.
    (Each SSTable holds a sequence of blocks plus an index used to locate them.)
    (The index is loaded into memory when the SSTable is opened.)
    (4) Lookup: We first find the appropriate block by performing a binary search in the in-memory index, and then read the appropriate block from disk.
    (A lookup first finds the block in the in-memory index, then reads that block from disk.)
    (Note: an SSTable can also be mapped entirely into memory, which removes the disk read.)
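
The lookup in step (4) can be sketched as a binary search over an in-memory block index. The notes do not describe the index layout, so the one used here (each entry stores the last key of its block and the block's byte offset) is an assumption, and `BlockIndexEntry` and `FindBlock` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical index entry: the last key stored in a block and the block's
// byte offset in the SSTable file.
struct BlockIndexEntry {
  std::string last_key;
  std::size_t offset;
};

// Returns the position in `index` of the only block that could hold `key`,
// or index.size() if `key` sorts after the last block.
// `index` must be sorted by last_key.
std::size_t FindBlock(const std::vector<BlockIndexEntry>& index,
                      const std::string& key) {
  auto it = std::lower_bound(
      index.begin(), index.end(), key,
      [](const BlockIndexEntry& entry, const std::string& k) {
        return entry.last_key < k;
      });
  return static_cast<std::size_t>(it - index.begin());
}
```

Only the block found this way has to be read from disk, which is why a lookup needs a single disk access once the index is in memory.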

Chubby: the distributed lock service
  1. Bigtable relies on a highly-available and persistent distributed lock service called Chubby.
    (Chubby is the highly available, persistent distributed lock service that Bigtable depends on.)
  2. A Chubby service consists of five active replicas, one of which is elected to be the master and actively serve requests. The service is live when a majority of the replicas are running and can communicate with each other. Chubby uses the Paxos algorithm to keep its replicas consistent.
    (A Chubby service has five replicas, one of which is elected master and serves requests.)
    (The service is live while a majority of replicas are running and can reach each other; Paxos keeps the replicas consistent in the face of failure.)
  3. Chubby provides a namespace that consists of directories and small files. Each directory or file can be used as a lock, and reads and writes to a file are atomic. The Chubby client library provides consistent caching of Chubby files.
    (Chubby exposes a namespace of directories and small files; each can be used as a lock, and file reads and writes are atomic.)
    (The Chubby client library keeps a consistent cache of Chubby files.)
  4. Each Chubby client maintains a session with a Chubby service. When a client's session expires, it loses any locks and open handles. Chubby clients can also register callbacks on Chubby files and directories for notification of changes or session expiration.
    (Each client keeps a session with the service. When the session expires, the client loses all its locks and open handles.)
    (Clients can also register callbacks on files and directories to be notified of changes or of session expiration.)
  5. Bigtable uses Chubby for a variety of tasks: to ensure that there is at most one active master at any time; to store the bootstrap location of Bigtable data; to discover tablet servers and finalize tablet server deaths; to store Bigtable schema information (the column family information for each table); and to store access control lists.
    (Where Bigtable uses Chubby: ensuring at most one active master, storing the bootstrap location of the data,)
    (discovering tablet servers and finalizing their deaths, storing each table's column family information, and storing access control lists.)

Implementation

Major components of the Bigtable implementation
  1. The Bigtable implementation has three major components: a library that is linked into every client, one master server, and many tablet servers.
    (The three major components: a library linked into every client, one master server, and many tablet servers.)

  2. The master is responsible for:
    (1) assigning tablets to tablet servers;
    (2) detecting the addition and expiration of tablet servers;
    (3) balancing tablet-server load (tablet servers can be dynamically added to or removed from a cluster as the workload changes);
    (4) garbage collection of files in GFS;
    (5) handling schema changes such as table and column family creations.

  3. Client communication
    As with many single-master distributed storage systems, client data does not move through the master: clients communicate directly with tablet servers for reads and writes. Because Bigtable clients do not rely on the master for tablet location information, most clients never communicate with the master, so the master is lightly loaded in practice.
    (As in many single-master distributed storage systems, client data bypasses the master and goes straight to the tablet servers.)
    (Since clients do not depend on the master to locate tablets, the load on the master is small.)

Tablet Location

  1. We use a three-level hierarchy analogous to that of a B+ tree to store tablet location information:
    (A three-level, B+-tree-like structure stores tablet locations.)
    (1) The first level is a file stored in Chubby that contains the location of the root tablet.
    (Level one is a Chubby file holding the location of the root tablet.)
    (2) The second level is the root tablet, which contains the location of all tablets in a special METADATA table. Each METADATA tablet contains the location of a set of user tablets.
    (Level two is the root tablet, which holds the locations of all tablets of the special METADATA table.)
    (Each METADATA tablet holds the locations of a set of user tablets.)
    (3) The third level is the other METADATA tablets.
    (Level three is the remaining METADATA tablets. The root tablet is treated specially:)
    (it is never split, which guarantees that the hierarchy has no more than three levels.)
    (4) The METADATA table stores the location of a tablet under a row key that is an encoding of the tablet's table identifier and its end row.
    (The METADATA table stores each tablet's location under a row key built from the tablet's table identifier and its end row.)

[Figure: tablet location hierarchy]

  2. The client library caches tablet locations. If the client does not know the location of a tablet, or if it discovers that cached location information is incorrect, then it recursively moves up the tablet location hierarchy.
    (The library linked into the client caches tablet locations. If the client has no location, or the cached one turns out to be wrong,)
    (it recursively moves up one level in the hierarchy.)
    There are two cases:
    (1) If the client's cache is empty, the location algorithm requires three network round-trips, including one read from Chubby.
    (With an empty cache, three round trips are needed, presumably one per level of the hierarchy:)
    (the Chubby file, the root tablet, and a METADATA tablet.)
    (2) If the client's cache is stale, the location algorithm could take up to six round-trips, because stale cache entries are only discovered upon misses.
    (With a stale cache, up to six round trips may be needed, because a stale entry is only discovered when a lookup misses:)
    (presumably the client can spend round trips on stale entries at each level on the way up)
    (before it walks back down the hierarchy with fresh reads.)

  3. Although tablet locations are stored in memory, so no GFS accesses are required, we further reduce this cost in the common case by having the client library prefetch tablet locations: it reads the metadata for more than one tablet whenever it reads the METADATA table.
    (Tablet locations are already held in memory, so no GFS access is needed.)
    (To cut the cost further, the client library prefetches: each read of the METADATA table fetches the metadata of more than one tablet.)
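
Keying METADATA rows by table identifier and end row means that finding the tablet for a given row is an ordered lookup. The sketch below models only that last step with an in-memory map. `MetadataIndex` and `LocateTablet` are hypothetical names, the string-pair key stands in for the real row-key encoding, and the server addresses are made up for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical METADATA contents: (table id, tablet end row) -> server address.
// A tablet covers every row after the previous tablet's end row, up to and
// including its own end row.
using MetadataKey = std::pair<std::string, std::string>;
using MetadataIndex = std::map<MetadataKey, std::string>;

// Returns the address of the tablet server holding `row` of `table`, or an
// empty string if no tablet of that table covers the row.
std::string LocateTablet(const MetadataIndex& metadata,
                         const std::string& table, const std::string& row) {
  // The first tablet whose end row is >= `row` is the one that contains it.
  auto it = metadata.lower_bound(MetadataKey(table, row));
  if (it == metadata.end() || it->first.first != table) return "";
  return it->second;
}
```

In this sketch a row past the last end row is simply not found; a real table would presumably give its last tablet an unbounded end row.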

Tablet Assignment

Using Chubby, the master keeps track of the set of live tablet servers, and it also tracks the current assignment of tablets to tablet servers, including which tablets are unassigned. When a tablet is unassigned, and a tablet server with sufficient room for the tablet is available, the master assigns the tablet by sending a tablet load request to the tablet server.
(The master tracks live tablet servers through Chubby, along with which tablets are assigned to which server and which are unassigned.)

(When a tablet is unassigned and some tablet server has enough room for it,)
(the master sends that tablet server a tablet load request. The assignment process in detail:)

[Two figures: the tablet assignment process]

  1. When a tablet server starts, it creates, and acquires an exclusive lock on, a uniquely-named file in a specific Chubby directory.
    (On startup, a tablet server creates a uniquely named file in a specific Chubby directory and takes an exclusive lock on it.)
  2. The master monitors this directory (the servers directory) to discover tablet servers.
    (The master watches this directory to learn which tablet servers are alive.)
  3. A tablet server stops serving its tablets if it loses its exclusive lock.
    (A tablet server that loses its lock stops serving its tablets.)
  4. A tablet server will attempt to reacquire an exclusive lock on its file as long as the file still exists. If the file no longer exists, then the tablet server will never be able to serve again, so it kills itself.
    (After losing the lock, a tablet server tries to reacquire it for as long as its file exists.)
    (If the file is gone, the tablet server can only kill itself.)
  5. Whenever a tablet server terminates (e.g., because the cluster management system is removing the tablet server's machine from the cluster), it attempts to release its lock so that the master will reassign its tablets more quickly.
    (When a tablet server terminates, for example because the cluster manager removes its machine,)
    (it tries to release its lock so that the master can reassign its tablets sooner.)

Details for steps 1-2 (master startup)
  1. The master grabs a unique master lock in Chubby, which prevents concurrent master instantiations.
    (At startup the master first acquires a unique master lock, which prevents concurrent master instances.)
  2. The master scans the servers directory in Chubby to find the live servers.
    (The master reads the servers directory to find the live servers.)
  3. The master communicates with every live tablet server to discover what tablets are already assigned to each server.
    (The master asks each live server which tablets it already holds.)
  4. The master scans the METADATA table to learn the set of tablets. Whenever this scan encounters a tablet that is not already assigned, the master adds the tablet to the set of unassigned tablets.
    (The master then scans the METADATA table to learn the full set of tablets, and records those not yet assigned.)
  5. One complication is that the scan of the METADATA table cannot happen until the METADATA tablets have been assigned. Therefore, before starting this scan, the master adds the root tablet to the set of unassigned tablets. This addition ensures that the root tablet will be assigned.
    (One catch: the METADATA table cannot be scanned until the METADATA tablets have been assigned.)
    (So before scanning, the master puts the root tablet into the unassigned set, which guarantees that it gets assigned.)
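
Steps 3 and 4 amount to a set difference: tablets listed in METADATA minus tablets that some live server reports. A minimal sketch under that reading follows; `FindUnassigned` and the string tablet identifiers are hypothetical, and the root-tablet complication from step 5 is not modelled.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// All names are hypothetical. `reported` maps each live tablet server to the
// tablets it says it is serving; `all_tablets` is what the METADATA scan found.
std::set<std::string> FindUnassigned(
    const std::map<std::string, std::vector<std::string>>& reported,
    const std::set<std::string>& all_tablets) {
  // Collect every tablet that some live server already holds.
  std::set<std::string> assigned;
  for (const auto& server : reported) {
    assigned.insert(server.second.begin(), server.second.end());
  }
  // Anything in METADATA that no server reported is unassigned.
  std::set<std::string> unassigned;
  for (const auto& tablet : all_tablets) {
    if (assigned.count(tablet) == 0) unassigned.insert(tablet);
  }
  return unassigned;
}
```

The master would then hand each tablet in the returned set to a tablet server with sufficient room, as described at the start of this section.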

Details for steps 3-5 (detecting failed tablet servers)
  1. To detect when a tablet server is no longer serving its tablets, the master periodically asks each tablet server for the status of its lock.
    (To detect a tablet server that has stopped serving, the master periodically asks each server about its lock.)
  2. If a tablet server reports that it has lost its lock, or if the master was unable to reach a server during its last several attempts, the master attempts to acquire an exclusive lock on the server's file.
    (If a tablet server reports a lost lock, or the master cannot reach it after several attempts, the master tries to acquire an exclusive lock on that server's file.)
  3. If the master is able to acquire the lock, then Chubby is live and the tablet server is either dead or having trouble reaching Chubby, so the master ensures that the tablet server can never serve again by deleting its server file. Once a server's file has been deleted, the master can move all the tablets that were previously assigned to that server into the set of unassigned tablets.
    (If the master gets the lock, the problem lies with the server itself, so the master deletes the server's file to keep it from serving again,)
    (and moves the tablets previously assigned to that server into the set of unassigned tablets.)
  4. To ensure that a Bigtable cluster is not vulnerable to networking issues between the master and Chubby, the master kills itself if its Chubby session expires.
    (When the master's Chubby session expires, the master kills itself. This does not change the existing assignment of tablets to tablet servers.)

Changes to the set of existing tablets
  1. Initiated by the master: a table is created or deleted, or two existing tablets are merged to form one larger tablet.
    (Changes driven by the master: creating or deleting a table, and merging two existing tablets into a larger one.)
  2. Initiated by a tablet server: an existing tablet is split into two smaller tablets. The tablet server commits the split by recording information for the new tablet in the METADATA table. When the split has committed, it notifies the master.
    (Changes driven by a tablet server: splitting an existing tablet.)
    (The server commits the split by recording the new tablet's information in the METADATA table,)
    (and notifies the master once the split has committed.)

Tablet Serving

The persistent state of a tablet is stored in GFS (both log and data), as illustrated in Figure 5. Updates are committed to a commit log that stores redo records. Of these updates, the recently committed ones are stored in memory in a sorted buffer called a memtable; the older updates are stored in a sequence of SSTables.

(GFS holds the log and the data, which is what makes the tablet state persistent. Updates are committed to a commit log of redo records.)
(The most recent updates live in a sorted in-memory buffer called the memtable; older updates live in a sequence of SSTables.)
(The three tablet-serving operations below refer to Figure 5.)

[Figure 5: tablet representation]

Recovering a tablet
  1. A tablet server reads its metadata from the METADATA table.
    (The tablet server reads the tablet's metadata from the METADATA table.)
  2. This metadata contains the list of SSTables that comprise a tablet and a set of redo points, which are pointers into any commit logs that may contain data for the tablet.
    (The metadata lists the SSTables that make up the tablet, plus a set of redo points that point into commit logs which may contain data for the tablet.)
  3. The server reads the indices of the SSTables into memory and reconstructs the memtable by applying all of the updates that have committed since the redo points.
    (The server loads the SSTable indices into memory and rebuilds the memtable by applying the updates committed since the redo points.)
  4. After that, the recovered tablet locates data through the in-memory SSTable indices and sees recent updates through the rebuilt memtable.
    (Recovery is complete once the SSTable indices are in memory and the committed updates have been replayed into the memtable.)

Write operation
  1. When a write operation arrives at a tablet server, the server checks that it is well-formed.
    (An incoming write is first checked for well-formedness.)
  2. The server also checks that the sender is authorized to perform the mutation. Authorization is performed by reading the list of permitted writers from a Chubby file.
    (The server reads the list of permitted writers from a Chubby file to check that the sender may write.)
  3. A valid mutation is written to the commit log.
    (A valid write is recorded in the commit log.)
  4. After the write has been committed, its contents are inserted into the memtable.
    (Once committed, the contents of the write are inserted into the memtable.)

Read operation
  1. As with a write, the server first checks that the read is well-formed.
  2. As with a write, it then checks that the sender is authorized.
  3. A valid read operation is executed on a merged view of the sequence of SSTables and the memtable.
    (The well-formedness and authorization checks mirror those for writes; the difference is that a read is executed over both the SSTables and the memtable.)
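
The write and read paths above can be sketched with a toy single-tablet model. Everything here is a simplifying assumption rather than Bigtable's implementation: `ToyTablet` is a hypothetical name, keys and values are plain strings, the checks from steps 1 and 2 are omitted, and an "SSTable" is just an in-memory sorted map.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical single-tablet model; well-formedness and authorization
// checks are omitted.
class ToyTablet {
 public:
  // Write path: record the mutation in the commit log, then update the memtable.
  void Write(const std::string& key, const std::string& value) {
    commit_log_.push_back(std::make_pair(key, value));
    memtable_[key] = value;
  }

  // Freezes the current memtable into a new immutable "SSTable" (here just a
  // sorted map kept in memory) and starts an empty memtable.
  void FlushMemtable() {
    sstables_.push_back(memtable_);
    memtable_.clear();
  }

  // Read path: merged view, newest source first. Returns nullptr if absent.
  const std::string* Read(const std::string& key) const {
    auto hit = memtable_.find(key);
    if (hit != memtable_.end()) return &hit->second;
    for (auto table = sstables_.rbegin(); table != sstables_.rend(); ++table) {
      auto found = table->find(key);
      if (found != table->end()) return &found->second;
    }
    return nullptr;
  }

  std::size_t LogSize() const { return commit_log_.size(); }

 private:
  std::vector<std::pair<std::string, std::string>> commit_log_;
  std::map<std::string, std::string> memtable_;
  std::vector<std::map<std::string, std::string>> sstables_;  // oldest first
};
```

Checking the memtable first and then the SSTables from newest to oldest is one simple way to realize a merged view in which the most recent update to a key wins.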

Compactions

Minor compaction
  1. As write operations execute, the size of the memtable increases. When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS.
    (Writes keep growing the memtable. When it reaches a threshold, it is frozen and a new memtable is created.)
    (The frozen memtable is converted into an SSTable and written to GFS.)
  2. This minor compaction process has two goals: it shrinks the memory usage of the tablet server, and it reduces the amount of data that has to be read from the commit log during recovery if this server dies.
    (This process is called a minor compaction. It has two goals: to shrink the tablet server's memory usage,)
    (and to reduce how much of the commit log must be read when a tablet is recovered.)

Merging compaction
  1. Every minor compaction creates a new SSTable. If this behavior continued unchecked, read operations might need to merge updates from an arbitrary number of SSTables.
    (If minor compactions kept producing new SSTables without any limit,)
    (a read might have to merge updates from an arbitrary number of SSTables.)
  2. Instead, we bound the number of such files by periodically executing a merging compaction in the background.
    (So a merging compaction runs periodically in the background to bound the number of files.)
  3. A merging compaction reads the contents of a few SSTables and the memtable, and writes out a new SSTable. The input SSTables and memtable can be discarded as soon as the compaction has finished.
    (It reads a few SSTables and the memtable and writes their contents to a new SSTable. The inputs can be discarded as soon as it finishes.)

Major compaction
  1. A merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction.
    (A merging compaction that rewrites every SSTable into a single one is a major compaction.)
  2. SSTables produced by non-major compactions can contain deletion entries that suppress deleted data in older SSTables. A major compaction, on the other hand, produces an SSTable that contains no deletion information or deleted data.
    (Unlike other SSTables, the one produced by a major compaction contains neither deletion entries nor deleted data.)
  3. These major compactions allow Bigtable to reclaim resources used by deleted data, and also allow it to ensure that deleted data disappears from the system in a timely fashion, which is important for services that store sensitive data.
    (Two benefits: Bigtable can reclaim the resources held by deleted data, and deleted data is purged in a timely fashion.)
    (This matters for services that store sensitive data.)
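
The difference between a merging and a major compaction can be sketched as follows. This is an illustration under assumptions, not the paper's algorithm: `Entry`, `SSTable`, and `Compact` are hypothetical names, tables are in-memory maps, and a deletion is modelled as a boolean marker on an entry.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical entry: either a live value or a deletion marker.
struct Entry {
  std::string value;
  bool deleted;
};
using SSTable = std::map<std::string, Entry>;

// Merges `inputs` (ordered oldest first) into one table; for each key the
// entry from the newest input wins. When `major` is true the inputs are
// assumed to be every SSTable of the tablet, so deletion markers and the
// data they suppress can be dropped.
SSTable Compact(const std::vector<SSTable>& inputs, bool major) {
  SSTable merged;
  for (const SSTable& table : inputs) {
    for (const auto& kv : table) {
      merged[kv.first] = kv.second;  // newer inputs overwrite older ones
    }
  }
  if (major) {
    for (auto it = merged.begin(); it != merged.end();) {
      if (it->second.deleted) {
        it = merged.erase(it);
      } else {
        ++it;
      }
    }
  }
  return merged;
}
```

A non-major compaction has to keep the deletion markers because an older SSTable outside its inputs might still hold the deleted data; only when all SSTables are rewritten together is it safe to drop them.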