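After studying for a little over a month and looking back, I think that understanding the Paxos algorithm requires reading the following three papers: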
The part-time parliament [English] [Chinese]
Paxos Made Simple
Paxos Made Code [English]
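The first two are mostly theory; the third describes an implementation of Paxos.
Reading the third is well worth it: before I read it, I never understood what a Paxos instance actually refers to.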
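Terminology
ballot, proposal: one round of voting.
decree: in the "proposed" state, a candidate value, that is, the value of one of the concurrently issued requests, a value that might be chosen; in the "passed" state, the single decided value, the agreed value, the value written into the ledger.
vote: agreement with the value of a proposal.
promise: a commitment to no longer accept certain proposals.
ledger: records the passed decrees, each with a number and its content.
instance number (instance No, decree number): the corresponding number in the ledger.
ballot number (ballot No): no two ballots may have the same number.
quorum: a subset of the priests; a majority is generally enough.
Leader, president: elected from among the proposers; only the leader issues proposals.
Basic Paxos picks exactly one value out of many competing requests. Multi-Paxos runs the Paxos instance many times and thereby obtains a single agreed sequence of values.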
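Roles in the Paxos algorithm
Proposer: issues proposals (a proposal contains a decree, i.e. a value)
Acceptor: takes part in voting
Learner: learns the decrees that have passed
Acceptors take part in voting passively. The Paxos algorithm lays down the rules that Proposers and Acceptors must follow.
A consensus algorithm must guarantee that only one decree (value) is approved in a single election, that only a decree that has been proposed can be approved, and that only an approved decree can be learned (that is, executed or stored):
1. A decree (value) can be approved only after it has been proposed by a proposer (a decree that has not yet been approved is called a "proposal");
2. In one execution instance of the Paxos algorithm, only one value is approved;
3. Learners can only obtain a value that has been approved (chosen).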
safety condition:If in round M a proposal V is chosen, then every higher-numbered proposal must have value V.
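Mathematical results (β: a set of ballots)
B2(β): the quorums of any two ballots in β have at least one priest in common.
B3(β): for every ballot B in β, if any priest in B's quorum voted in an earlier ballot in β, then the decree of B equals the decree of the latest of those earlier ballots.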
| # | decree | quorum |
| 2 | α | A, B, F, ∆ |
| 5 | β | A, B, F, E |
| 14 | α | B, ∆, E |
| 27 | β | A, F, ∆ |
| 29 | β | B, F, ∆ |
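Condition B2 is the reason a majority works as a quorum. The sketch below is only an illustration, assuming five priests with made-up names and helper functions (`majorities`, `all_pairs_intersect`) that do not come from the papers; it enumerates every majority and checks that any two of them overlap.

```python
from itertools import combinations


def majorities(priests):
    """Return every subset of `priests` that contains more than half of them."""
    need = len(priests) // 2 + 1
    return [set(c)
            for size in range(need, len(priests) + 1)
            for c in combinations(priests, size)]


def all_pairs_intersect(quorums):
    """B2: any two quorums share at least one priest."""
    return all(q1 & q2 for q1, q2 in combinations(quorums, 2))


priests = ["A", "B", "C", "D", "E"]  # hypothetical priest names
quorums = majorities(priests)
print(len(quorums), all_pairs_intersect(quorums))  # prints "16 True"
```

Two sets that each hold more than half of the priests cannot be disjoint, which is why the check succeeds for every pair of majorities.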
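A1: several proposers may initiate ballots, so the sets of ballot numbers they draw from are disjoint: each priest has an infinite set of ballot numbers reserved for its own use.
B1(β), B2(β) and B3(β) are satisfied.
Consistency is guaranteed: once a result has been reached for a given decree number, the result stays the same from then on, because B3 prevents it from being overwritten. Only one agreed decree is ever reached.
Progress is not guaranteed.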
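Preliminary protocol
To make B2 hold: choose the quorum appropriately.
To make B3 hold: before initiating a ballot, priest p must find the decree of MaxVote(b, Q, β), written MaxVote(b, Q, β)_dec. To do so, p must obtain MaxVote(b, q, β) from every q in Q, that is, q's vote in the highest-numbered ballot below b in which q voted.
This is why the reply to a request has to carry the priest's vote in the highest-numbered ballot among the ballots numbered less than b.
Preliminary protocol: steps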
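(2) q replies with LastVote(b, v), where v = MaxVote(b, q, β).
Because β keeps changing, MaxVote(b, Q, β) must stay fixed once p has computed it, so q promises not to vote in any ballot numbered between v_bal (the ballot number of v) and b.
(3) After receiving LastVote(b, v) from every priest in Q (a majority of the priests), p chooses a decree d that satisfies B3(β) and sends BeginBallot(b, d) to every priest in Q.
(5) After receiving Voted(b, q) from every priest q in Q (a majority of the priests), p writes decree d in its ledger and sends Success(d) to every priest.
(6) On receiving Success(d), q records decree d in its ledger.
Steps (3) and (4): choose the value and vote.
Steps (5) and (6): once the ballot has passed, learn the decree.
In the preliminary protocol each priest has to record:
1) the number of every ballot it has initiated
2) every vote it has cast
3) every LastVote message it has sent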
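Step (3), choosing a decree that satisfies B3, can be sketched as follows. This is a minimal illustration rather than code from any of the papers: the `Vote` type, the `choose_decree` function and the sample replies are hypothetical, and a priest that has not voted in any ballot below b is assumed to report `None`.

```python
from typing import NamedTuple


class Vote(NamedTuple):
    ballot: int   # number of the ballot in which the vote was cast
    decree: str   # decree that was voted for


def choose_decree(last_votes, own_decree):
    """Pick a decree satisfying B3 from the LastVote replies of a quorum Q.

    `last_votes` maps each priest in Q to the Vote it reported, or to None
    if that priest has not voted in any ballot numbered below b.
    """
    cast = [v for v in last_votes.values() if v is not None]
    if not cast:
        # Nobody in Q has voted before, so any decree is allowed.
        return own_decree
    # Otherwise the decree of the highest-numbered earlier vote must be reused.
    return max(cast, key=lambda v: v.ballot).decree


replies = {"A": Vote(2, "alpha"), "B": Vote(5, "beta"), "C": None}
print(choose_decree(replies, "gamma"))  # prints "beta"
```

The proposer's own decree is used only when no member of the quorum reports an earlier vote.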
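Basic Protocol
1) lastTried[p]: the number of the last ballot p initiated.
2) prevVote[p]: p's vote with the largest ballot number, i.e. the vote cast in step 4.
3) nextBal[p]: the largest ballot number for which p has answered a prepare (NextBallot) request; this is also the number p has promised, set in step 2.
lastTried[p]: the number of the last ballot that p initiated, or negative infinity if there is none.
prevVote[p]: the vote cast by p in the highest-numbered ballot in which p voted.
nextBal[p]: the largest ballot number b among all LastVote(b, v) messages sent by p.
(1) p chooses a ballot number b greater than lastTried[p], sets lastTried[p] to b, and sends NextBallot(b).
(2) On receiving a NextBallot(b) message with b greater than nextBal[q], priest q sets nextBal[q] to b and sends a LastVote(b, v) message to p, where v equals prevVote[q].
This is a stronger promise: q will no longer vote in any ballot numbered less than b.
(3) p sends a BeginBallot(b, d) message to every priest in Q.
(4) q votes, sets prevVote[q] to this vote, and sends Voted(b, q).
(5) After receiving Voted(b, q) from every q in Q, p records d in its ledger and sends Success(d) to every priest.
(6) On receiving Success(d), a priest writes the decree into its ledger.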
Basic Paxos
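Phase 1b: the Acceptor returns Promise(Ballot id, Accepted Id, Accepted Value) or NACK.
Phase 2a: after receiving Promises from a majority, the Proposer chooses the value V according to B3 and sends Accept(Ballot id, V).
Phase 2b: the Acceptor sends Accepted(Ballot id) or UnAccepted; this is its vote.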
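The phases above can be sketched as a small in-memory simulation. Everything here is an assumption made for illustration: the `Acceptor` class, the `run_round` helper, the tuple-shaped messages and the use of -1 as "nothing promised yet" are not taken from the papers or from any library, and there is no networking, failure handling or persistence.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1          # highest ballot id promised so far
        self.accepted_id = None     # ballot id of the accepted value, if any
        self.accepted_value = None

    def on_prepare(self, ballot_id):
        """Phase 1b: promise, or NACK if a higher ballot was already promised."""
        if ballot_id > self.promised:
            self.promised = ballot_id
            return ("promise", ballot_id, self.accepted_id, self.accepted_value)
        return ("nack", self.promised)

    def on_accept(self, ballot_id, value):
        """Phase 2b: vote for the value unless a higher ballot was promised."""
        if ballot_id >= self.promised:
            self.promised = ballot_id
            self.accepted_id = ballot_id
            self.accepted_value = value
            return ("accepted", ballot_id)
        return ("unaccepted", self.promised)


def run_round(acceptors, ballot_id, value):
    """One Phase 1 + Phase 2 attempt; returns the chosen value or None."""
    majority = len(acceptors) // 2 + 1
    replies = [a.on_prepare(ballot_id) for a in acceptors]
    promises = [r for r in replies if r[0] == "promise"]
    if len(promises) < majority:
        return None
    # Phase 2a: reuse the value of the highest-numbered accepted ballot, if any.
    prior = [(r[2], r[3]) for r in promises if r[2] is not None]
    if prior:
        value = max(prior, key=lambda p: p[0])[1]
    votes = [a.on_accept(ballot_id, value) for a in acceptors]
    accepted = [v for v in votes if v[0] == "accepted"]
    return value if len(accepted) >= majority else None


acceptors = [Acceptor() for _ in range(3)]
first = run_round(acceptors, 1, "x")    # "x" is chosen
second = run_round(acceptors, 2, "y")   # a later ballot must keep "x"
stale = run_round(acceptors, 1, "z")    # ballot 1 is now too old: None
```

The second round shows the safety property in action: a higher-numbered ballot that proposes a different value still ends up carrying the value that was already chosen.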
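Multi-Paxos
Each run of Paxos chooses exactly one agreed value (one instance).
The proposer starts the i-th Paxos instance:
it sends prepare(instance id, ballot id)
the acceptors reply with promise(instance id, ballot id, acc id, acc val)
it sends accept(instance id, ballot id, V)
the acceptors reply with accepted/unaccepted
Once accepted has been received from a majority, the value of instance i is fixed as V.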
…
Paxos Made Simple
P1: An acceptor must accept the first proposal that it receives.
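A proposal is chosen when it has been accepted by a majority of acceptors. Here "accepted" means that the acceptor has accepted the proposal in response to an accept request (phase 2b); a promise made in phase 1b does not count as an acceptance.
A value is chosen when a single proposal with that value has been chosen.
Proposal received and promised (phase 1b): the acceptor only guarantees that it will not respond to proposals whose ID is smaller than this one; if a proposal with a larger ID shows up later, the acceptor rejects this proposal in the following phase.
Decree approved (chosen): the proposal has been accepted by every member of some majority of the acceptors.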
Unique value( by induction on the proposal number )
P2: If a proposal with value v is chosen, then every higher-numbered proposal that is chosen has value v.
-strengthen->
P2a: If a proposal with value v is chosen, then every higher-numbered proposal accepted by any acceptor has value v.
-strengthen->
P2b: If a proposal with value v is chosen, then every higher-numbered proposal issued by any proposer has value v.
Suppose p wants to issue a proposal numbered n.
If p can be certain that no proposal numbered n' < n has been chosen, then p can propose any value!
If not, p should propose the value of the highest-numbered proposal among all accepted proposals numbered less than n.
P2c: For any v and n, if a proposal with value v and number n is issued, then there is a set S consisting of a majority of acceptors such that either:
No acceptor in S has accepted any proposal numbered less than n, or
v is the value of the highest-numbered proposal among all proposals numbered less than n accepted by the acceptors in S.
What does an acceptor need to remember?
An acceptor needs only remember the highest numbered proposal it has accepted and the number of the highest-numbered prepare request to which it has responded.
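Paxos summary
Learning progresses asynchronously. An instance may not yet have been written into a given Learner's ledger, so different learners are at different points.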
Q&A
1) http://stackoverflow.com/questions/5850487/questions-about-paxos-implementation/10151660#10151660
An instance is the algorithm for choosing one value.
A round refers to a proposer's single attempt of a Phase 1 + Phase 2. A node can have multiple rounds in an instance of Paxos. A round id is globally unique per instance across all nodes. This is sometimes called proposal number.
Round ids (aka proposal numbers) should be increasing and must be unique per instance across all nodes.
One scheme: roundId = i*M + index[node], where i is the ith round this node is starting (that is, i is unique per node per Paxos instance, and is monotonically increasing).
Instance reserved: a Promise carries a value.
Instance closed: the value has been accepted by a majority.
An acceptor keeps the state <iid, B, V, VB>, so it knows whether or not it has made a decision for the instance.
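The roundId scheme quoted above can be checked with a few lines of code. The quoted answer does not define M in this excerpt; the sketch assumes M is the number of nodes and that each node has a distinct index in the range 0 to M-1. The function name `round_id` is made up for this example.

```python
def round_id(i, node_index, num_nodes):
    """roundId = i * M + index[node], assuming M = num_nodes.

    `i` counts the rounds this node has started in the instance; `node_index`
    must be a distinct integer in [0, num_nodes) for every node.
    """
    if not 0 <= node_index < num_nodes:
        raise ValueError("node_index must be in [0, num_nodes)")
    return i * num_nodes + node_index


# Three nodes, each starting rounds 0..3 in the same instance.
ids = [round_id(i, n, 3) for i in range(4) for n in range(3)]
print(sorted(ids) == list(range(12)))  # prints True: no two rounds collide
```

Because every node adds its own index to a multiple of M, two nodes can never produce the same id, and a node's ids grow as i grows.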
2) http://stackoverflow.com/questions/10791825/implementation-of-paxos-algorithm
Q: If there is only one "distinguished" proposer who is able to issue proposals, then what is the difference between the Paxos algorithm and the 2-phase commit algorithm?
A: The Distinguished Proposer is an optimization and is not a requirement for the algorithm. The Distinguished Proposer reduces the contention of two proposers leap-frogging prepare/accept messages, which is important if you want the instance to finish. In this model, a node forwards the request to the Distinguished Proposer instead of proposing for itself. If it thinks the Distinguished Proposer is dead then it just proposes for itself. (It does not have to be/it can never be 100% confident the Distinguished Proposer is dead.)
3) Paxos is expensive enough that many systems using it use hints as well, or use it only for leader election, or something. However, it does provide guaranteed consistency in the presence of failures.
4) Making sure that only one Leader issues proposals is an optimization that reduces the number of messages.
The parliamentary protocol [aka Multi-Paxos] used a separate instance of the complete Synod protocol [aka Paxos] for each decree number. However, a single president [aka proposer/leader] was selected for all these instances, and he performed the first two steps of the protocol just once.
Paxos Made Live:
A Multi-Paxos algorithm may be designed to pick a coordinator for long periods of time, trying not to let the coordinator change.
With this optimization, the Paxos algorithm only requires a single write to disk per Paxos instance on each replica, executed in parallel with each other.
When several proposers issue proposals at the same time, they can race with each other or even livelock, so liveness is lost. In practice a Leader is therefore elected from among them, and only the Leader issues proposals.
You have a way of determining which node is the Leader per Paxos instance (required by Multi-Paxos). The Leader is able to change from one Paxos instance to the next.
Q: As for protocol violation, I think it's already a violation of the Basic protocol to send Accept!(N+1,V2) since this wouldn't have happened with Basic Paxos (you'd have sent Prepare!(N+1) first, would have received V1 from acceptors in return and would have been forced to send Accept! with V1 afterwards).
A: From the point of view of an Acceptor, it can receive an Accept without having ever seen the corresponding Prepare. Yes it is in violation of Basic Paxos, but not in violation of Multi-Paxos.
Prepare messages have form (instance, proposal_num)
Propose messages have form (instance, proposal_num, proposal_val)
In Phase 1a, there is no need to send the value to agree on.
7) http://www.cnblogs.com/chen77716/archive/2011/01/27/2130804.html
You may wonder whether only one value can ever be chosen from start to finish. "Chosen" here refers to a single election, not to the whole lifetime of the system: Paxos can be run many times, and each run chooses exactly one value.
8) Do prepare messages have to be sent to all acceptors?
A coordinator can send a phase 1a or 2a message only to a majority of acceptors that it believes are nonfaulty. It can later send the message to other acceptors if one of the original acceptors does not respond. (Fast Paxos)
9) On submitting requests (Paxos Made Code):
To submit values, clients connect to the current leader. In case the coordinator crashes, they connect to the next elected leader.
Alternatively, clients can send their values by multicasting them on the "leader network". A coordinator crash is completely transparent to them in this case.
The leader is busy broadcasting client values as fast as possible. When received, those values are temporarily stored in a pending list.
10) Multi-Paxos
When does a Paxos instance count as closed?
If a majority of acceptors accepted the value, it is safe to assume that the instance is closed with v permanently associated to it. Nothing can happen in the system that will change this fact, therefore learners can deliver the value to their clients.
Learners’ state consists of a window of instances.
Lower bound of this window is the lowest instance not yet closed;
Higher bound must be chosen
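The closing rule in item 10 can be sketched from the learner's side. This is a hypothetical illustration, not libpaxos code: the `Learner` class, its method names and the numbering of instances from 0 are all assumptions. It counts distinct acceptors per (instance, ballot) and closes the instance once a majority has accepted.

```python
from collections import defaultdict


class Learner:
    def __init__(self, num_acceptors):
        self.majority = num_acceptors // 2 + 1
        self.acks = defaultdict(set)  # (instance, ballot) -> ids of acceptors
        self.closed = {}              # instance -> value it was closed with

    def on_accepted(self, acceptor_id, instance, ballot, value):
        """Record one accepted message; return the value once it is closed."""
        if instance in self.closed:
            return self.closed[instance]
        voters = self.acks[(instance, ballot)]
        voters.add(acceptor_id)       # a set ignores duplicate deliveries
        if len(voters) >= self.majority:
            self.closed[instance] = value
            return value
        return None

    def lowest_open_instance(self):
        """Lower bound of the window: the lowest instance not yet closed."""
        i = 0                         # assumption: instances start at 0
        while i in self.closed:
            i += 1
        return i


learner = Learner(num_acceptors=3)
r1 = learner.on_accepted(0, 1, 7, "v")  # one acceptor: not closed yet
r2 = learner.on_accepted(0, 1, 7, "v")  # duplicate from the same acceptor
r3 = learner.on_accepted(1, 1, 7, "v")  # second acceptor: majority reached
```

Instance 1 is closed here while instance 0 is still open, so the lower bound of the window stays at 0 until instance 0 closes as well.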
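11) Learning:
Before a president can propose any new decree, he must learn, from every legislator in some majority set, the decrees that legislator has already voted for. Any decree that has already passed must have been voted for by at least one legislator in every majority set.
A Learner can learn by having the Acceptors broadcast their Accepted messages: once it has received Accepted notifications from a majority, it learns the value. Alternatively, the Acceptors can send their Accepted messages only to the Leader; after the Leader has received them from a majority, it sends a Commit to the Learners to tell them that the value of the instance is decided and can be written into the ledger, which is what learning means.
12) Crashes:
If a majority of the Acceptors crash and the proposer crashes as well, the requests of a new proposer cannot get replies from a majority, so it has to keep increasing the ballot number until a majority of the Acceptors are back online. Meanwhile, the new proposer cannot change the value of that instance, because the Acceptors have it on record.
A crash of a majority of the machines does not cause inconsistency; it only delays the result. Live if a quorum of processes are OK (up and connected with each other)
13) Message formats and state (Paxos Made Code):
Acceptor state:
the acceptor keeps a state record, consisting of < iid, B, V, VB >, where B is an integer, the highest ballot number that was accepted or promised for this instance, V is a client value and VB is the ballot number corresponding to the accepted value. The three fields are initially empty.
Promise: It sets B = b and answers with a promise message, consisting of < i, b, V, VB >, where V and VB are null if no value was accepted yet.
Accepted: It sets B = b, V = v, VB = b and answers with a learn message, consisting of < i, b, v >.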
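The state record and the two replies described in item 13 can be sketched like this. It is not the libpaxos implementation: the `Record` and `InstanceAcceptor` names, the tuple-shaped messages and the choice to return `None` for a request with a ballot that is too low are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Record:
    iid: int                   # instance id
    B: Optional[int] = None    # highest ballot promised or accepted
    V: Any = None              # accepted client value, if any
    VB: Optional[int] = None   # ballot in which V was accepted


class InstanceAcceptor:
    def __init__(self):
        self.records = {}      # iid -> Record; instances are independent

    def _record(self, iid):
        return self.records.setdefault(iid, Record(iid))

    def on_prepare(self, iid, b):
        rec = self._record(iid)
        if rec.B is None or b > rec.B:
            rec.B = b
            return ("promise", iid, b, rec.V, rec.VB)
        return None            # assumption: stale requests get no reply

    def on_accept(self, iid, b, v):
        rec = self._record(iid)
        if rec.B is None or b >= rec.B:
            rec.B, rec.V, rec.VB = b, v, b
            return ("learn", iid, b, v)
        return None


acc = InstanceAcceptor()
p1 = acc.on_prepare(1, 5)      # ("promise", 1, 5, None, None)
l1 = acc.on_accept(1, 5, "v")  # ("learn", 1, 5, "v")
p2 = acc.on_prepare(1, 9)      # ("promise", 1, 9, "v", 5)
p3 = acc.on_prepare(2, 1)      # a different instance starts out empty
```

Keeping one record per instance id is what lets many instances run side by side without interfering with each other.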
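14) What if a majority is not reached? (Paxos Made Code)
If the promise phase does not get a majority, increase the ballot number and resend prepare.
If the accept phase does not get a majority, increase the ballot number and resend prepare, i.e. start over from the beginning.
Repeat until the instance is closed.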
15) Persistency
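The state must be written to disk before the message is sent; otherwise an acceptor could go back on its promise after a restart. If the state has been written but the message was never sent or got lost, only progress is affected.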
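A minimal sketch of the "write first, then send" rule follows. The function name `persist_then_send`, the JSON file layout and the use of a list as a stand-in for the network are assumptions made for illustration; only standard-library calls are used.

```python
import json
import os
import tempfile


def persist_then_send(path, state, send, message):
    """Write `state` durably to `path`, and only then hand `message` to `send`."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())   # force the bytes to disk before continuing
    os.replace(tmp, path)      # swap the new state in atomically
    send(message)              # a crash before this line only costs progress


sent = []                      # stand-in for the network
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "acceptor.json")
    persist_then_send(path, {"promised": 7}, sent.append, ("promise", 7))
    with open(path) as f:
        recovered = json.load(f)   # what a restarted acceptor would read
```

If the process dies after the write but before the send, the restarted acceptor still remembers the promise, so the worst outcome is a proposer that has to retry.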
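16) What if nobody has learned the value? If a majority really did accept it, then at least one of those acceptors will be part of the majority in the next round, because without any of them a majority cannot be formed. The value should therefore still be chosen. Otherwise we get a contradiction: some follower would have a larger accepted Id, in which case this value would not have been the one chosen in the first place.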
Update: 2013.01.05
=======================
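17) Why might an instance in libpaxos already have a value before phase 2a?
If a timeout occurs in p1b or p2a, the round has to start over. If it is p2a that timed out, though, the instance may in fact already have a value, that is, it may already be bound to some request. To avoid losing that request, one option is to put it back in the pending queue. The problem is that the request may be the reserved value of the current instance, meaning that a majority of acceptors have already accepted it, while the next or a later instance picks the same request from the pending queue, so the request shows up in several instances.
So the request should not be put back in the pending queue; it should stay bound to the instance. Only when the reserved value turns out to be a different value should the request go back to the pending queue.
References: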
http://blog.csdn.net/techq/article/details/7337210
