共识算法(2) - Practical Byzantine Fault Tolerance论文学习笔记

Service Properties

  1. 算法的适用范围

Our algorithm can be used to implement any deterministic replicated service with a state and some operations.

  • deterministic :确定性服务,即相同的参数执行相同操作,每次得到的结果都是一致的,不存在随机性(the execution of an operation in a given state and with a given set of arguments must always produce the same result)
  • replicated : 多副本
  • a state and some operations : 服务有相同的初始状态及相同的操作,保证当按相同顺序执行完相同的操作,各个副本的即时状态是完全一致的 (must start in the same state)
  1. 在异步系统中要容忍ff个恶意节点,则至少需要3f+13f+1个节点才能提供safety and liveness(Asynchronous Consensus and Broadcast Protocols这篇论文中证明)。
  • safety : 副本服务满足线性一致性,像在中心化服务中同时原子性的执行操作。不受恶意client数量影响,恶意client的操作可以通过被正常client感知,利用access control来限制。
  • liveness : 利用同步提供liveness(论文证明)。只要满足3f+13f+1个节点,可以保证client持续重传,并且到达destination的传输时间没有无限的增长情况下,client最终会收到回复。
  1. 算法不提供隐私容错性,即隐私信息会泄漏给恶意节点

The algorithm does not address the problem of fault- tolerant privacy: a faulty replica may leak information to an attacker.

The Algorithm

0. 术语

  • view :类似raft的term,view number在变更时+1
  • primary :在一个view中只有单独一个节点pp成为primary,选择的方式为p=vmodRp = v mod |R| (pp为节点的id,vv是当前view number,R=3f+1|R| = 3f+1)
  • backup : 一个view内除primary外的其他所有节点,与primary都称为replica
    算法的粗略过程:
  1. A client sends a request to invoke a service operation to the primary
  2. The primary multicasts the request to the backups
  3. Replicas execute the request and send a reply to the client
  4. The client waits for f+1f + 1 replies from different replicas with the same result; this is the result of the operation.

1. The Client

client cc点对点发送<REQUEST,o,t,c>σc<REQUEST,o,t,c>_{\sigma_c}到primary,然后由primary广播到所有backup。oo是操作,tt是时间(如本地机器时间),用来保证exactly-once语义,<>σc<>_{\sigma_c}表示对消息签名(下同)。replica ii相应的回复<REPLY,v,t,c,i,r>σi<REPLY,v,t,c,i,r>_{\sigma_i}vv是当前view number(存在view-change情况),rr是执行结果,然后各自对消息签名。client可以根据回复的view number跟踪当前view,相应的得知当前primary。client接收到f+1f+1个相同结果的有效回复(校验签名),则结果有效,完成。

如果client在有效时间内没法收到f+1f+1个有效结果,则广播request到所有的replica。如果该request已经被处理过,replica就简单的重新发送reply到client。如果没有,backup会将改request重定向到primary。如果primary没有再次将该request广播,则会被认为是异常节点,最终触犯view change。

2. Normal-Case Operation

replica记录的状态

The state of each replica includes the state of the service, a message log containing messages the replica has accepted, and an integer denoting the replica’s current view.

整个流程分为三阶段,pre-prepare, prepare, commit (pre-prepare, prepare用于保持同一个view内request的有序性,prepare, commit用于保证跨view的reqeust的有序性)。

  1. pre-prepare 阶段
    primary广播<<PREPREPARE,v,n,d>σp,m><<PRE-PREPARE,v,n,d>_{\sigma_p}, m>到所有的backup。nn是primary分配给该消息的序号,全局递增,mm是client的请求消息,dd是消息的digest摘要。然后签名。

    如果满足以下条件,则backup会接受该pre-prepare请求,并且进入prepare阶段(The last condition prevents a faulty primary from exhausting the space of sequence numbers by selecting a very large one.):

    • the signatures in the request and the pre-prepare message are correct and dd is the digest for mm;
    • it is in view vv;
    • it has not accepted a pre-prepare message for view vv and sequence number nn containing a different digest;
    • the sequence number in the pre-prepare message is between a low water mark, hh, and a high water mark, HH.
  2. prepare 阶段
    backup广播<PREPARE,v,n,d,i>σi<PREPARE,v,n,d,i>_{\sigma_i}消息到所有replica(包含primary和backup),ii为该节点id。并且将<PREPREPARE><PRE-PREPARE><PREPARE><PREPARE>消息保存到本地log。replica接收到prepare消息后,校验签名正确,vv是当前view,nn在水位hhHH之间,摘要dd和本地保存的PREPREPAREPRE-PREPARE消息里的dd相等。当replica ii收到2f2f个这样的PREPAREPREPARE消息后,就达到了prepared(m,v,n,i)prepared(m,v,n,i) 状态,进入commit阶段。

  3. commit 阶段
    replica ii广播<COMMIT,v,n,D(m),i>σi<COMMIT,v,n,D(m),i>_{\sigma_i}消息到所有replica。replica收到后校验通过后,写入log。如果replica ii达到了prepared(m,v,n,i)prepared(m,v,n,i) 状态,并且收到2f+12f+1个(包括自身的commit消息)校验通过的commit(vv,nn,dd相同且与log里记录的prepare吻合),则达到committedlocal(m,v,n,i)committed-local(m,v,n,i)状态,并且其当前状态是消息序号小于nn的之前所有消息按序执行后的最新最全结果,则该replica执行mm里指定的操作。操作的结果返回到client。

    The commit phase ensures the following invariant: if committedlocal(m,v,n,i)committed-local(m,v,n,i) is true for some non-faulty then committed(m,v,n)committed(m,v,n) is true. This invariant and the view-change protocol described in Section 4.4 ensure that non-faulty replicas agree on the sequence numbers of requests that commit locally even if they commit in different views at each replica. Furthermore, it ensures that any request that commits locally at a non-faulty replica will commit at f+1f+1 or more non-faulty replicas eventually.

    这里指的是,达到committed(m,v,n)committed(m,v,n)只能保证至少f+1f+1个正常节点被最终commit,而并不是所有正常节点都执行。这就造成正常节点间都状态不一致,需要最终同步状态。或者使用上文提到Section 4.4里提到都view-change场景下重走commit流程最终达成状态一致。[引用2:Fabric0.6里的最终状态同步方案,3:介绍实时全正常节点同步的方案]

3. Garbage Collection

replica周期性对其本地状态生成快照,称为checkpoint。当replica ii生成checkpoint同时,广播<CHECKPOINT,n,d,i>σi<CHECKPOINT,n,d,i>_{\sigma_i}消息到其他所有replica,其中,nn是生成该状态执行的最后的请求消息的序号,dd是该状态的摘要digest。replica接收checkpoint消息,并且记录到log里,当收到2f+12f+1个相同的checkpoint消息(nn,dd)相同,则该checkpoint变成stable checkpoint。此时,小于等于nn的消息(preprepare,prepare,commit)(pre-prepare, prepare, commit)都会被丢弃,同时,之前的stable checkpoint也被丢弃。

实际上,一个replica会同时持有这样几个状态副本:一个stable checkpoint,若干个未达到稳定状态的checkpoint(生成后暂未收集到2f+12f+1个该状态的checkpoint消息),以及当前状态。stable checkpoint的最后的消息序号nn就是低水位hh,高水位H=h+kH = h + k(where is big enough so that replicas do not stall waiting for a checkpoint to become stable).

4. View Changes

client在有效时间内没法收到f+1f+1个有效结果,则广播request到所有的replica。backup收到后,若该请求从未执行,则转发给primary,并且启动timer,在timer到期前若没有收到primary发出的pre-prepare,则判断该primary异常,发出view-change消息。backup ii, <VIEWCHANGE,v+1,n,C,P,i>σi<VIEW-CHANGE,v+1,n,C,P,i>_{\sigma_i}v+1v+1是新的view number,nn是该backup的stable checkpoint ss的最后消息序号,CC是证明ss2f+12f+1个checkpoint消息,PPPmPm的集合,PmPm是达到prepared(m,v,n,i)prepared(m,v,n,i) 状态的消息(包含pre-prepare消息和2f2f个来自其他replica的matching消息,其中消息mm的序号大于nn)。

v+1v+1的primary(通过p=v+1modRp = (v+1)mod |R|计算)收到2f2f个有效view-change消息后,广播<NEWVIEW,v+1,V,O>σp<NEW-VIEW,v+1,V,O>_{\sigma_p}VV是收到的和primary自身的view-change消息的集合,OO是pre-prepare消息的集合。计算如下,比较复杂

  1. The primary determines the sequence number minsmin-s of the latest stable checkpoint in VV and the highest sequence number maxsmax-s in a prepare message in VV.
  2. The primary creates a new pre-prepare message for view v+1v+1 for each sequence number nn between minsmin-s and maxsmax-s.
    There are two cases: (1) there is at least one set in the component of some view-change message in with sequence number nn, or (2) there is no such set.
    In the first case, the primary creates a new message <PREPREPARE,v+1,n,d>σp<PRE-PREPARE,v+1,n,d>_{\sigma_p} where dd is the request digest in the pre-prepare message for sequence number nn with the highest view number in VV.
    In the second case, it creates a new pre-prepare message <PREPREPARE,v+1,n,d(null)>σp<PRE-PREPARE,v+1,n,d(null)>_{\sigma_p} where d(null)d(null) is the digest of a special null request; a null request goes through the protocol like other requests, but its execution is a no-op. (Paxos [18] used a similar technique to fill in gaps.)

也就是说,primary从latest stable checkpoint重放之后所有达到prepared(m,v,n,i)prepared(m,v,n,i)状态的消息(mins<n<=maxsmin-s<n<=max-s),重发pre-prepare。replica会对这些消息重走流程,prepare和commit阶段,但是不再重新执行请求,因为在上一个view内,对这些达到prepared(m,v,n,i)prepared(m,v,n,i)状态的消息已经执行过。

同时,replica记录收到的所有replica的view-change里最新的(nn最大)stable checkpoint记录到log里,并且对缺失的消息mm和最新的stable checkpoint(自身的stable checkpoint不是全局最新的stable checkpoint情况下)从其他的replica同步。

5. Correctness(重点)

  1. Safety

    if prepared(m,v,n,i)prepared(m,v,n,i) is true then prepared(m,v,n,j)prepared(m',v,n,j) is false for any non-faulty replica jj(including i=ji=j) and any mm' such that D(m)!=D(m)D(m')!=D(m)

  • 这是因为,到达prepare(m)prepare(m)意味着2f+12f+1个replica选择发送prepare(m)prepare(m),假设异常节点个数为k(k<=f)k(k<=f), 则至少有2f+1k2f+1-k个正常replica发送prepare(m)prepare(m),同时也意味着至少有2f+1k2f+1-k个replica发送prepare(m)prepare(m')。由于(2f+1k)2>3f+1k(2f+1-k)*2>3f+1-k,说明这两部分正常节点有重叠,则至少有一个正常节点发送两个不同消息,矛盾。

    The view-change protocol ensures that non-faulty replicas also agree on the sequence number of requests that commit locally in different views at different replicas. committed(m,v,n)committed(m,v,n) is true if and only if prepared(m,v,n,i)prepared(m,v,n,i) is true for all ii in some set of f+1f+1 non-faulty replicas. if committedlocal(m,v,n,i)committed-local(m,v,n,i) is true for some non-faulty ii then committed(m,v,n)committed(m,v,n) is true.

  • 这是因为,到达committedlocal(m,v,n,i)committed-local(m,v,n,i)状态,则意味着至少收到f+1f+1个正常replica的commit消息,也就是至少有f+1f+1个正常节点到达preparedm,v,n,i)prepared(m,v,n,i)状态。而正常节点只有在接收到new-view消息才会从view vv进入view v+1v+1,而new-view消息里包含2f+12f+1个view-change消息。与上面推导方式相似,则这两个集合必定有一个交集包含正常节点k。若然mm已经包含在kk的stable checkpont内,则最终被同步到所有正常节点;若然没有,则包含kk的view-change消息内,然后在view v+1v+1中被重新执行三阶段流程,最终也达到提交一致。

  1. Liveness
  • replica广播view-change进入v+1v+1,等待2f+12f+1个view-change消息后启动计时器TT。如果在TT到时前没有收到new-view(新的primary是异常节点),则进入v+2v+2,计时器时间扩大成2T2T,这是为了避免view改变的过快。
  • 在定时器过期前,如果收到f+1f+1个view-change消息,并且比自身的view number大,则发送这些大view number中最小的一个
  • f+1f+1个view-change消息才能触发view变更,而且primary的轮流选择方案也限定连续最多f轮选择就能选举到正常的primary。

Optimization

1. Reducing Communication

  1. 避免传输large result

A client request designates a replica to send the result; all other replicas send replies containing just the digest of the result.(指定单个replca回复完整result,其他回复该result的摘要用以校验)。 If the client does not receive a correct result from the designated replica, it retransmits the request as usual, requesting all replicas to send full replies.

  1. 减少流程步骤,replica提前返回reply。当replica收到足够prepare消息,达到了prepared(m,v,n,i)prepared(m,v,n,i) 状态,将要进入commit阶段。这时,如果小于该消息序号nn的之前的消息都已经执行生成了当前状态,则replica执行该请求,并且直接返回到client,同时继续后续的commit阶段。
  • The client waits for 2f+12f + 1 matching tentative replies. If it receives this many, the request is guaranteed to commit eventually. Otherwise, the client retransmits the request and waits for f+1f + 1 non-tentative replies.
  • A request that has executed tentatively may abort if there is a view change and it is replaced by a null request. In this case the replica reverts its state to the last stable checkpoint in the new-view message or to its last checkpointed state (depending on which one has the higher sequence number).
  1. 优化read-only操作。对于read-only请求,client直接发送到所有replica,replica直接返回结果到client。client收集2f+12f+1个相同结果的回复,否则按照正常流程重新发起请求。

They(replica) send the reply only after all requests reflected in the tentative state have committed; this is necessary to prevent the client from observing uncommitted state.

题外话

所有的replica间需要全联接并且多次交互通信,系统中这些消息随着节点数指数增长。单次请求的最少消息数为1+3f+3f(3ff)+(3ff+1)(3f+1)+3f11 + 3f + 3f(3f-f) + (3f-f+1)(3f+1) + 3f-1.

pBFT— Understanding the Consensus Algorithm

私货

  1. 为什么需要至少3f+13f+1个节点?
    首先最后达成共识的正常节点数需要大于ff,否则这f个异常节点就可以达成错误的共识。即至少需要f+1f+1个正常节点达成共识。而达成共识时,有可能给最后共识投票的包含这f个异常节点,所以需要至少2f+12f+1个节点投票达成共识,这样可以保证其中至少含有f+1f+1个正常节点。(例如P为异常节点向不同的正常节点提议不同的值,这也是造成上文提到3f+13f+1个节点最后达成共识时只能保证多于f+1f+1个)。而最后的共识只能有一个,即达成共识的正常节点应该超过所有正常节点数的过半,否则ff个异常节点可以重复投票导致最后正常节点分裂成两组以上各自达成共识。因此,实际上最后的阈值T>=lower((Rf)/2)+1+fT>=lower((R-f)/2)+1+f,这样就保证了correctness正确性。接下来考虑liveness可用性。因为异常节点有可能不参与投票(P为正常节点),需要保证正常节点总数应该大于阈值,即Rf>=TR-f>=T

References

[1] M. Castro and B. Liskov. Practical Byzantine Fault Tolerance. Proceedings of the Third Symposium on Operating Systems Design and Implementation, New Orleans, USA, February 1999

[2] 区块链PBFT共识算法节点主动恢复设计与实现
[3] 拜占庭将军问题深入探讨