共识算法(2) - Practical Byzantine Fault Tolerance论文学习笔记

Service Properties

算法的适用范围

Our algorithm can be used to implement any deterministic replicated service with a state and some operations.

deterministic ：确定性服务，即相同的参数执行相同操作，每次得到的结果都是一致的，不存在随机性（the execution of an operation in a given state and with a given set of arguments must always produce the same result）
replicated : 多副本
a state and some operations : 服务有相同的初始状态及相同的操作，保证当按相同顺序执行完相同的操作，各个副本的即时状态是完全一致的 (must start in the same state)

在异步系统中要容忍 $f$ 个恶意节点，则至少需要 $3f+1$ 个节点才能提供safety and liveness（Asynchronous Consensus and Broadcast Protocols这篇论文中证明）。

safety : 副本服务满足线性一致性，像在中心化服务中同时原子性的执行操作。不受恶意client数量影响，恶意client的操作可以通过被正常client感知，利用access control来限制。
liveness : 利用同步提供liveness（论文证明）。只要满足 $3f+1$ 个节点，可以保证client持续重传，并且到达destination的传输时间没有无限的增长情况下，client最终会收到回复。

算法不提供隐私容错性，即隐私信息会泄漏给恶意节点

The algorithm does not address the problem of fault- tolerant privacy: a faulty replica may leak information to an attacker.

The Algorithm

0. 术语

view ：类似raft的term，view number在变更时+1
primary ：在一个view中只有单独一个节点 $p$ 成为primary，选择的方式为 $p = v mod |R|$ ( $p$ 为节点的id， $v$ 是当前view number， $|R| = 3f+1$ )
backup ：一个view内除primary外的其他所有节点，与primary都称为replica
算法的粗略过程：

A client sends a request to invoke a service operation to the primary

The primary multicasts the request to the backups

Replicas execute the request and send a reply to the client

The client waits for $f + 1$ replies from different replicas with the same result; this is the result of the operation.

1. The Client

client $c$ 点对点发送 $<REQUEST,o,t,c>_{\sigma_c}$ 到primary，然后由primary广播到所有backup。 $o$ 是操作， $t$ 是时间（如本地机器时间），用来保证exactly-once语义， $<>_{\sigma_c}$ 表示对消息签名（下同）。replica $i$ 相应的回复 $<REPLY,v,t,c,i,r>_{\sigma_i}$ 。 $v$ 是当前view number（存在view-change情况）， $r$ 是执行结果，然后各自对消息签名。client可以根据回复的view number跟踪当前view，相应的得知当前primary。client接收到 $f+1$ 个相同结果的有效回复（校验签名），则结果有效，完成。

如果client在有效时间内没法收到 $f+1$ 个有效结果，则广播request到所有的replica。如果该request已经被处理过，replica就简单的重新发送reply到client。如果没有，backup会将改request重定向到primary。如果primary没有再次将该request广播，则会被认为是异常节点，最终触犯view change。

2. Normal-Case Operation

replica记录的状态

The state of each replica includes the state of the service, a message log containing messages the replica has accepted, and an integer denoting the replica’s current view.

整个流程分为三阶段，pre-prepare, prepare, commit (pre-prepare, prepare用于保持同一个view内request的有序性，prepare, commit用于保证跨view的reqeust的有序性）。

pre-prepare 阶段
primary广播 $<<PRE-PREPARE,v,n,d>_{\sigma_p}, m>$ 到所有的backup。 $n$ 是primary分配给该消息的序号，全局递增， $m$ 是client的请求消息， $d$ 是消息的digest摘要。然后签名。

如果满足以下条件，则backup会接受该pre-prepare请求，并且进入prepare阶段(The last condition prevents a faulty primary from exhausting the space of sequence numbers by selecting a very large one.):
- the signatures in the request and the pre-prepare message are correct and $d$ is the digest for $m$ ;
- it is in view $v$ ;
- it has not accepted a pre-prepare message for view $v$ and sequence number $n$ containing a different digest;
- the sequence number in the pre-prepare message is between a low water mark, $h$ , and a high water mark, $H$ .
prepare 阶段
backup广播 $<PREPARE,v,n,d,i>_{\sigma_i}$ 消息到所有replica（包含primary和backup）， $i$ 为该节点id。并且将 $<PRE-PREPARE>$ 和 $<PREPARE>$ 消息保存到本地log。replica接收到prepare消息后，校验签名正确， $v$ 是当前view， $n$ 在水位 $h$ 和 $H$ 之间，摘要 $d$ 和本地保存的 $PRE-PREPARE$ 消息里的 $d$ 相等。当replica $i$ 收到 $2f$ 个这样的 $PREPARE$ 消息后，就达到了 $prepared(m,v,n,i)$ 状态，进入commit阶段。
commit 阶段
replica $i$ 广播 $<COMMIT,v,n,D(m),i>_{\sigma_i}$ 消息到所有replica。replica收到后校验通过后，写入log。如果replica $i$ 达到了 $prepared(m,v,n,i)$ 状态，并且收到 $2f+1$ 个（包括自身的commit消息）校验通过的commit（ $v$ , $n$ , $d$ 相同且与log里记录的prepare吻合），则达到 $committed-local(m,v,n,i)$ 状态，并且其当前状态是消息序号小于 $n$ 的之前所有消息按序执行后的最新最全结果，则该replica执行 $m$ 里指定的操作。操作的结果返回到client。

The commit phase ensures the following invariant: if $committed-local(m,v,n,i)$ is true for some non-faulty then $committed(m,v,n)$ is true. This invariant and the view-change protocol described in Section 4.4 ensure that non-faulty replicas agree on the sequence numbers of requests that commit locally even if they commit in different views at each replica. Furthermore, it ensures that any request that commits locally at a non-faulty replica will commit at $f+1$ or more non-faulty replicas eventually.

这里指的是，达到 $committed(m,v,n)$ 只能保证至少 $f+1$ 个正常节点被最终commit，而并不是所有正常节点都执行。这就造成正常节点间都状态不一致，需要最终同步状态。或者使用上文提到Section 4.4里提到都view-change场景下重走commit流程最终达成状态一致。[引用2:Fabric0.6里的最终状态同步方案,3:介绍实时全正常节点同步的方案]

3. Garbage Collection

replica周期性对其本地状态生成快照，称为checkpoint。当replica $i$ 生成checkpoint同时，广播 $<CHECKPOINT,n,d,i>_{\sigma_i}$ 消息到其他所有replica，其中， $n$ 是生成该状态执行的最后的请求消息的序号， $d$ 是该状态的摘要digest。replica接收checkpoint消息，并且记录到log里，当收到 $2f+1$ 个相同的checkpoint消息（ $n$ , $d$ ）相同，则该checkpoint变成stable checkpoint。此时，小于等于 $n$ 的消息 $(pre-prepare, prepare, commit)$ 都会被丢弃，同时，之前的stable checkpoint也被丢弃。

实际上，一个replica会同时持有这样几个状态副本：一个stable checkpoint，若干个未达到稳定状态的checkpoint（生成后暂未收集到 $2f+1$ 个该状态的checkpoint消息），以及当前状态。stable checkpoint的最后的消息序号 $n$ 就是低水位 $h$ ,高水位 $H = h + k$ (where is big enough so that replicas do not stall waiting for a checkpoint to become stable).

4. View Changes

client在有效时间内没法收到 $f+1$ 个有效结果，则广播request到所有的replica。backup收到后，若该请求从未执行，则转发给primary，并且启动timer，在timer到期前若没有收到primary发出的pre-prepare，则判断该primary异常，发出view-change消息。backup $i$ , $<VIEW-CHANGE,v+1,n,C,P,i>_{\sigma_i}$ ， $v+1$ 是新的view number， $n$ 是该backup的stable checkpoint $s$ 的最后消息序号， $C$ 是证明 $s$ 的 $2f+1$ 个checkpoint消息， $P$ 是 $Pm$ 的集合， $Pm$ 是达到 $prepared(m,v,n,i)$ 状态的消息（包含pre-prepare消息和 $2f$ 个来自其他replica的matching消息，其中消息 $m$ 的序号大于 $n$ ）。

$v+1$ 的primary（通过 $p = （v+1）mod |R|$ 计算）收到 $2f$ 个有效view-change消息后，广播 $<NEW-VIEW,v+1,V,O>_{\sigma_p}$ ， $V$ 是收到的和primary自身的view-change消息的集合， $O$ 是pre-prepare消息的集合。计算如下，比较复杂

The primary determines the sequence number $min-s$ of the latest stable checkpoint in $V$ and the highest sequence number $max-s$ in a prepare message in $V$ .

The primary creates a new pre-prepare message for view $v+1$ for each sequence number $n$ between $min-s$ and $max-s$ .
There are two cases: (1) there is at least one set in the component of some view-change message in with sequence number $n$ , or (2) there is no such set.
In the first case, the primary creates a new message $<PRE-PREPARE,v+1,n,d>_{\sigma_p}$ where $d$ is the request digest in the pre-prepare message for sequence number $n$ with the highest view number in $V$ .
In the second case, it creates a new pre-prepare message $<PRE-PREPARE,v+1,n,d(null)>_{\sigma_p}$ where $d(null)$ is the digest of a special null request; a null request goes through the protocol like other requests, but its execution is a no-op. (Paxos [18] used a similar technique to fill in gaps.)

也就是说，primary从latest stable checkpoint重放之后所有达到 $prepared(m,v,n,i)$ 状态的消息（ $min-s<n<=max-s$ ），重发pre-prepare。replica会对这些消息重走流程，prepare和commit阶段，但是不再重新执行请求，因为在上一个view内，对这些达到 $prepared(m,v,n,i)$ 状态的消息已经执行过。

同时，replica记录收到的所有replica的view-change里最新的（ $n$ 最大）stable checkpoint记录到log里，并且对缺失的消息 $m$ 和最新的stable checkpoint（自身的stable checkpoint不是全局最新的stable checkpoint情况下）从其他的replica同步。

5. Correctness（重点）

Safety

if $prepared(m,v,n,i)$ is true then $prepared(m',v,n,j)$ is false for any non-faulty replica $j$ (including $i=j$ ) and any $m'$ such that $D(m')!=D(m)$

这是因为，到达 $prepare(m)$ 意味着 $2f+1$ 个replica选择发送 $prepare(m)$ ，假设异常节点个数为 $k(k<=f)$ , 则至少有 $2f+1-k$ 个正常replica发送 $prepare(m)$ ，同时也意味着至少有 $2f+1-k$ 个replica发送 $prepare(m')$ 。由于 $(2f+1-k)*2>3f+1-k$ ，说明这两部分正常节点有重叠，则至少有一个正常节点发送两个不同消息，矛盾。

The view-change protocol ensures that non-faulty replicas also agree on the sequence number of requests that commit locally in different views at different replicas. $committed(m,v,n)$ is true if and only if $prepared(m,v,n,i)$ is true for all $i$ in some set of $f+1$ non-faulty replicas. if $committed-local(m,v,n,i)$ is true for some non-faulty $i$ then $committed(m,v,n)$ is true.
这是因为，到达 $committed-local(m,v,n,i)$ 状态，则意味着至少收到 $f+1$ 个正常replica的commit消息，也就是至少有 $f+1$ 个正常节点到达 $prepared（m,v,n,i)$ 状态。而正常节点只有在接收到new-view消息才会从view $v$ 进入view $v+1$ ，而new-view消息里包含 $2f+1$ 个view-change消息。与上面推导方式相似，则这两个集合必定有一个交集包含正常节点k。若然 $m$ 已经包含在 $k$ 的stable checkpont内，则最终被同步到所有正常节点；若然没有，则包含 $k$ 的view-change消息内，然后在view $v+1$ 中被重新执行三阶段流程，最终也达到提交一致。

Liveness

replica广播view-change进入 $v+1$ ，等待 $2f+1$ 个view-change消息后启动计时器 $T$ 。如果在 $T$ 到时前没有收到new-view（新的primary是异常节点），则进入 $v+2$ ，计时器时间扩大成 $2T$ ，这是为了避免view改变的过快。
在定时器过期前，如果收到 $f+1$ 个view-change消息，并且比自身的view number大，则发送这些大view number中最小的一个
$f+1$ 个view-change消息才能触发view变更，而且primary的轮流选择方案也限定连续最多f轮选择就能选举到正常的primary。

Optimization

1. Reducing Communication

避免传输large result

A client request designates a replica to send the result; all other replicas send replies containing just the digest of the result.(指定单个replca回复完整result，其他回复该result的摘要用以校验)。 If the client does not receive a correct result from the designated replica, it retransmits the request as usual, requesting all replicas to send full replies.

减少流程步骤，replica提前返回reply。当replica收到足够prepare消息，达到了 $prepared(m,v,n,i)$ 状态，将要进入commit阶段。这时，如果小于该消息序号 $n$ 的之前的消息都已经执行生成了当前状态，则replica执行该请求，并且直接返回到client，同时继续后续的commit阶段。

The client waits for $2f + 1$ matching tentative replies. If it receives this many, the request is guaranteed to commit eventually. Otherwise, the client retransmits the request and waits for $f + 1$ non-tentative replies.

A request that has executed tentatively may abort if there is a view change and it is replaced by a null request. In this case the replica reverts its state to the last stable checkpoint in the new-view message or to its last checkpointed state (depending on which one has the higher sequence number).

优化read-only操作。对于read-only请求，client直接发送到所有replica，replica直接返回结果到client。client收集 $2f+1$ 个相同结果的回复，否则按照正常流程重新发起请求。

They(replica) send the reply only after all requests reflected in the tentative state have committed; this is necessary to prevent the client from observing uncommitted state.

题外话

所有的replica间需要全联接并且多次交互通信，系统中这些消息随着节点数指数增长。单次请求的最少消息数为 $1 + 3f + 3f(3f-f) + (3f-f+1)(3f+1) + 3f-1$ .

pBFT— Understanding the Consensus Algorithm

私货

为什么需要至少 $3f+1$ 个节点？
首先最后达成共识的正常节点数需要大于 $f$ ，否则这f个异常节点就可以达成错误的共识。即至少需要 $f+1$ 个正常节点达成共识。而达成共识时，有可能给最后共识投票的包含这f个异常节点，所以需要至少 $2f+1$ 个节点投票达成共识，这样可以保证其中至少含有 $f+1$ 个正常节点。（例如P为异常节点向不同的正常节点提议不同的值，这也是造成上文提到 $3f+1$ 个节点最后达成共识时只能保证多于 $f+1$ 个）。而最后的共识只能有一个，即达成共识的正常节点应该超过所有正常节点数的过半，否则 $f$ 个异常节点可以重复投票导致最后正常节点分裂成两组以上各自达成共识。因此，实际上最后的阈值 $T>=lower((R-f)/2)+1+f$ ，这样就保证了correctness正确性。接下来考虑liveness可用性。因为异常节点有可能不参与投票（P为正常节点），需要保证正常节点总数应该大于阈值，即 $R-f>=T$ 。

References

[1] M. Castro and B. Liskov. Practical Byzantine Fault Tolerance. Proceedings of the Third Symposium on Operating Systems Design and Implementation, New Orleans, USA, February 1999

[2] 区块链PBFT共识算法节点主动恢复设计与实现
[3] 拜占庭将军问题深入探讨