Some basic theories of distributed systems

Some basic theories of distributed systems


A year ago, I did some related summaries of Zookeeper, and now we pick it up again, and re-examine some of the causes and consequences more clearly.

Previous link

High concurrency from scratch (1) --- The basic concept of Zookeeper

High concurrency from scratch (2) --- Zookeeper realizes distributed locks

High concurrency from scratch (3) --- Zookeeper cluster construction and leader election

High concurrency from scratch (4) --- Zookeeper's distributed queue

High concurrency from scratch (5) --- Zookeeper configuration center application

High concurrency from scratch (6) --- Zookeeper's Master election and a small overview of the official website

1. The relationship between distributed systems and Zookeeper

1.1 Centralized service

We start the development process of service deployment architecture is talking about, in fact, nothing more than a centralized and distributed , centralized say, what I get is a machine made of. Distributed is the joint completion of multiple servers. So at the beginning, we usually start with a server and deploy our services. Then there are some old routines. Web applications are deployed on Tomcat and port 8080 is opened to provide services, and then a database service it needs is opened. Provided on port 3306. Its advantage lies in the relatively simple structure, deployment, and project structure.

Then expand according to the development of the business. The expansion can also be divided into two ways, one is horizontal expansion and the other is vertical expansion. Since one is unavailable, then either improve the performance of this server, or just install a few more. But let's think about it, it's not an individual who arranges the server to be obedient. As soon as this machine hangs up, it all hangs up. Moreover, the purchase of mainframes, as well as R&D and maintenance personnel, all cost a lot of money. Here is an extension of "Moore's Law" for everyone

Anyway, to put it simply, I can't buy twice the performance for twice the amount of money. But the horizontal expansion is different. One person can't beat it, can't it be done by calling more people together?

1.2 Go to IOE campaign

A slogan developed by Alibaba, the specific points are IBM minicomputers, Oracle databases, and EMC's high-end storage. Those who are interested can also learn about it. Because the problem faced at that time was that if a company needed to improve the processing capacity of a single machine, the cost would be very high and the price/performance ratio was extremely low. I was afraid of this and that all day, and the whole service would be stopped when the machine went down. Slowly, many domestic companies responded together, and the distributed system started.

1.3 Distributed Service

Distributed system has its specific definition: a distributed system is a system in which hardware or software components are distributed on different network computers, and they communicate and coordinate with each other only through message passing . So it's just a bunch of computers that unite to provide services to the outside world, but to the user, it's like a machine doing this.

There are many characteristics, roughly the following 5:

  1. Distribution: This means that multiple computers are placed in different locations
  2. Peer-to-peer: Multiple working nodes in the cluster are all the same thing, and they all do the same job. And there is the concept of copy
  3. Concurrency: Data inconsistency may be caused by multiple machines operating a piece of data at the same time
  4. Global clock: The sequence of events on multiple hosts will affect the results, which is also a very complicated problem in distributed scenarios
  5. Various failures: a certain node is down, the network is not good... Sudden situation

1.4 Several problems frequently encountered in distributed scenarios

  1. Abnormal communication: In fact, it is a network problem, which leads to inconsistent data in the multi-node state
  2. Network isolation: This means that each sub-network is normal, but the network of the entire system is abnormal. Problems that cause local data inconsistency
  3. Node downtime
  4. Distributed three states: success, failure, timeout, these three states lead to various problems. Both the request sending and the result response may be lost, and it is impossible to determine whether the message was sent/processed successfully
  5. Data loss: This is generally resolved by reading from other nodes through a copy mechanism, or for stateful nodes, the loss of data can be resolved by restoring the state.

Exception handling principle: any abnormal situation considered in the design phase must be assumed to occur in actual operation

1.5 Measuring the performance standards of distributed systems

  1. Performance: Mainly refers to throughput, response delay, and concurrency . The total amount of data that the system can process at a certain time is usually measured by the total amount of data processed by the system per second, and the response delay refers to the time required to complete a certain function. Concurrency is the ability to complete a function at the same time, usually measured by QPS

  2. Usability: The ability to correctly provide services in the face of various abnormalities. For example, the five nines we often say refer to only 5 minutes of downtime in a year. 6 9s is 31 seconds

  3. Scalability: refers to the effect of improving system performance by expanding the scale of the machine

  4. Consistency: copy management

But these standards are too high in one aspect, which will lead to deterioration in the other. For example, we need to be highly available and may need multiple copies, but in the state of multiple copies, it is difficult for data consistency To do it. Then it is difficult to achieve low latency under high throughput, so we need to consider our business scenarios.

1.6 For consistency expansion

  1. Strong consistency: After the write operation is completed, the read operation must be able to read the latest data. This is very difficult to achieve in a distributed scenario, such as the Paxos algorithm, the Quorum mechanism, and the ZAB protocol.

  2. Weak consistency: It does not promise that the written value can be read immediately, nor is it promised how long the data will be consistent, but it will be guaranteed to a certain time level (such as XX hours, XX minutes, and XX seconds later). A consistent state can be achieved.

It also has a special case called eventual consistency, which is to ensure data consistency as quickly as possible. But there is no precise definition of how fast this is. For example, the female ticket wants to eat fried chicken, you order a takeaway, but the rider of Meituan, are you hungry, the rider can't tell when it will be delivered, he can only promise to deliver it as soon as possible. That's what it means.

Because the final consistency is too weak, we still have some special cases where read and write consistency occurs. It means that users can always see their updated content as soon as they read the results written by themselves. This is like WeChat. It s the same as in Moments. WeChat will definitely let us see what we send, but if you can see it right away, I m not sure about it .

There are also some monotonous read consistency, and the causal consistency will not be explained. Interested friends can search by themselves.

All in all, in order to ensure the high availability of the system, prevent problems caused by single points of failure, and enable replicas distributed on different nodes to provide services to users normally, at this time, our Zookeeper came into being. It can help us solve the problem of data consistency in this distributed system

To solve this problem, we need to understand distributed transactions, distributed consensus algorithms, Quorum mechanisms, CAP and BASE theories, and then we will slowly expand

2. distributed transactions

Transaction: In a stand-alone storage system, it is used to ensure the consistency of the data state of the storage system. Is this a bit confusing to read? It s okay. Let s put it another way. A transaction in a broad sense refers to all operations of a thing, or all operations are successful. Not all fail, there is no intermediate state. In a narrow sense, those operations that the database does. The feature is also very simple, it is the familiar ACID.

In a distributed system, each node only knows whether its operation is successful, but does not know what the other nodes are. This may cause the status of each node to be inconsistent, so in order to achieve ACID that spans multiple nodes and guarantees transactions When, you need to introduce a coordinator , and then each node participating in the transaction is called a participant

The typical routines are 2PC and 3PC, and then we will slowly expand

2.1 What is 2PC

There will be multiple roles in the process of participating in the transaction. For the time being, we first understand that the coordinator is responsible for the initiation of the transaction and the participant is responsible for the execution of the transaction.

Suppose there are the above three roles, namely one coordination and two participation. At this time, we need A and B to perform a transaction, and require this transaction to either succeed or fail at the same time.

2PC Phase 1: Executing the transaction

At this point coordinator will first issue an order requiring participants A, B participants will have to execute this transaction, but not submitted

Speaking in more detail, it will generate redo and undo logs, lock resources, and execute transactions. But after the execution is complete, report directly to the coordinator and ask, can I submit it, brother?

This should often encountered in the course of daily writing of Java, it is written in front of a lot of operations, but will wait until the last write a conn.commit () such a thing, which is called the execution but does not commit

2PC Phase Two: Commit the transaction

When the coordinator receives feedback from all transaction participants (A, B in the figure) in the first stage (this feedback is simply understood as telling the coordinator that the previous first stage was successfully executed), it sends a command to all The participant commits the transaction .

If you want to be more detailed, that is, the coordinator receives the feedback, and all participants respond and can submit, then notify the participants to commit, otherwise rollback

So 2PC is also called two-stage submission . In fact, it is simply divided into two steps, one-step execution and one-step submission.

4 disadvantages of 2PC: performance

Looking at the entire process, we can see that this obviously has caused synchronization blockage, and each node that needs to operate the database occupies the resources of the database. Only when the coordinator receives the feedback that all nodes are ready, the transaction coordinator will notify the commit or rollback, and the participant will release the resource after the commit or rollback operation is completed.

4 disadvantages of 2PC: single point of failure

Then we just learned that the coordinator is the core of this matter. If the coordinator fails at this time, it will cause the problem that the notification cannot be communicated to the participants. For example, if the commit or rollback is not received, the entire transaction will be stalled.

4 disadvantages of 2PC: inconsistent data

The coordinator will send a commit or rollback in the second phase. However, this does not guarantee that every node will receive this command normally, so it may flee. Participant A receives the order and commits the transaction, but Participant B does not. Therefore, network fluctuations are the eternal cause, and you will never be able to avoid this factor.

4 shortcomings of 2PC: there is no fault tolerance mechanism

The coordinator needs to receive all the node feedback and preparation is completed before issuing the commit instruction. If any participant s response is not received, the coordinator will wait, and as long as there is a down node, the entire transaction will fail. Roll back.

2.2 What is 3PC

An improvement was made on the premise of 2PC, and the preparation phase in 2PC was split to form three phases: can commit, pre commit, and do commit.

In addition, a timeout mechanism is introduced . Once the transaction participant does not receive the commit or rollback instruction from the coordinator within a specified time, the local commit will be automatically performed to solve the single point of failure of the coordinator

3PC first phase cancommit

The coordinator first asked: Hey, can you guys do it? Participants answered yes or no based on their actual situation.

3PC second stage precommit

If the participants all return to agree, the coordinator sends a pre-commit request to all participants and enters the preparation phase. The preparation phase here is actually to let the participants lock resources, wait for instructions, and then execute the transaction. Also like 2PC, execute but not submit. Then wait for the instruction from the coordinator. If you can t wait for the instruction, it will be submitted locally after a period of time.

But this will also have drawbacks. For example, the coordinator successfully sends rollbacks to both participants 1, and then 3 does not receive it, then 3 is automatically submitted, so the timeout mechanism does not completely guarantee the consistency of the data.

3. distributed consensus algorithm

3.1 Paxos algorithm

I do not know I have not seen the previous year's essay zero-based high concurrency (c) --- build and leader election Zookeeper cluster To learn more about recommended Jump into the larger oh.

Paxos algorithm is a consensus algorithm based on message passing and highly fault-tolerant proposed by Lesile Lamport.

Does it feel convoluted? It's okay, we just need to know that in a distributed system, processes will inevitably be killed, messages will be delayed, repeated, lost... A series of problems, Paxos algorithm is what still guarantees data consistency under these abnormal conditions. So what does this thing have to do with Zookeeper? Zookeeper has a ZAB protocol, but the bottom layer of this ZAB protocol encapsulates the Paxos algorithm.

3.2 The roles that exist in Paxos and the relationship with the Zookeeper cluster

Proposer: As the name suggests, the person who initiated the proposal

Acceptor: They can vote and can accept or reject proposals

Learner: If the proposal is accepted by more than half of the Acceptors, learn the proposal

Mapped to the Zookeeper cluster, they are leader, follower, and observer. They are a bit like the chairman, deputies of the National People's Congress, and the people of the whole country. The chairman puts forward a proposal, and the people's deputies participate in the vote. The people all over the country passively accept it. feel. Compared with the previous 2PC and 3PC, it only needs half of the pass to submit. So this is weak consistency , 2PC, 3PC are strong consistency

3.3 Raft algorithm

Please click on this link, I believe you will be able to master it soon. me just explain it a little bit. This is a PPT format. I will tell you what Raft is, very easy to understand. I will skip some of the previous things and go straight to the topic.

Speaking of here, Raft is a protocol for implementing distributed consensus algorithms

It is assumed that a node has 3 different states

The first type, follower state (wireless bar)

The second, candidate state (dotted line)

The third type, the leader state (solid line) remembers that the leader is selected from the candidate candidates

First of all, when we come up, all nodes are in follower state

Next, all the follower nodes look for the leader. When they can't find it, they will spontaneously become a candidate and initiate a vote (ask others if they approve of me becoming the leader). Under what circumstances will they find it? It must be that the leader has died

At this time, it sends a proposal to other nodes to vote, and then other nodes will also give it feedback. When it receives feedback from more than half of the nodes, it can become the leader as a matter of course.

After that, the request to write data will be sent directly to the leader, and the leader will broadcast to other followers. At this time, as long as more than half of the nodes return positive feedback, the data writing transaction will be executed, and then the leader will send them a commit command. , The transaction is executed successfully.

3.4 ZAB Agreement

Content in the zero-based high concurrency (four) --- Zookeeper distributed queue

The underlying implementation of Zookeeper is the ZAB protocol, which implements the functions of crash recovery (leader crash) and message broadcasting (the client writes data to Zookeeper to ensure that multiple nodes are successfully written). The main thing is to ensure that the transaction submitted on the leader server is finally submitted by all servers, and to ensure that the transaction only proposed on the leader server is discarded

3.5 Quorum NWR mechanism

Quorum NWR: The Quorum mechanism is commonly used in distributed scenarios to ensure data security and to achieve eventual consistency voting algorithms in a distributed environment. The main principle of this algorithm comes from the pigeon nest principle. Its biggest advantage is that it can not only achieve strong consistency, but also customize the consistency level.

The pigeon nest principle, also known as the Dirichlet drawer principle and the pigeon cage principle.

One of the simple expressions is: If there are n cages and n+1 pigeons, and all the pigeons are kept in the pigeon cage, then at least one cage has at least 2 pigeons.

The other is: if there are n cages and kn+1 pigeons, all the pigeons are kept in the pigeon cage, then at least one cage has at least k+1 pigeons.

Why start from the drawer principle? Firstly, everyone is familiar with this and easy to understand. Secondly, it has similarities with Quorum mechanism. Drawer principle, each drawer of 2 drawers can hold 2 apples at most. No matter how you put 3 apples, there will definitely be 2 apples in one of the drawers. Then we change the drawer principle. One of the two drawers contains 2 red apples and the other contains 2 green apples. We take out 3 apples. No matter how we take it, at least 1 of them is a red apple. This is also very understandable. simple. We regard red apples as updated valid data, and green apples as invalid data that has not been updated. It can be seen that we can get valid data without updating all the data (not all red apples). Of course, we need to read multiple copies (take out multiple apples).

What does NWR refer to back to Quorum NWR mechanism?

N: The number of replicated nodes, that is, the number of copies of a piece of data that are saved. W: The number of nodes where the write operation is successful, that is, the number of copies that are successfully written each time data is written. W must be less than or equal to N. R: The minimum number of nodes required for a read operation to obtain the latest version of data, that is, the number of copies that must be read at least for each successful read.

Summary: These three factors determine the availability, consistency and partitions fault tolerance . As long as it is guaranteed (W + R> N), the latest data can be read, and the data consistency level can achieve strong consistency based on the constraint of the number of read and write copies!

The discussion is divided into the following three situations: the premise, when N has been fixed.

W = 1, R = N Write Once Read All

In a distributed environment, if you write a copy, if you want to read the latest data, you must read all nodes, and then take the value of the latest version. The write operation is efficient, but the read operation is inefficient. High consistency, poor partition fault tolerance, low availability

R = 1, W = N, Read Only Write All

In a distributed environment, all nodes are synchronized before they can be read, so as long as any node is read, the latest data can be read. Read operations are efficient, but write operations are inefficient. Partitioning has good fault tolerance, poor consistency, higher implementation difficulty, and high availability

W = Q, R = Q where Q = N/2 + 1

It can be simply understood as writing more than half of the nodes, then reading more than half of the nodes, achieving a balance of read and write performance. It is suitable for general applications, with a balance between read and write performance. For example, N=3, W=2, R=2, partition fault tolerance, availability, and consistency are balanced. And ZooKeeper is designed like this

What needs to be added is that Zookeeper does not realize that the client must read more than half of the nodes, so it allows the client to read the data that is not the latest synchronization completed, but the possibility is relatively small. The node whose data is not synchronized is actually not connected to the client, because whether it is a network problem or a machine problem that caused the leader to send data and it could not do it in the past, the client will definitely not be able to connect. If it happens that the client initiates the access in the intermediate state of the synchronized data, there is a way to solve it, and you can find out for yourself.

3.6 CAP theory

CAP theory: It was first proposed in July 2000. The CAP theory tells us that a distributed system cannot meet the three requirements of C, A and P at the same time.

C: Consistency, strong consistency, multiple copies of data in a distributed environment are consistent A: Availability, high availability, the service provided by the system must always be available, and every user's operation request can always return results within a limited time P: Partition Tolerance partition fault tolerance. When a distributed system encounters any network partition failure, it still needs to be able to provide services that meet consistency and availability.

Since a distributed system cannot meet the three requirements of C, A, and P at the same time, we have to choose based on our needs.

Give up P: The simplest extreme method is to place it on one node, which means there is only one copy of data, and all read and write operations are concentrated on one server, which has a single point of failure. Giving up P means giving up the scalability of the system, so in general, distributed systems will guarantee P

Abandon A: Once the system encounters a network partition or other failures, the service needs to wait for a period of time, and the service cannot be provided normally within the waiting time, that is, the service is unavailable

Giving up C: In fact, giving up consistency refers to giving up strong data consistency and retaining final consistency. How long it takes to achieve data synchronization depends on the design of the storage system

CAP can only choose 2 from 3, because in a distributed system, fault tolerance P is definitely a must, so there are only two situations at this time. Network problems lead to either error return or blocking waiting. The former sacrifices consistency, and the latter At the expense of usability. For example, HBase pursues data consistency, while Cassandra is usability.

Summary of experience:

1 CAP 
2 A C 
3 P  CA   MySQL
4 P A C A C 
  HBase, Redis  

Therefore, the BASE theory was born.

3.7 BASE theory

In most cases, in fact, we do not necessarily require strong consistency. Some businesses can tolerate a certain degree of delay consistency. Therefore, in order to take into account efficiency, the final consistency theory BASE has been developed, which is proposed by the architect of eBay. The full name of BASE theory: Full name: Basically Available (basically available), Soft state (soft state), and Eventually consistent (eventually consistent) the abbreviation of the three phrases. The core idea is: Even if strong consistency cannot be achieved, each application can use an appropriate method to achieve the ultimate consistency of the system according to its own business characteristics. In one sentence, don't go to extremes when doing things. BASE is the result of weighing C and A in the CAP theory.

Not a strong agreement, but a final agreement. Not high availability, but basic availability.

Basically Available (basically available): Basically available means that when a distributed system fails, it is allowed to lose part of the availability, that is, to ensure that the core is available

For example, Taobao Double 11, in order to protect the stability of the system, normal orders, other edge services may be temporarily unavailable. Downtime of some non-core services is allowed at this time.

Soft State (soft state): Soft state refers to allowing the system to have an intermediate state , and the intermediate state will not affect the overall system availability. Generally, there are at least three copies of a piece of data in distributed storage, and the delay that allows the synchronization of copies between different nodes is a manifestation of soft state. In layman's terms: Delays are allowed when different nodes synchronize data , and the intermediate state that exists when data synchronization delays occurs will not affect the overall performance of the system

Eventually Consistent: Eventually consistency means that all data copies in the system can finally reach a consistent state after a certain period of time . Weak consistency is the opposite of strong consistency. Eventual consistency is a special case of weak consistency, requiring final consistency instead of real-time strong consistency.


In general, we mentioned the analysis of centralized and distributed service deployment architectures, and the various problems encountered in designing distributed systems: the problem of data consistency

2PC and 3PC are common ideas, but there are still shortcomings. Even if Paxos Raft ZAB has related thorny problems such as abnormal distributed network communication, the above algorithms can also achieve consistency.

The parliamentary Quorum NWR mechanism: R + W> N ===> The minority obeys the majority

Conflict between consistency and availability, CAP and BASE: distributed systems must satisfy P, only trade-offs can be made between C and A

Most systems are BASE systems (basically usable + eventually consistent)