
Thoughts on rate limiting

An API, /srm/api2/disabletime, needs to support a maximum capacity of 600 qps. Once traffic exceeds 600 qps, requests should be rate limited and the service should return HTTP status code 429.

There are currently 20 business nodes. To improve availability across regions, they are distributed over two data centers, gz and gz6, with ten machines in each.

There are usually two ways to rate limit a service, each with its own advantages and disadvantages. This experiment uses the first: each node enforces its own local limit. To ensure that limiting kicks in once the total traffic reaches 600 qps, under ideal circumstances (that is, traffic is distributed evenly) each machine should be configured with a limit of 600 / 20 = 30 qps. In practice, however, the daily request counts for this API are very uneven across machines: some handle only single-digit qps while others reach more than 50.
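The article does not show the limiter itself, but a minimal per-node sketch in Go might look like the following (using golang.org/x/time/rate; the handler body, port, and burst size are assumptions, not the real implementation):

```go
package main

import (
	"log"
	"net/http"

	"golang.org/x/time/rate"
)

// Per-node limit: 600 total qps / 20 nodes = 30 qps per machine.
// The burst of 30 is an assumption; the article does not specify one.
var limiter = rate.NewLimiter(rate.Limit(30), 30)

func disableTimeHandler(w http.ResponseWriter, r *http.Request) {
	// Reject immediately with 429 once this node's local budget is used up.
	if !limiter.Allow() {
		http.Error(w, "too many requests", http.StatusTooManyRequests)
		return
	}
	w.Write([]byte("ok")) // placeholder for the real business logic
}

func main() {
	http.HandleFunc("/srm/api2/disabletime", disableTimeHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```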

Some readers may be wondering: what impact does uneven traffic across nodes actually have?

Let me explain.

When the traffic of the whole service reaches 600 qps, uneven distribution means some nodes receive 40 or 50 qps while others receive only single digits such as 8 or 5. Since each machine is configured with a 30 qps limit, the nodes receiving more than 30 qps will trigger rate limiting, so the service as a whole starts rejecting requests before its total traffic ever reaches 600 qps.
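To make the effect concrete, here is a small sketch with made-up per-node numbers: ten hot nodes at 40 qps and ten cold nodes at 20 qps offer exactly 600 qps in total, yet with a 30 qps cap per node only about 500 qps actually gets through, and the hot nodes are already returning 429s:

```go
package main

import "fmt"

func main() {
	// Hypothetical uneven distribution of 600 qps across 20 nodes
	// (ten hot nodes at 40 qps, ten cold nodes at 20 qps).
	perNodeQPS := make([]float64, 0, 20)
	for i := 0; i < 10; i++ {
		perNodeQPS = append(perNodeQPS, 40, 20)
	}

	const perNodeLimit = 30.0
	offered, served := 0.0, 0.0
	for _, q := range perNodeQPS {
		offered += q
		if q > perNodeLimit {
			served += perNodeLimit // requests above the per-node cap get a 429
		} else {
			served += q
		}
	}
	fmt.Printf("offered=%.0f qps, served=%.0f qps\n", offered, served)
	// Output: offered=600 qps, served=500 qps
	// The hot nodes start throttling even though the service-wide total
	// never exceeded the intended 600 qps budget.
}
```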

To understand why the traffic is uneven, you first need to understand the overall architecture.

My initial understanding of the architecture was as follows. In this picture the number of SLB machines in gz and gz6 clearly differs, which would make the business machines in gz receive more traffic than those in gz6.

In theory, traffic arriving at the gz SLBs is forwarded to the nginx in gz, and traffic arriving at the gz6 SLBs is forwarded to the nginx in gz6.

Tencent Cloud's load balancer does not distinguish between gz and gz6 when forwarding traffic to the SLBs; it round-robins across all SLB machines.

Since gz has more SLB machines than gz6, the gz data center would receive more traffic than gz6, and gz's nginx would receive more traffic than gz6's nginx.
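As a rough illustration of that reasoning (the article never gives the actual SLB counts, so the numbers below are invented), a round-robin over all SLB machines splits traffic in proportion to each room's SLB count:

```go
package main

import "fmt"

func main() {
	// Hypothetical SLB counts per data center; the article only says they differ.
	gzSLBs, gz6SLBs := 3, 2
	totalQPS := 600.0

	// Round-robin over all SLBs gives each room a share proportional to its SLB count.
	gzShare := float64(gzSLBs) / float64(gzSLBs+gz6SLBs)
	fmt.Printf("gz:  %.0f qps\n", totalQPS*gzShare)     // 360 qps
	fmt.Printf("gz6: %.0f qps\n", totalQPS*(1-gzShare)) // 240 qps
}
```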

The above explanation sounds reasonable, and its conclusion would be: the two data centers have the same number of business nodes but different numbers of SLBs, so the two rooms receive uneven traffic, which in turn makes the traffic across the business nodes in gz and gz6 uneven.

The figure below shows the traffic trend during a 2-minute stress test of the API at 600 qps. The traffic in gz and gz6 is roughly evenly split, at about 300 qps each, which indirectly disproves the idea above.

The traffic split in the two charts above confirms that the architecture diagram was wrong. After further investigation, the correct architecture is as follows; it differs from the earlier diagram in several ways. Under this architecture the traffic reaching the business nodes should also be even, yet in fact it is not. Let's look at what actually happens, assuming the round-robin order over the business nodes is a -> b -> c.

The key point is that the SLB's round-robin is not applied per URL but per service: all requests to the service, whatever their URL, share one round-robin sequence. Suppose the first stress-test request hits business node a. If there were no other requests to the service at that moment, the second stress-test request would be guaranteed to hit node b. In reality, though, there are always other requests arriving at the same time, so the second stress-test request may hit node a again, node b, or any other node in the cluster. My guess was that this is what makes the traffic for a single URL uneven.
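The conjecture can be illustrated with a small simulation (all numbers are made up): one round-robin counter is shared by the whole service, and whether a given slot belongs to our URL is effectively random, so over a short window the per-node counts for that single URL drift noticeably around their average even though the total per-node traffic stays even:

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const nodes = 20
	perURLCount := make([]int, nodes)

	// One round-robin counter shared by all traffic to the service.
	next := 0
	for i := 0; i < 10000; i++ {
		node := next % nodes
		next++
		// Assume only ~6% of the service's requests are for /srm/api2/disabletime;
		// which round-robin slot they occupy is effectively random.
		if rand.Float64() < 0.06 {
			perURLCount[node]++
		}
	}
	// Total traffic per node is perfectly even (500 requests each), but the
	// per-URL counts spread around their average of 30, roughly from the
	// low 20s to around 40 in a typical run.
	fmt.Println(perURLCount)
}
```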

To verify this conjecture, we can check whether the total request count (across all URLs) on every machine at 21:49:00 is roughly equal. If it is, the conjecture holds.

As the data shows, the request volume on all machines is roughly the same, so the conjecture is correct.

Using the first method has a major prerequisite: the traffic must be distributed evenly across machines, with each machine's configured qps = total qps / number of nodes.

In practice, however, traffic is usually uneven due to network instability or other reasons, so each node's limit needs some buffer on top of its even share, for example 40 or 50 qps instead of 30.
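In code this only changes the limiter configuration; a sketch with a hypothetical 1.5x buffer factor, which puts each node at 45 qps, between the 40 and 50 mentioned above:

```go
package main

import (
	"fmt"

	"golang.org/x/time/rate"
)

func main() {
	// totalQPS and nodeCount come from the article; the 1.5x buffer factor
	// is an assumption chosen to land between 40 and 50 qps per node.
	const (
		totalQPS     = 600.0
		nodeCount    = 20.0
		bufferFactor = 1.5
	)
	perNode := totalQPS / nodeCount * bufferFactor // 45 qps

	limiter := rate.NewLimiter(rate.Limit(perNode), int(perNode))
	fmt.Printf("per-node limit: %.0f qps, allow now: %v\n", perNode, limiter.Allow())
}
```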