Grayscale Releases for Internet Products

Little scum · Posted on 3/9/2017 3:48:27 PM

The picture above shows a Tencent grayscale release: ordinary users can access the site, but an Alibaba Cloud server cannot, even though ping works and DNS resolves to the correct IP.

The site is simply unreachable from that server, which shows that Tencent also likes to use grayscale releases...

1. Why Grayscale Release

Internet services change frequently and release cycles are short. Speed and quality are always hard to combine.
Grayscale releases reduce release risk and narrow the scope of impact.
They reduce the dependence on testing and the cost of constructing data for offline self-testing.
Logs are easier to monitor in one place. After a full release, load balancing at every layer makes it difficult to trace a complete call chain.
You can grayscale test accounts first, then grayscale real user accounts once the test accounts pass, to further reduce release risk and impact.
Easy rollback.

Problems that cannot be solved by grayscale releases

It should be emphasized that the "tolerable impact" of a grayscale release must be recoverable. For example, an API may be unavailable for a period of time, but once repaired it can be called successfully again. Permanent loss or destruction of user data (such as product information or order information) is intolerable. Architects at Internet companies are therefore responsible for making sure that, if a production failure loses user data, the data can be restored through manual intervention to a recent state (for example, an hour to a week earlier), using measures such as regular backups of user data and operation logs.

Tip: grayscale test accounts first to reduce the risk of damaging or losing real users' data.

2. What effect is expected?
Whatever the change, we want specific requests to be routed to the changed version (the grayscale version) for observation and validation.

3. Grayscale strategy
The strategy defines which requests should be routed to the grayscale version (the grayscale machines). This is usually closely tied to the business. For APIs, for example, the requirements are generally the following:

Specific users (e.g., test accounts)
Specific apps (e.g., test apps or partner apps)
Specific modules and interfaces (only some interfaces need grayscale; this generally applies to changes in the API container, where less important APIs are used for grayscale testing)
Specific machines (requests from certain IPs are forwarded to the grayscale machine)
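
The four kinds of rule above can be sketched as a small matcher. This is only an illustration: the field names, the sample values, and the "any rule matches" semantics are assumptions, not something the original describes.

```python
from dataclasses import dataclass, field
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Request:
    user_id: str
    app_id: str
    path: str
    client_ip: str


@dataclass
class GrayPolicy:
    # All fields are illustrative; an empty collection means "no rule of this kind".
    users: set = field(default_factory=set)
    apps: set = field(default_factory=set)
    path_prefixes: tuple = ()
    ip_networks: tuple = ()

    def matches(self, req: Request) -> bool:
        """Return True if any configured rule selects this request for grayscale."""
        if req.user_id in self.users:
            return True
        if req.app_id in self.apps:
            return True
        if any(req.path.startswith(p) for p in self.path_prefixes):
            return True
        addr = ip_address(req.client_ip)
        return any(addr in ip_network(net) for net in self.ip_networks)


# Hypothetical sample policy: one test user, one partner app,
# one low-risk interface prefix, and one source network.
policy = GrayPolicy(
    users={"test_user_1"},
    apps={"partner_app"},
    path_prefixes=("/v2/lowrisk/",),
    ip_networks=("10.0.0.0/24",),
)
```

A request that matches none of the rules stays on the normal version, so an empty policy means no grayscale at all.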

4. Discussion of grayscale schemes
Option 1: A flag in the code decides between old and new logic, switching dynamically - Amazon's approach

Implementation:

Embed a switch in the code and branch on it with if-else. Turn the switch on for machines that need grayscale and leave it off everywhere else. Every release carries both versions of the logic.

Advantages

Fast rollback, with no need to republish or restart the system.

Disadvantages

Intrusive to the code.
Branching logic adds complexity.
The author used this method at Alibaba when switching the product database from Oracle to MySQL, controlling the switch with a state variable, which made for a smooth migration.
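
A minimal sketch of such a switch follows. The class name and the two dictionaries standing in for the old and new databases are hypothetical; the original only says that a state variable controlled the switch.

```python
class ProductStore:
    """Reads products from the old or the new backend depending on a runtime switch."""

    def __init__(self, old_backend, new_backend, use_new=False):
        self.old_backend = old_backend
        self.new_backend = new_backend
        # The switch: on for grayscale machines, off everywhere else.
        self.use_new = use_new

    def get_product(self, product_id):
        # The if-else branch that ships in every release.
        if self.use_new:
            return self.new_backend[product_id]
        return self.old_backend[product_id]


# Plain dicts stand in for the two databases (an assumption for illustration).
old_db = {1: "widget (old store)"}
new_db = {1: "widget (new store)"}
store = ProductStore(old_db, new_db)
```

Setting `store.use_new = True` moves reads to the new backend, and setting it back to `False` is the rollback, with no redeploy in either direction.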

Option 2: Pre-release machine - Alibaba's practice

Strictly speaking, this is not true grayscale, because the pre-release machine has only an internal IP and serves no external traffic. Verification requires binding the domain name to that machine. The data, however, is fully live. So it is essentially a simple way to grayscale certain specific users (internal test users who can reach the machine). The API side has a similar approach, our Gamma environment, and we also provide the Gamma machine's domain name so that external partners can help with testing.

Advantages

Simple

Disadvantages

Wastes a machine (it can be put into production after the pre-release is finished and removed from nginx during the pre-release, but this requires O&M support)
Not flexible enough
Applies only to access-layer machines; IDL services need to be handled separately.

Option 3: SET deployment

1. Isolated deployment by business

For example, in our current API container practice, deployment granularity can go down to the API level, with nginx in front doing the forwarding. For example:

Micro Shopping API Container: api.weigou.qq.com
Paipai API Container: api.paipai.com
Yixun API Container: api.yixun.com
Online Shopping API Container: api.buy.qq.com

The above is isolated deployment at the level of major business lines. It can be refined further to the module level. For example, the virtual-services e-commerce API is a sub-business module under Paipai, but its traffic grew significantly after it was connected to WeChat. To keep it from affecting Paipai's other businesses, and to keep other businesses from affecting it, two machines are deployed separately for this API, and nginx is configured to divert virtual API traffic to them:

Virtual API Container: http://api.paipai.com/v2/virbiz

This way, when we release a version, we can publish first to Yixun, which has the smallest business volume, and roll out to all the other platforms only after observing that there are no problems.
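
The forwarding and the per-business release described above can be modelled roughly as follows. The hosts and the /v2/virbiz path come from the text; the pool names, the version labels, and the first-match rule order are assumptions, and the real forwarding is done in nginx rather than in Python.

```python
# Pool names are hypothetical; hosts and the /v2/virbiz path come from the text above.
ROUTES = [
    # (host, path prefix, backend pool), with more specific prefixes listed first
    ("api.paipai.com", "/v2/virbiz", "virtual_pool"),
    ("api.paipai.com", "/", "paipai_pool"),
    ("api.weigou.qq.com", "/", "weigou_pool"),
    ("api.yixun.com", "/", "yixun_pool"),
    ("api.buy.qq.com", "/", "buy_pool"),
]

# Version currently deployed to each pool (labels are illustrative).
POOL_VERSION = {pool: "v1" for _, _, pool in ROUTES}


def route(host: str, path: str) -> str:
    """Return the backend pool for a request; the first matching rule wins."""
    for rule_host, prefix, pool in ROUTES:
        if host == rule_host and path.startswith(prefix):
            return pool
    raise LookupError(f"no pool for {host}{path}")


def release(pool: str, version: str) -> None:
    """Deploy a version to a single pool, leaving the others untouched."""
    POOL_VERSION[pool] = version
```

Calling `release("yixun_pool", "v2")` upgrades only the Yixun container, so requests routed to any other pool keep hitting the old version until those pools are released in turn.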

2. Isolated deployment by user

This is not well suited to open platforms, but it fits scenarios such as SNS very well. For example, the QQ system is divided into several sets by user number range, each containing 100 million consecutive numbers. Assuming the latest QQ number is close to 1 billion, there are 10 sets in total (Set 1 to Set 10). Each release can then go to one set first, and because users with high QQ numbers are often not the most important users, Set 10 is released first.
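
The number-to-set arithmetic can be sketched like this. The set size comes from the text; the exact boundary convention (which set the number 100,000,000 itself falls into) and the function names are assumptions.

```python
SET_SIZE = 100_000_000  # 100 million consecutive numbers per set, as in the text


def set_for(qq_number: int) -> int:
    """Map a QQ number to its 1-based set. The boundary convention is an assumption."""
    if qq_number < 0:
        raise ValueError("QQ number must be non-negative")
    return qq_number // SET_SIZE + 1


def release_order(latest_qq_number: int) -> list:
    """Sets ordered for rollout: the highest set (newest numbers) goes first."""
    highest = set_for(latest_qq_number)
    return list(range(highest, 0, -1))
```

With the latest number just under 1 billion, `release_order` starts at Set 10 and ends at Set 1, matching the rollout order described above.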

Advantages

Isolated deployment keeps the impact between business lines to a minimum, and grayscale release is supported automatically.

Disadvantages

The granularity of grayscale depends on the granularity of the isolated deployment, which is generally coarse.
Wastes machines compared with centralized deployment.
Versions may differ between business lines, which makes unified management harder.
There are certain implementation and deployment costs.

Option 4: Dynamic routing

Method: Use a flexibly configurable grayscale policy to influence the behavior of the load balancer, so that it returns the IP and port of the grayscale service according to the policy.

Suitable for grayscaling back-end IDL services.
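
As a rough sketch of the idea, a resolver can wrap an ordinary round-robin lookup and return a grayscale endpoint when the policy says so. Everything here is hypothetical: the class, the addresses, and the policy callback are made up for illustration and do not reflect the interface of L5 or of the configuration center.

```python
import itertools
from typing import Callable, Dict, List, Tuple

Endpoint = Tuple[str, int]  # (ip, port)


class GrayAwareResolver:
    """Round-robin endpoint lookup that switches pools when the gray policy matches."""

    def __init__(self, normal: List[Endpoint], gray: List[Endpoint],
                 is_gray: Callable[[Dict[str, str]], bool]):
        self._normal = itertools.cycle(normal)
        self._gray = itertools.cycle(gray)
        self._is_gray = is_gray  # the configurable grayscale policy

    def resolve(self, request: Dict[str, str]) -> Endpoint:
        """Return the IP and port the caller should connect to for this request."""
        pool = self._gray if self._is_gray(request) else self._normal
        return next(pool)


# Hypothetical addresses and policy: only one test user goes to the gray machine.
resolver = GrayAwareResolver(
    normal=[("10.0.0.1", 8080), ("10.0.0.2", 8080)],
    gray=[("10.0.9.1", 8080)],
    is_gray=lambda req: req.get("user") == "test_user_1",
)
```

Because the policy is just a callable, it can be reloaded from configuration without touching the callers, which is the flexibility this option is after.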

Advantages

Flexible and controllable.

Disadvantages

Neither the current configuration center nor L5 was designed with custom routing policies in mind, and neither is extensible, so this has to be developed outside of them.
API metadata sources are already scattered: API and IDL metadata, API levels, and frequency limits live in different data sources, and a grayscale routing data source would now have to be added as well.

There are generally three ways to implement grayscale releases with nginx: nginx+lua, nginx splitting by cookie, and nginx distributing by weight:
nginx+lua distinguishes visitors by IP address. Because our company goes out through a single egress IP, everyone in the company would see either the old version or the new version, so this method does not suit us.
nginx distributes traffic by weight, which is simple to implement and worth trying.
nginx splits traffic by cookie, so the grayscale is applied per user.
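
The difference between the weight-based and cookie-based splits can be illustrated with a short simulation. This is not nginx configuration: the function names, the "uid" cookie name, and the use of an MD5 hash for bucketing are assumptions made for the sketch.

```python
import hashlib
import random


def pick_by_weight(weights: dict, rng: random.Random) -> str:
    """Pick a version with probability proportional to its weight."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]


def pick_by_cookie(cookies: dict, gray_percent: int) -> str:
    """Pick a version from a stable hash of the user cookie.

    The same user always lands in the same bucket, so they keep seeing
    the same version for a given gray_percent.
    """
    uid = cookies.get("uid")  # "uid" is a hypothetical cookie name
    if uid is None:
        return "old"  # no cookie: stay on the old version
    bucket = int(hashlib.md5(uid.encode("utf-8")).hexdigest(), 16) % 100
    return "new" if bucket < gray_percent else "old"
```

The weight-based pick may send the same visitor to different versions on successive requests, while the cookie-based pick is sticky per user, which is why the latter suits user-level grayscale.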
