Manjusaka

Several explosive theories of the cloud-native era

Since moving into cloud-native work last year, I have accumulated some thoughts, so as my last technical blog post of the year I want to share a few explosive theories about cloud native. This article is purely personal commentary and does not reflect my company's stance.

Overview#

The concept of cloud native was officially proposed around 2014-2015. In 2015, Google led the establishment of the Cloud Native Computing Foundation (CNCF). In 2018, CNCF first defined the concept of cloud native in CNCF Cloud Native Definition v1.0:

Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds. Containers, service meshes, microservices, immutable infrastructure, and declarative APIs exemplify this approach.
These techniques enable loosely coupled systems that are resilient, manageable, and observable. Combined with robust automation, they allow engineers to make high-impact changes frequently and predictably with minimal toil.

Reading the official definition, I would rather call it a vision than a definition: it does not clearly delimit the scope and boundaries of this new concept, nor does it explain what separates Cloud Native from Non-Cloud Native.

From a personal perspective, a cloud native application possesses the following characteristics:

  1. Containerization
  2. Service-oriented

An organization practicing cloud native should have the following characteristics:

  1. Heavy use of Kubernetes or other container orchestration platforms (such as Shopee's self-developed eru2)
  2. A complete monitoring system
  3. A complete CI/CD system

Based on this, and having seen a lot of recent discussion about the new concept of cloud native, I want to share my three personal explosive theories (the numbers in them are my subjective judgment, so please be gentle).

  1. Over 95% of companies have not finished building a CI/CD system, nor have they standardized the management of their online service processes.
  2. Over 90% of companies do not have the technical reserves to implement microservices.
  3. Over 90% of companies do not have the technical reserves to support containerization.

Starting the Theories#

1. Over 95% of companies have not finished building a CI/CD system, nor have they standardized the management of their online service processes.#

CI stands for Continuous Integration, and CD for Continuous Delivery (or Continuous Deployment). Generally speaking, CI and CD are defined as follows (quoting Brent Laster's definition in What is CI/CD?):

Continuous integration (CI) is the process of automatically detecting, pulling, building, and (in most cases) doing unit testing as source code is changed for a product. CI is the activity that starts the pipeline (although certain pre-validations—often called "pre-flight checks"—are sometimes incorporated ahead of CI).
The goal of CI is to quickly make sure a new change from a developer is "good" and suitable for further use in the code base.
Continuous deployment (CD) refers to the idea of being able to automatically take a release of code that has come out of the CD pipeline and make it available for end users. Depending on the way the code is "installed" by users, that may mean automatically deploying something in a cloud, making an update available (such as for an app on a phone), updating a website, or simply updating the list of available releases.

In our practice, the boundaries between CI and CD are not always clear. Taking a common Jenkins-based practice as an example, our typical path is:

  1. Create a Jenkins project, set up a Pipeline (which includes tasks like code pulling, building, unit testing, etc.), and set trigger conditions.
  2. When there are operations like merging code into the main branch of the specified code repository, execute the Pipeline and generate artifacts.

After generating artifacts, there are two common approaches:

  1. In the stage right after artifact generation, trigger an automatic deploy job that ships the generated artifact/image to the target servers according to a deploy script.
  2. Upload the generated artifact to an intermediate platform, where a person manually triggers the deployment task through the deployment platform.

In the process described above, companies with a mature setup will also run other auxiliary processes (such as CI checks on PRs/MRs, code review, etc.).
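To make the flow above concrete, here is a minimal Python sketch of what such a pipeline does behind the scenes: pull the code, run unit tests, build an image, and publish the artifact. The repository URL, image name, and commands are placeholders, and a real setup would express this as a Jenkins Pipeline (or the equivalent in another CI system) rather than a hand-rolled script.

```python
# Hypothetical sketch of the CI stages a Jenkins Pipeline would run:
# pull -> unit test -> build -> publish an artifact (here, a container image tag).
# The repo URL, image name, and commands are placeholders, not a real project's values.
import subprocess
import sys

REPO = "https://example.com/acme/shop.git"   # placeholder repository
IMAGE = "registry.example.com/acme/shop"     # placeholder image name

def run(cmd, cwd=None):
    print(f"$ {' '.join(cmd)}")
    subprocess.run(cmd, cwd=cwd, check=True)  # non-zero exit fails the pipeline

def pipeline(commit: str) -> str:
    run(["git", "clone", REPO, "workdir"])
    run(["git", "checkout", commit], cwd="workdir")
    run(["python", "-m", "pytest", "tests/"], cwd="workdir")   # unit tests
    tag = f"{IMAGE}:{commit[:8]}"
    run(["docker", "build", "-t", tag, "."], cwd="workdir")    # build the artifact
    run(["docker", "push", tag])                               # publish to a registry
    return tag  # handed to the CD stage: auto-deploy, or wait for a manual trigger

if __name__ == "__main__":
    print("artifact:", pipeline(sys.argv[1]))
```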

When it comes to deploying to the target platform, I have a further claim: most companies have not standardized the management of their online service processes. Here's a joke:

Q: How do you deploy online services? A: nohup, tmux, screen.

A standardized CI/CD process, together with standardized management of online service processes, brings several foreseeable benefits:

  1. Minimizing the risks associated with manual changes.
  2. Effectively completing the configuration of basic operational dependencies.
  3. Relying on mainstream open-source process managers such as systemd, supervisor, or pm2 to provide basic HA guarantees for processes (health checks, restarts, etc.); a toy version of such a supervision loop is sketched below.
  4. Laying the groundwork for subsequent service-oriented and containerized steps.
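As an illustration of what the "basic HA guarantees" in point 3 amount to, here is a toy Python supervision loop that restarts a process whenever it exits. It is only a sketch of the idea; in practice you would rely on systemd, supervisor, or pm2 rather than writing this yourself, and the supervised command is a placeholder.

```python
# Toy version of what systemd/supervisor/pm2 do for a service:
# start the process, wait on it, and restart it (with a small backoff) if it dies.
import subprocess
import time

CMD = ["python", "-m", "http.server", "8080"]  # placeholder long-running service

def supervise(cmd, backoff_seconds=3):
    while True:
        proc = subprocess.Popen(cmd)
        print(f"started pid={proc.pid}")
        ret = proc.wait()                        # block until the process exits
        print(f"exited with code {ret}, restarting in {backoff_seconds}s")
        time.sleep(backoff_seconds)              # crude restart backoff

if __name__ == "__main__":
    supervise(CMD)
```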

2. Over 90% of companies do not have the technical reserves to implement microservices.#

If, for the CI/CD practices discussed in explosive theory 1, the barrier is more institutional than technical, then for the next few explosive theories I would describe the root cause as a lack of technical reserves.

Let’s talk about explosive theory 2: Over 90% of companies do not have the technical reserves to implement microservices.

First, let's discuss the concept of microservices. Microservices have had different interpretations over the history of computing. In 2014, Martin Fowler and James Lewis formally proposed the concept of microservices in Microservices: a definition of this new architectural term. Here's a segment from Wikipedia:

Microservices are the small services that make up a single application; each runs in its own process and stays lightweight, is designed around a business capability, is deployed automatically, and communicates with other services over HTTP APIs. Services require only minimal centralized management (e.g., Docker) and can be implemented in different programming languages and with different databases.

Now, let’s try to describe the significant differences between microservices and traditional monolithic services in terms of development:

  1. Microservices have a smaller scope, focusing more on a specific function or a category of functions.
  2. Due to their smaller scope, the impact of changes or crashes is smaller compared to traditional monoliths.
  3. They are more friendly to teams with multiple languages and technology stacks.
  4. They align with the current internet demand for small, rapid iterations and fast-paced development.

Now we need to consider what technical reserves are needed to implement and practice microservices. I believe there are mainly two aspects: architecture and governance.

First, let's talk about architecture. I think the most troublesome issue for microservices is splitting them out of a traditional monolithic application (if microservices were adopted from the very beginning that is a different story, although it comes with its own issues).

As mentioned earlier, microservices have a smaller scope compared to traditional monolithic applications, focusing more on a specific function or a category of functions. The biggest challenge in implementing microservices is reasonably defining functional boundaries and splitting them.

If the split is unreasonable, it leads to coupling between services. For example, if I put user authentication inside the e-commerce service, the forum service ends up depending on an e-commerce service it does not otherwise need. If the split is too fine, you get the amusing phenomenon of a fairly small business ending up with over 100 service repositories (we call this situation "microservice refugees", lol).
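To make the coupling example concrete, here is a hedged sketch of the cleaner split: authentication lives in its own small service, and both the forum and e-commerce services verify tokens through it over HTTP, instead of the forum service pulling in the e-commerce codebase. The service name, URL, and endpoint below are hypothetical.

```python
# Hypothetical split: a standalone auth service that both the forum and
# e-commerce services call over HTTP, so neither needs the other's code.
import json
import urllib.request

AUTH_SERVICE = "http://auth.internal:8000"  # hypothetical internal endpoint

def verify_token(token: str) -> dict:
    """Ask the auth service who this token belongs to."""
    req = urllib.request.Request(
        f"{AUTH_SERVICE}/v1/verify",
        data=json.dumps({"token": token}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=2) as resp:
        return json.load(resp)  # e.g. {"user_id": 42, "valid": true}

# The forum service reuses the tiny client above; it never imports e-commerce code.
def forum_post(token: str, text: str):
    user = verify_token(token)
    print(f"user {user['user_id']} posted: {text}")
```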

We practice the microservices philosophy because, as our business and team scale grow, facing diverse demands and team members' technology stacks, the maintenance cost of traditional monolithic applications will be significant. We hope to introduce microservices to minimize maintenance costs and reduce risks. However, unreasonable splitting can lead to maintenance costs far exceeding those of continuing with a monolithic approach.

Another issue hindering the practice of microservices is governance. Let’s look at some problems we face after adopting microservices:

  1. Observability issues. As mentioned earlier, after moving to microservices each service has a smaller scope and focuses on one function or one category of functions, so completing a single business request may require a longer call chain. Generally speaking, longer chains carry greater risk. So when a service shows an anomaly (e.g., a sudden increase in response time), how do we pinpoint which service in the chain is at fault? (See the correlation-ID sketch below.)
  2. Framework and configuration consolidation. In microservices scenarios, we may choose to push basic capabilities (service registration, discovery, routing, etc.) down into internal frameworks, which means we have to maintain those frameworks while also converging configuration across services.
  3. The usual service governance issues (registration, discovery, circuit breaking), etc.
  4. The demand for a complete CI/CD mechanism becomes more urgent after adopting microservices. If the situation in explosive theory 1 exists, it will become an obstacle to practicing the microservices philosophy.
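For the observability problem in point 1, the usual first step is to propagate a correlation or trace ID through the whole call chain so a slow request can be attributed to a specific hop. Below is a minimal hand-rolled sketch using a made-up X-Request-ID header and a placeholder downstream URL; a real system would use OpenTelemetry or a comparable tracing stack rather than this.

```python
# Minimal correlation-ID propagation: every log line and downstream call carries
# the same X-Request-ID, so a slow span can be pinned to a specific service hop.
import time
import uuid
import urllib.request

HEADER = "X-Request-ID"  # placeholder; real systems use W3C traceparent or similar

def handle(incoming_headers: dict, downstream_url: str):
    request_id = incoming_headers.get(HEADER, str(uuid.uuid4()))  # reuse or mint an ID
    start = time.monotonic()

    req = urllib.request.Request(downstream_url, headers={HEADER: request_id})
    with urllib.request.urlopen(req, timeout=2) as resp:
        body = resp.read()

    elapsed_ms = (time.monotonic() - start) * 1000
    # With the ID in every service's logs, the whole chain can be joined together.
    print(f"request_id={request_id} downstream={downstream_url} rt_ms={elapsed_ms:.1f}")
    return body
```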

Indeed, both the open-source community (like Spring Cloud, Go-Micro, etc.) and the four major cloud vendors (AWS, Azure, Alibaba Cloud, GCP) are trying to provide out-of-the-box microservices solutions. However, in addition to not being able to effectively solve the aforementioned architectural issues, they also have their own problems:

  1. Whether relying on open-source community solutions or cloud vendor solutions, users need a certain level of technical literacy to locate issues within the framework in specific situations.
  2. Vendor lock-in. Currently, there is no universal open-source standard for out-of-the-box microservices solutions. Therefore, relying on a specific open-source community or cloud vendor's solution will lead to vendor lock-in issues.
  3. Both open-source community solutions and cloud vendor solutions have issues with multi-language compatibility (it seems everyone now prefers Java a bit more (Python has no rights.jpg)).

Thus, the core point of explosive theory 2 is: microservices are not a cost-free endeavor; on the contrary, they require significant technical reserves and human investment. So please do not think of microservices as a universal remedy. Use them as needed.

3. Over 90% of companies do not have the technical reserves to support containerization.#

A currently mainstream viewpoint is to use containers wherever possible. To be honest, this idea is not unreasonable. To evaluate it, we need to look at what containers actually change for us.

Containers undoubtedly bring us many benefits:

  1. They make real consistency between development and production environments achievable. In other words, when a developer says, "This service has no issues on my local machine," the statement finally means something.
  2. They make deploying services more convenient, whether for distribution or deployment.
  3. They can achieve a certain level of resource isolation and allocation.

So, can we use containers without thinking? No, we need to review some potential drawbacks we may face after containerization:

  1. Container security issues. The most mainstream container implementation (Docker, for the sake of argument) is fundamentally based on cgroups and namespaces for resource and process isolation, so its security is a real concern; after all, Docker has privilege-escalation and container-escape vulnerabilities every year. This means we need a systematic mechanism governing how containers are used, so that the related escalation points stay within a controllable range. The other aspect is image security. Everyone codes with the help of Baidu/CSDN/Google/Stack Overflow, so it is inevitable that when we hit a problem we search and copy a Dockerfile directly. That carries real risk, because nobody knows what has been added to the base image.
  2. Container networking issues. After starting several images, how do we handle network communication between containers? In a production environment, there are certainly more than one machine, so how do we ensure stable networking while allowing inter-container communication across hosts?
  3. Container scheduling and operation issues. When a machine is under high load, how do we schedule some containers from that machine to other machines? How do we check if a container is alive? If a container crashes, how do we restart it?
  4. Specific details about containers, such as how to build and package images? How to upload them? (This goes back to explosive theory 1) And how to troubleshoot some corner case issues?
  5. For specific large-size images (like the official CUDA images commonly used by machine learning practitioners, which package large amounts of data), how do we quickly download and release them?

There might be a viewpoint here: no worries, we can just use Kubernetes, and many of these problems will be solved! Well, let’s discuss this issue further.

First, I will ignore the scenario of building a self-hosted Kubernetes cluster, since that is not something the average team can handle. Let's look instead at using a public cloud. Taking Alibaba Cloud as an example, open the cluster-creation page and look at the options you are asked to fill in.

Now, let’s ask some questions:

  1. What is VPC?
  2. What are the differences between Kubernetes 1.16.9 and 1.14.8?
  3. What are Docker 19.03.5 and Alibaba Cloud Security Sandbox 1.1.0, and what are the differences?
  4. What is a private network?
  5. What is a virtual switch?
  6. What is a network plugin? What are Flannel and Terway? What are the differences? When you flip through the documentation, it tells you that Terway is Alibaba Cloud's modified CNI plugin based on Calico. So, what is a CNI plugin? What is Calico?
  7. What is a Pod CIDR and how do you set it? (See the CIDR-planning sketch below.)
  8. What is a Service CIDR and how do you set it?
  9. What is SNAT and how to set it?
  10. How to configure security groups?
  11. What is Kube-Proxy? What are the differences between iptables and IPVS? How to choose?
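Questions 7 and 8 alone involve some concrete arithmetic. Here is a small Python sketch of the kind of planning they imply, using made-up example CIDRs (none of these values are recommendations): how many nodes and pods a /16 Pod CIDR supports when each node gets a /24, and whether the Service CIDR overlaps the VPC or Pod ranges.

```python
# Made-up example of Pod/Service CIDR planning, not a recommendation.
# With a /16 Pod CIDR and one /24 per node, how many nodes and pods fit,
# and does the Service CIDR overlap the VPC or Pod ranges?
import ipaddress

vpc_cidr = ipaddress.ip_network("10.0.0.0/16")         # hypothetical VPC (private network)
pod_cidr = ipaddress.ip_network("172.20.0.0/16")       # hypothetical Pod CIDR
service_cidr = ipaddress.ip_network("192.168.0.0/20")  # hypothetical Service CIDR

node_subnets = list(pod_cidr.subnets(new_prefix=24))   # one /24 per node
print("max nodes:", len(node_subnets))                 # 256
print("usable pod IPs per node:", node_subnets[0].num_addresses - 2)  # 254

# The three ranges must not overlap, or routing inside the cluster breaks.
for a, b in [(vpc_cidr, pod_cidr), (vpc_cidr, service_cidr), (pod_cidr, service_cidr)]:
    assert not a.overlaps(b), f"{a} overlaps {b}"
print("no overlaps: OK")
```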

You can see that the above questions cover several aspects:

  1. In-depth understanding of Kubernetes itself (CNI, runtime, kube-proxy, etc.)
  2. Reasonable network planning
  3. Familiarity with specific functions of cloud vendors

In my view, any of these three aspects requires a significant level of technical reserves and understanding of the business (broadly defined technical reserves) for a technical team.

Of course, let me ramble a bit more. In reality, even managed Kubernetes costs real money to run (a bit off-topic, but I'll continue):

  1. You need a container registry, right? Not expensive, the basic version in China is 780 per month.
  2. Your services within the cluster need to be exposed, right? Okay, buy the lowest specification SLB, simplified version, 200 per month.
  3. Alright, you need to spend money on logs every month, right? Assuming you have 20GB of logs per month, not too much? Okay, 39.1.
  4. Do you need monitoring for your cluster? Alright, buy it, let’s say 500,000 log entries reported daily? Okay, not expensive, 975 per month.

Let's calculate for one cluster: (780 + 200 + 39.1 + 975) × 12 = 23,929.2 per year, not counting the basic ENI, ECS, and other costs. Nice!
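For transparency, the annualized arithmetic for the items listed above:

```python
# Monthly costs listed above (container registry, SLB, logs, monitoring), annualized.
monthly = [780, 200, 39.1, 975]
print(sum(monthly))        # 1994.1 per month
print(sum(monthly) * 12)   # 23929.2 per year, before ENI/ECS and other base costs
```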

Moreover, Kubernetes has many esoteric issues that require the technical team to have sufficient technical reserves to troubleshoot (I recall hitting things like the CNI process crashing and not restarting, kernel cgroup leaks on specific versions, ingress OOMs, etc.). Browse the Kubernetes issue tracker and you will see what I mean (too much to say, it's all tears).

Conclusion#

I know this article will spark a lot of controversy. Still, the point I want to make is that the set of technologies introduced in the cloud-native era (which is really more an extension of traditional technologies) is not free. For companies with sufficient scale and real pain points, those costs can pay for themselves by driving the business forward; for most small and medium-sized enterprises, this tool set may improve the business very little, or even have a negative effect.

I hope that when we technical people make technical decisions, we evaluate our team's technical reserves and the benefit to the business before introducing a technology or concept, rather than adopting it just because it seems advanced, looks impressive, or would polish a résumé.

Finally, let me conclude this article with a quote I shared before:

A company pursuing technological advancement for the sake of technology is doomed.
