High availability is one of the most essential and desirable characteristics of any cloud service. Pan-Net's activities focus on the development, delivery, and operation of the cloud and cloud applications for external and internal customers. High availability and fault tolerance are therefore relevant to every area of our activity, from internal development to large-scale vendor software solutions deployed on Pan-Net infrastructure.
How consistent an application's technical design is with the high availability concepts of the infrastructure directly affects the availability of the final system. That is why we find it essential to talk about it.
We are opening a series of articles devoted to fault tolerance in the context of Pan-Net Cloud. In this first article, we will pay more attention to the most critical levels and roles involved in building a fault-tolerant cloud solution, talk about high availability and fault-tolerance concepts in Pan-Net, and demonstrate these concepts in a live production environment.
Cloud application ecosystem
Fault tolerance is the property that enables a system to continue operating correctly in the event of the failure of one or more of its components.
This definition sounds clear and even obvious. In practice, though, we first want to understand what the system consists of and how its components affect its fault-tolerance characteristics.
For this reason, the fault tolerance of a cloud application must not be considered separately from the environment it runs in. Building a fault-tolerant system and ensuring continuity of services requires an integrated approach that covers all the relevant layers of the system.
We can highlight the following key layers of the system for a cloud application:
Orchestration layer
Virtual Infrastructure layer
Physical Infrastructure layer
The orchestration layer brings together the products that manage the lifecycles of applications. For VM-based architectures, this layer comprises a genuinely diverse range of products. For a containerized environment, the picture is more clearly delineated: the container orchestrator represents this layer. Container orchestration has a strong presence in hyperscalers' product portfolios, with products such as Amazon Elastic Container Service (Amazon ECS) and Azure Container Instances, which support Docker containers, and the Kubernetes-based Amazon Elastic Kubernetes Service (Amazon EKS) and Azure Kubernetes Service (AKS). Kubernetes and its components also power the Pan-Net CaaS solution.
The Virtual Infrastructure layer keeps virtual infrastructure objects running by providing virtualization functionality and virtual resource management. Pan-Net Cloud is based on the OpenStack platform, which is responsible for allocating the resources of the physical infrastructure layer (such as processor time, RAM, data storage units, and the network) across consumers.
As a cloudification partner, we provide our customers with IaaS and CaaS Products and related services, spanning the Physical, Virtual Infrastructure, and Container Orchestration layers.
Pan-Net follows certain architectural and operational concepts to provide application-level resiliency and enable our customers to sleep well at night. This is what we cover in the following section.
Pan-Net Cloud architecture and operations concepts
The Pan-Net infrastructure layout follows the concepts of the resource management layer. Today we have 17 data centers located in 9 European countries. In Pan-Net terminology, this translates to 9 Regions and 17 Region Availability Zones (RAZ) with independent and interconnected cloud instances. Each region has its own OpenStack deployment, including API endpoints, networks, and compute resources. Inside a single RAZ, resource consumers will encounter the concept of a Site Availability Zone (SAZ).
Each SAZ is a group of hosts made up of the logical components needed to run a cloud environment: compute, storage, and networking nodes, network devices, and cloud control plane nodes. To provide resiliency for cloud components and hosted applications, Pan-Net follows the 3-AZ (three Availability Zones) concept, which assumes that the number of SAZs in each data center is a multiple of three.
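The 3-AZ rule can be expressed as a simple check on a data center model. The following is a minimal sketch; the class and method names (SiteAvailabilityZone, DataCenter, follows_3az_concept) are hypothetical and chosen only for illustration, not taken from any Pan-Net tooling.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SiteAvailabilityZone:
    """Hypothetical model of a SAZ: a named group of hosts."""
    name: str
    hosts: List[str] = field(default_factory=list)


@dataclass
class DataCenter:
    """Hypothetical model of one data center (one RAZ) and its SAZs."""
    name: str
    sazs: List[SiteAvailabilityZone] = field(default_factory=list)

    def follows_3az_concept(self) -> bool:
        # The 3-AZ concept assumes a non-zero SAZ count that is a multiple of three.
        return len(self.sazs) > 0 and len(self.sazs) % 3 == 0


# Example: a data center with three SAZs satisfies the rule.
dc = DataCenter("dc-a", [SiteAvailabilityZone(f"saz-{i}") for i in (1, 2, 3)])
print(dc.follows_3az_concept())
```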
Pan-Net Cloud OpenStack management plane components run on physical servers in each SAZ. Each component runs in a separate ldx container. At least three instances form a cluster to provide load balancing and high availability. This configuration is relevant for all Cloud components, like Nova, Neutron, Cinder, Glance, etc.
A Ceph cluster, which stores data in triplicate with copies in all SAZs, backs the Nova Compute and Cinder Block Storage services.
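To illustrate why keeping a copy in every SAZ matters, here is a simplified sketch of SAZ-aware replica placement. It is not Ceph's actual CRUSH algorithm; the function names, host names, and object ID are assumptions made up for the example. It only shows that with one copy per SAZ, the loss of an entire SAZ still leaves two copies.

```python
import hashlib
from typing import Dict, List


def place_replicas(object_id: str, hosts_by_saz: Dict[str, List[str]]) -> Dict[str, str]:
    """Pick one storage host per SAZ, so each SAZ holds exactly one copy."""
    placement = {}
    for saz, hosts in sorted(hosts_by_saz.items()):
        if not hosts:
            raise ValueError(f"SAZ {saz} has no storage hosts")
        # Deterministic hash-based choice: the same object always maps
        # to the same host inside a given SAZ.
        digest = hashlib.sha256(f"{object_id}:{saz}".encode()).digest()
        placement[saz] = hosts[int.from_bytes(digest[:8], "big") % len(hosts)]
    return placement


def surviving_copies(placement: Dict[str, str], failed_saz: str) -> int:
    """Count the copies that remain reachable if one whole SAZ fails."""
    return sum(1 for saz in placement if saz != failed_saz)


# Hypothetical storage hosts in three SAZs.
hosts = {"saz-1": ["s1a", "s1b"], "saz-2": ["s2a"], "saz-3": ["s3a", "s3b"]}
placement = place_replicas("volume-42", hosts)
print(placement, surviving_copies(placement, "saz-2"))
```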
Each SAZ contains a pair of top-of-rack (ToR) switches in virtual chassis mode, connected to the compute nodes and, on the uplink, to the redundant spine switches, providing a reliable network structure.
Thus, each SAZ includes all the components needed to supply an IaaS service. These components are connected to the other SAZs via redundant network connections and joined into clusters. When a server becomes unavailable due to a failure or planned maintenance, the cluster manager and the load balancer detect this and route traffic to the remaining servers. After the server has been fixed or the scheduled maintenance is completed, the node rejoins the cluster, and the load balancer again distributes the traffic among all servers.
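The failover behavior described above can be sketched as a round-robin load balancer that skips servers marked as unavailable. This is a toy model under stated assumptions: the LoadBalancer class and the server names are hypothetical, and real health detection (done by the cluster manager and load balancer in production) is reduced to explicit mark_down and mark_up calls.

```python
from typing import Iterable, List, Set


class LoadBalancer:
    """Toy round-robin balancer that routes only to healthy servers."""

    def __init__(self, servers: Iterable[str]) -> None:
        self._servers: List[str] = list(servers)
        self._healthy: Set[str] = set(self._servers)
        self._next = 0

    def mark_down(self, server: str) -> None:
        # Called when a failure or planned maintenance is detected.
        self._healthy.discard(server)

    def mark_up(self, server: str) -> None:
        # Called when the node rejoins the cluster.
        if server in self._servers:
            self._healthy.add(server)

    def route(self) -> str:
        if not self._healthy:
            raise RuntimeError("no healthy servers available")
        # Walk the ring until a healthy server is found.
        for _ in range(len(self._servers)):
            server = self._servers[self._next % len(self._servers)]
            self._next += 1
            if server in self._healthy:
                return server
        raise RuntimeError("no healthy servers available")


lb = LoadBalancer(["ctl-1", "ctl-2", "ctl-3"])
lb.mark_down("ctl-2")                  # ctl-2 goes into maintenance
print([lb.route() for _ in range(4)])  # traffic goes only to ctl-1 and ctl-3
lb.mark_up("ctl-2")                    # ctl-2 rejoins the cluster
print([lb.route() for _ in range(3)])
```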
In the following video, captured in a real-life production environment, one can see how Pan-Net can upgrade and reboot one of the OpenStack controller physical servers without interruption in API calls. In this example, the server hosts a Keystone Identity Manager node.
The cloud maintenance plan is designed around the 3-SAZ concept. First, to avoid any service interruption, Pan-Net uses a method called live migration. Live migration provides a way to move a virtual machine from one compute node to another while the virtual machine keeps working and providing services without noticeable interruption. When we start the migration, the OpenStack platform prepares a new virtual machine on a target compute node and transfers the workload. While the migration is ongoing, the virtual machine is fully available and usable. When the transfer completes successfully, OpenStack removes the original virtual machine. A showcase of how it happens in Pan-Net can be seen here.
Maintenance is performed on a cloud node only after the compatible application components have been migrated to nodes that are not under maintenance. Nodes within one SAZ are maintained sequentially: the pipeline moves on to the next node only after the previous one has completed successfully (rolling upgrade and rolling restart). Applications that follow this pattern can achieve high availability rates.
It is worth noting that live migration can be applied only to virtual machines without specific requirements. When a VM consumes dedicated resources (for example, through CPU pinning or SR-IOV), migration becomes almost impossible without service disruption (applications running in these VMs are considered “legacy” or “not cloud-ready”). In this case, the desired availability indicators must be maintained through the right approach to application design.
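The rolling pattern above can be sketched as a loop that drains one node, maintains it, and stops the pipeline on the first failure. This is an in-memory simulation under stated assumptions: rolling_maintenance, the node names, and the maintain callback are hypothetical, and moving a VM between lists stands in for a real live migration.

```python
from typing import Callable, Dict, List


def rolling_maintenance(
    nodes: List[str],
    vms_by_node: Dict[str, List[str]],
    maintain: Callable[[str], bool],
) -> List[str]:
    """Maintain nodes one at a time; return the nodes completed successfully."""
    done: List[str] = []
    for node in nodes:
        targets = [n for n in nodes if n != node]
        if not targets:
            raise ValueError("need at least one other node to migrate VMs to")
        # Drain the node: move every VM to the currently least-loaded other node.
        for vm in list(vms_by_node[node]):
            target = min(targets, key=lambda n: len(vms_by_node[n]))
            vms_by_node[node].remove(vm)
            vms_by_node[target].append(vm)
        # Proceed to the next node only if this one was maintained successfully.
        if not maintain(node):
            break
        done.append(node)
    return done


vms = {"c1": ["vm-a", "vm-b"], "c2": ["vm-c"], "c3": []}
completed = rolling_maintenance(["c1", "c2", "c3"], vms, maintain=lambda node: True)
print(completed, vms)
```

Stopping at the first failed node keeps the remaining nodes untouched, so at most one node per SAZ is out of service at any time.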
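A maintenance pipeline could flag such VMs before attempting a migration. The sketch below is an illustration only, following the rule stated above: it assumes that CPU pinning is requested through the Nova flavor extra spec hw:cpu_policy=dedicated and that SR-IOV ports use the Neutron vnic types direct or direct-physical. The VmSpec class and the function name are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VmSpec:
    """Hypothetical, simplified view of a VM's flavor and port settings."""
    name: str
    flavor_extra_specs: Dict[str, str] = field(default_factory=dict)
    port_vnic_types: List[str] = field(default_factory=list)


def live_migration_blockers(vm: VmSpec) -> List[str]:
    """Return the dedicated-resource features that hinder live migration."""
    blockers = []
    # Assumption: pinned CPUs are requested via this flavor extra spec.
    if vm.flavor_extra_specs.get("hw:cpu_policy") == "dedicated":
        blockers.append("CPU pinning")
    # Assumption: SR-IOV ports are identified by these vnic types.
    if any(t in ("direct", "direct-physical") for t in vm.port_vnic_types):
        blockers.append("SR-IOV")
    return blockers


plain = VmSpec("web-1", port_vnic_types=["normal"])
pinned = VmSpec("dpdk-1", {"hw:cpu_policy": "dedicated"}, ["direct"])
print(live_migration_blockers(plain), live_migration_blockers(pinned))
```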
The next article will cover resilient cloud application design for the Pan-Net Cloud environment in more detail.