High availability in the Cloud: Legacy cloud application design

Aug 23, 2021

This post continues the previous blog post, High availability in the Cloud: Pan-Net Cloud concepts. Here we still focus on applications deployed on VMs (legacy applications).

In this context, we talk about the physical architecture of the application and leave out the logical architecture, as it does not map directly to infrastructure objects. When designing the physical architecture, architects incorporate fault tolerance and resiliency right from the beginning.

Engineers try to find the best compromise between the required availability and the number of resources consumed. Engineers must also consider the environment where the application is to be deployed. Therefore, system components are designed in a fault-tolerant mode. These design solutions integrate with the functionality of the Cloud Provider, such as data storage or network load balancers. Using the features of a cloud provider can reduce the time to develop your components.
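To make the trade-off between availability and resource consumption more tangible, here is a minimal sketch of the usual redundancy arithmetic. It assumes that a component is available as long as at least one replica is up and that replicas fail independently, which is only an approximation even across availability zones. The function name and the 99% figure are illustrative assumptions, not Pan-Net values.

```python
def combined_availability(single_instance_availability: float, replicas: int) -> float:
    """Availability of a component that works while at least one replica is up.

    Assumption: replica failures are independent of each other.
    """
    if not 0.0 <= single_instance_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("at least one replica is required")
    # The component is down only when every replica is down at the same time.
    return 1.0 - (1.0 - single_instance_availability) ** replicas


if __name__ == "__main__":
    # Illustrative figure: each replica is assumed to be up 99% of the time.
    for replicas in (1, 2, 3):
        print(replicas, f"{combined_availability(0.99, replicas):.6f}")
```

Each extra replica costs the same amount of resources but removes a smaller slice of downtime, which is why the "best compromise" depends on the availability the application really needs.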

As mentioned in the previous blog post, Pan-Net follows the 3-AZ (three Availability Zones) concept, which assumes that the number of Site Availability Zones in each data center is a multiple of three. This concept is the main thing to consider during application design to achieve high availability in the Pan-Net cloud. When creating an instance (Virtual Machine), the user can specify the availability zone in which it should be created. It may even be possible to choose a specific compute node within the availability zone where the instance should be started. It is highly recommended not to run just one VM instance of an application but to create at least a 3-node cluster whose nodes are placed in different availability zones. This setup enables an application to survive a failure, an upgrade, or another infrastructure outage because it complies with the cloud architecture.
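The placement idea above can be sketched in a few lines of Python. This is only an illustration of the round-robin spreading, not a Pan-Net or OpenStack API: the node names and zone names (`az1`, `az2`, `az3`) are hypothetical, and in a real deployment the chosen zone would be passed along when each instance is created.

```python
from itertools import cycle
from typing import Dict, List


def spread_across_zones(node_names: List[str], zones: List[str]) -> Dict[str, str]:
    """Assign cluster nodes to availability zones in round-robin order."""
    if not zones:
        raise ValueError("at least one availability zone is required")
    return dict(zip(node_names, cycle(zones)))


def survivors_after_zone_loss(placement: Dict[str, str], failed_zone: str) -> List[str]:
    """Return the nodes that keep running when one whole zone goes down."""
    return sorted(node for node, zone in placement.items() if zone != failed_zone)


if __name__ == "__main__":
    # Hypothetical 3-node cluster and zone names.
    placement = spread_across_zones(["db-1", "db-2", "db-3"], ["az1", "az2", "az3"])
    print(placement)
    print(survivors_after_zone_loss(placement, "az2"))
```

With three nodes in three zones, losing any single zone still leaves two nodes running, which is what lets a quorum-based cluster stay available.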

Below are details of the recommendations and requirements that should be met to achieve cloud-ready, highly available applications in Pan-Net Cloud:

How to assess app cloud readiness for Pan-Net Cloud, or what does a perfect application look like from a Pan-Net Onboarding engineer's point of view?

To accomplish this, we need to answer a few more questions, such as “which features or architectural elements should an app include specifically?” or “which characteristics should it have to be both successfully onboarded to the cloud and operated as a truly cloud-ready application?”

Regarding application design, we should first consider the key concept of a fault-tolerant architecture: it should have no single point of failure. This means that each logical component of the application is clustered.
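A simple way to reason about this rule is to list every logical component together with its number of running instances and flag anything that is not clustered. The sketch below does exactly that; the component names and counts are made up for illustration.

```python
from typing import Dict, List


def single_points_of_failure(instance_counts: Dict[str, int]) -> List[str]:
    """Return the components that run fewer than two instances."""
    return sorted(name for name, count in instance_counts.items() if count < 2)


if __name__ == "__main__":
    # Hypothetical inventory of an application's logical components.
    inventory = {"frontend": 3, "database": 3, "scheduler": 1}
    print(single_points_of_failure(inventory))
```

An empty result does not prove the design is fault tolerant, since instances may still share a zone or a compute node, but a non-empty one is a clear warning sign.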

In addition, the application must be designed to scale horizontally, by increasing and decreasing the number of component instances, as opposed to vertical scaling, which changes the amount of resources consumed by a single component instance. Horizontal scaling lets you scale the application quickly without adding risk to its availability.
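As a rough illustration of horizontal scaling, the sketch below derives the number of instances from the current load instead of resizing a single VM. The load unit (requests per second), the per-instance capacity, and the upper bound are assumptions chosen for the example; the lower bound of three mirrors the 3-node recommendation above.

```python
import math


def desired_instances(current_load: float,
                      capacity_per_instance: float,
                      min_instances: int = 3,
                      max_instances: int = 12) -> int:
    """Number of instances needed for the load, clamped to the allowed range."""
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    needed = math.ceil(current_load / capacity_per_instance)
    return max(min_instances, min(max_instances, needed))


if __name__ == "__main__":
    # Hypothetical numbers: 950 requests/s, each instance handles 200 requests/s.
    print(desired_instances(950, 200))
```

Because instances are added or removed one at a time, the instances that are already running keep serving traffic while the cluster grows or shrinks.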

The next requirement is that storage is not shared between application components; a replication mechanism should be used instead, for example. It is also preferred that the application is HW agnostic, including processor types, suppliers, and so on. Finally, as a rule of thumb, it is always better to align the application architecture with the cloud provider HA concepts described in the next section.

How to achieve HA in the cloud

Plan for unplanned or unannounced cloud maintenance, so that such maintenance does not affect the availability of the app or service.

Again, we always need to keep in mind that applications shouldn’t have a single point of failure. The most important thing is that the components are deployed across all AZs, as we have already mentioned a few times. The application design and deployment scheme should therefore assume that any application component can be rebuilt without affecting its functionality (which implies that the application components are stateless).
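Whether every component really is present in every AZ can be checked mechanically. The following sketch takes a hypothetical deployment description (component name mapped to the zones of its instances) and reports which zones each component is missing from; the names used are illustrative only.

```python
from typing import Dict, List


def missing_zone_coverage(deployment: Dict[str, List[str]],
                          zones: List[str]) -> Dict[str, List[str]]:
    """Map each component to the availability zones where it has no instance."""
    gaps = {}
    for component, instance_zones in deployment.items():
        missing = sorted(set(zones) - set(instance_zones))
        if missing:
            gaps[component] = missing
    return gaps


if __name__ == "__main__":
    # Hypothetical deployment: "worker" has two instances in az1 and none in az3.
    deployment = {
        "api": ["az1", "az2", "az3"],
        "worker": ["az1", "az1", "az2"],
    }
    print(missing_zone_coverage(deployment, ["az1", "az2", "az3"]))
```

A check like this fits naturally into a deployment pipeline, where a non-empty result can block the rollout.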

Engineers should also keep in mind that application components should have affinity or anti-affinity rules. Application component scheduling should be done so that the app functionality is not affected when one CN (compute node) is down.
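The effect of anti-affinity can be illustrated with a small simulation. The sketch below is not an OpenStack scheduler API; it only inspects a hypothetical placement (component name mapped to the compute nodes hosting its instances), reports where instances of the same component share a node, and shows which components would disappear if a given node went down.

```python
from collections import Counter
from typing import Dict, List


def anti_affinity_violations(placement: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each component to the compute nodes hosting more than one of its instances."""
    violations = {}
    for component, nodes in placement.items():
        shared = sorted(node for node, count in Counter(nodes).items() if count > 1)
        if shared:
            violations[component] = shared
    return violations


def components_lost_if_node_fails(placement: Dict[str, List[str]],
                                  failed_node: str) -> List[str]:
    """Return the components whose instances all run on the failed compute node."""
    return sorted(
        component
        for component, nodes in placement.items()
        if all(node == failed_node for node in nodes)
    )


if __name__ == "__main__":
    # Hypothetical placement: both "cache" instances landed on the same node.
    placement = {
        "api": ["cn-01", "cn-02", "cn-03"],
        "cache": ["cn-02", "cn-02"],
    }
    print(anti_affinity_violations(placement))
    print(components_lost_if_node_fails(placement, "cn-02"))
```

In this example the "cache" component violates anti-affinity and is lost completely when `cn-02` fails, while "api" keeps two of its three instances.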

How should data and databases (intrinsically stateful systems) be managed? Replication of data is highly preferred, so all data should be replicated wherever applicable. In addition, all app-related data should be stored on a separate, extra Block Storage volume, such as a Cinder volume. For example, don’t store data on the OS root filesystem (/), so that application nodes can be safely rebuilt and started later with the correct data, thus avoiding migration.
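A quick sanity check for the "no data on the root filesystem" rule is to compare the device that backs the data directory with the device that backs `/`. The sketch below does this with the standard library on a POSIX system. The data directory path is a hypothetical example, and the check is a heuristic: some setups (bind mounts, certain filesystem subvolumes) may need a closer look.

```python
import os


def is_on_root_filesystem(path: str) -> bool:
    """Return True if the path lives on the same device as the OS root (/)."""
    return os.stat(path).st_dev == os.stat("/").st_dev


if __name__ == "__main__":
    data_dir = "/var/lib/myapp"  # hypothetical application data directory
    if not os.path.exists(data_dir):
        print(f"{data_dir} does not exist on this machine")
    elif is_on_root_filesystem(data_dir):
        print(f"WARNING: {data_dir} is on the root filesystem and would be lost on rebuild")
    else:
        print(f"{data_dir} is on a separate volume")
```

If the check warns, the data directory is not backed by its own attached volume, so rebuilding the instance would wipe the application data together with the OS disk.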

Real-world Pan-Net experience

All of the above is nice and neat; however, in real-world scenarios, vendors and software companies require some features that are fundamentally not in line with the “generic cloud environment” strategy. Such features include single-root input/output virtualization (SR-IOV), CPU pinning, or applications that have a single point of failure by design. For example, only one instance of the application may be allowed to run, which makes horizontal scaling impossible.

On the other hand, some features are manageable but require additional effort during the onboarding and automation process:

  • Application has vendor-specific requirements: each vendor has its processes that can be very strict and specific.

  • Strict maintenance windows: this imposes certain restrictions on the application operations and cloud operations team.

  • Collocation: specific HW or server types need to be integrated into the OpenStack tenant of a customer project.

  • Lawful interception: this requirement is not unusual, and it may demand specific collocation sites or changes to the cloud architecture.

Pan-Net commonly deals with such applications because of the specifics of its market. As we cannot use fault-preventing instruments during cloud operation tasks, the application design itself has to provide fault tolerance.

Based on all that, redundancy and resilience are nowadays built at another layer, above the Virtual Machine layer: Container Orchestration (e.g., Pan-Net CaaS).

This layer lives on top of OpenStack and meets all availability requirements (such as the 3-AZ availability concept). The next article will describe resilient containerized applications and Pan-Net CaaS in more detail.

Laco Miklovitz
DevOps & SRE Engineer