
VMware: How VMware IT Automated Network Failover Testing to Deliver 99.99% Uptime

by: VMware Director Shared Services Edward Lyons and VMware Senior Manager Shared Service Delivery Parmesh Karthik

In 2015, VMware’s IT Network Services began deploying a new global networking standard to more than 70 offices and data centers around the world, focusing on network resiliency. Redundant WAN, LAN, WLAN and security footprints allow network availability to remain unchanged during an outage, when automatic failover moves service from the active infrastructure to the standby infrastructure.

The challenge

The challenge was to develop a method to test whether the redundant infrastructure performs as expected when the primary service goes down.

Using business continuity planning (BCP) principles, VMware IT implemented a similar idea for network services and called it Network Verification Failover Test (NVFT).

Introduction to NVFT

Network failover is the ability to automatically and transparently fail over to a backup or secondary network service to maintain business continuity (BC).

To achieve redundancy during an abnormal failure of the active network infrastructure, a standby network infrastructure should always be ready to automatically resume service.

Scope of failover testing

Network failover testing is performed at all layers of the network service domain, such as:

  • Private WAN: Multiprotocol Label Switching (MPLS)

  • Public WAN Internet

  • SD WAN

  • Firewall security

  • LAN core and LAN DMZ

  • Wireless network

  • Network device power supplies

How Failover Works

Active-active and active-passive are the most common configurations for high availability (HA). Both improve reliability, but each achieves failover in a different way.
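The difference between the two configurations can be illustrated with a toy load model. This is a minimal sketch under general HA assumptions, not a description of VMware's setup; the function name `load_after_failure` and the traffic figures are hypothetical.

```python
def load_after_failure(mode: str, total_load: float, nodes: int = 2) -> dict:
    """Per-node load before and after losing one node (illustrative model only)."""
    if nodes < 2:
        raise ValueError("high availability needs at least two nodes")
    if mode == "active-active":
        # Every node carries traffic; the survivors absorb the failed node's share.
        before = total_load / nodes
        after = total_load / (nodes - 1)
    elif mode == "active-passive":
        # One node carries all traffic; an idle standby takes over the full load.
        before = total_load
        after = total_load
    else:
        raise ValueError(f"unknown mode: {mode}")
    return {"per_active_node_before": before, "per_active_node_after": after}


# Hypothetical example: 100 units of traffic across a two-node pair.
print(load_after_failure("active-active", 100.0))
print(load_after_failure("active-passive", 100.0))
```

In both cases the surviving node must be able to carry the full load on its own, which is the property a failover test is meant to confirm.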

The failover test determines whether the standby (passive) network service can carry the service during a critical outage. It does this by shutting down the primary network infrastructure and validating the performance of the standby service, without impacting the network service.

Failover test requirements

The original NVFT program was effective for quality control and implementation approval, but it was very time consuming and resource intensive. The original test windows ranged from six to eight hours and required the following resources:

  • PMO: release manager

  • IT Analytics: management of monitoring and alerting tools

  • NetOps: WAN engineer

  • NetOps: security engineer

  • NetOps: LAN engineer

  • Net Services: systems administrator

  • CET: CET resource for peer-to-peer experience testing

Total resource hours: 56 (seven resources over an eight-hour test window)

The goal of testing every site each year proved difficult to meet because of the resource requirements, the long downtime, and the restriction of NVFT runs to weekends.

Regardless of the time and resource challenges, the NVFT itself has proven to be a key tool in identifying configuration errors and also in standardizing the network configuration following planned changes.

Stages of NVFT

  • Pre NVFT

  • Basic NVFT

  • Post NVFT

Pre NVFT

The pre-NVFT stage is also referred to as NVFT readiness control. Most of the physical connectivity of the network is examined at this stage to avoid failures during the main activity.

The pre-NVFT stage also audits the complete network service, including the hardware model, the running firmware, and the network configuration, which helps achieve network standardization.
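An audit of this kind can be thought of as comparing each device against a site standard. The following is a minimal sketch only; the `DeviceRecord` type, its field names, and all sample values are hypothetical and do not come from VMware's tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DeviceRecord:
    # Hypothetical inventory fields, chosen to mirror the audited items.
    name: str
    hardware_model: str
    firmware: str
    config_template: str


def audit_device(device: DeviceRecord, standard: DeviceRecord) -> List[str]:
    """Return one finding per field where the device deviates from the standard."""
    findings = []
    for field_name in ("hardware_model", "firmware", "config_template"):
        actual = getattr(device, field_name)
        expected = getattr(standard, field_name)
        if actual != expected:
            findings.append(
                f"{device.name}: {field_name} is {actual!r}, expected {expected!r}"
            )
    return findings


# Made-up sample data for illustration.
standard = DeviceRecord("standard", "model-A", "1.2.3", "branch-v2")
switch = DeviceRecord("core-sw-01", "model-A", "1.1.0", "branch-v2")
for finding in audit_device(switch, standard):
    print(finding)
```

An empty result would mean the device matches the standard for the audited fields; any finding would be resolved before the main activity.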

Basic NVFT

After the pre-NVFT activity completes successfully, the actual failover is performed on all network layers: the primary network services are failed over to check the availability and performance of the secondary network services, and vice versa. During this activity, observations are captured and noted for discussion.
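The failover and failback sequence across layers can be sketched as a simple loop. This is an illustrative outline, not VMware's automation; `run_failover_tests` and the `service_is_up` probe are hypothetical names, and a real probe would shut down and restore actual infrastructure.

```python
from typing import Callable, Dict, List

# Layer names taken from the scope list in this article.
LAYERS = [
    "Private WAN (MPLS)",
    "Public WAN Internet",
    "SD WAN",
    "Firewall security",
    "LAN core and LAN DMZ",
    "Wireless network",
    "Network device power supplies",
]


def run_failover_tests(
    layers: List[str], service_is_up: Callable[[str, str], bool]
) -> List[Dict[str, str]]:
    """Fail each layer over to its secondary path and back, recording observations.

    service_is_up(layer, active_path) is a caller-supplied, hypothetical probe
    that reports whether the service works while active_path carries traffic.
    """
    observations = []
    for layer in layers:
        # Failover: the secondary carries traffic. Failback: the primary again.
        for step, active_path in (("failover", "secondary"), ("failback", "primary")):
            if not service_is_up(layer, active_path):
                observations.append(
                    {
                        "layer": layer,
                        "step": step,
                        "note": f"service unavailable on {active_path} path",
                    }
                )
    return observations


# Simulated probe: pretend the SD WAN secondary path fails.
def fake_probe(layer: str, active_path: str) -> bool:
    return not (layer == "SD WAN" and active_path == "secondary")


for obs in run_failover_tests(LAYERS, fake_probe):
    print(obs)
```

Collecting observations as data rather than stopping at the first problem keeps the whole run available for the discussion that follows.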

Post NVFT

This stage mainly focuses on documenting the observations made during the main activity and recording them in the “risk register” for future reference.

The risk register is a repository for all identified risks and includes additional information about each risk such as the nature of the risk, the owner, the baseline and the mitigation measures.
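A risk register with the attributes named above could be modeled as a small data structure. This is a sketch under assumptions: the class and method names are hypothetical, and the sample entry is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Risk:
    nature: str      # what the risk is
    owner: str       # who is responsible for it
    baseline: str    # the expected or reference state
    mitigation: str  # the measures planned or taken


@dataclass
class RiskRegister:
    risks: List[Risk] = field(default_factory=list)

    def add(self, risk: Risk) -> None:
        self.risks.append(risk)

    def by_owner(self, owner: str) -> List[Risk]:
        """Return all risks assigned to the given owner."""
        return [r for r in self.risks if r.owner == owner]


# Invented example entry.
register = RiskRegister()
register.add(
    Risk(
        nature="Standby firewall did not take over during failover",
        owner="NetOps security",
        baseline="Standby takes over automatically",
        mitigation="Correct HA configuration and retest",
    )
)
print(len(register.by_owner("NetOps security")))
```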

Evolution of NVFT

NVFT did not mature overnight; it took time to become a more productive and reliable test system.

Through continual improvement, with new ideas and methods, failover testing went through several phases.

NVFT Phase I

This was the initial period of failover testing, during which only core network services were tested. It was a completely manual process that was very resource intensive, requiring up to eight engineers on call, with a downtime window of almost eight hours depending on the size of the office (small, medium or large). Despite these challenges, we completed NVFT at an average of 27 locations per year.

NVFT Phase II

This phase saw a big step forward in failover testing by incorporating new ideas. As the network topology grew, all network layers became part of the failover testing. The whole activity became fully automated, so the resources required were reduced to just two engineers and the total downtime was reduced to 120 minutes. This led to a large increase, to around 130 NVFT events per year.

Results

  • 71% reduction in resources

  • 75% reduction in downtime window

  • Fourfold increase in the number of NVFT events per year

  • Running NVFT during the working week has become possible

  • Additional testing has been included so that failover testing of the entire network infrastructure is possible.
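The headline percentages can be checked against the figures given earlier in this article. The short calculation below is a sketch; it assumes the 71% figure compares the seven roles listed under failover test requirements with the two engineers needed in Phase II, which the article does not state explicitly.

```python
# Figures quoted in this article.
phase1_resources, phase2_resources = 7, 2            # assumption: seven listed roles vs. two engineers
phase1_window_min, phase2_window_min = 8 * 60, 120   # eight hours vs. 120 minutes
phase1_events, phase2_events = 27, 130               # NVFT events per year

total_resource_hours = phase1_resources * 8
resource_reduction = 1 - phase2_resources / phase1_resources
downtime_reduction = 1 - phase2_window_min / phase1_window_min
event_multiplier = phase2_events / phase1_events

print(total_resource_hours)           # 56
print(f"{resource_reduction:.0%}")    # 71%
print(f"{downtime_reduction:.0%}")    # 75%
print(f"{event_multiplier:.1f}x")     # 4.8x
```

On these numbers, the increase in yearly NVFT events is somewhat more than fourfold.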

Through automation, VMware service delivery has been able to meet the goal of running an NVFT at every site every year and we now have a new target of running NVFT at every site every quarter. We were also able to completely eliminate the resource requirement during the NVFT window, much to the delight of our operations teams.

NVFT enables VMware IT to ensure that new networking standards deliver improved availability, performance, and security, and reduce the number of Severity 1 incidents. Most importantly, it allows us to provide an enjoyable experience for our customers and colleagues, with 99.99% network availability throughout the year.
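To put the 99.99% figure in perspective, an availability target translates directly into a yearly downtime budget. The helper below is a generic calculation, not something taken from VMware's reporting; the function name `allowed_downtime_minutes` is hypothetical and a 365-day year is assumed.

```python
def allowed_downtime_minutes(availability_pct: float, days: int = 365) -> float:
    """Minutes of downtime permitted over the period at the given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    total_minutes = days * 24 * 60
    return (1 - availability_pct / 100) * total_minutes


# 99.99% over a 365-day year leaves roughly 52.56 minutes of downtime.
print(round(allowed_downtime_minutes(99.99), 2))
```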

VMware on VMware blogs are written by IT experts who share stories about our digital transformation using VMware products and services in a global production environment. Contact your sales representative or [email protected] to schedule an information session on this topic. Visit the VMware on VMware microsite and follow us on Twitter.


