Category: Business

On-Call Risk

I’m spending an hour or more running on most days now, with a couple of hours, and steadily increasing, each Saturday. I want to get back to spending time in nature — backpacking, hiking, and other activities that take us away from civilization. The issue with that is that I’m always on call.

Kelly and I make our living two ways. First, I have a work-from-home position as a Network Security Engineer for a large corporation, which provides our base salary. It’s flexible, but demands 40+ hours a week of my attention just like any other full-time job. I love the company and the team I work for – there have been some rough patches, but the only way I’ll give up this position is if it’s forcefully taken from me. Second, we have our own consulting and hosting company. Between the two, we have enough income to live comfortably — but if we lost one or the other, we’d eventually have to find a replacement.

For my day job, a coworker and I have arranged a rotating series of on-call shifts. The on-call load is relatively light, and it’s usually easy for me to arrange not to be on call if I’m going to be busy, just by giving him a heads-up. He tends to stay close to home and works in the UK, where he gets paid extra for being on call, so he’s usually happy to soak up those hours. On the flip side, for the consulting company, all the on-call work falls on me. I have a friend who will jump in and take care of things from time to time, but he’s not a full-time employee – rather an occasional contractor – and I can’t pay him to sit around waiting for things to break. Not to mention, he has other things that demand his time.

We have two racks of equipment for our hosting services. The majority of that equipment is several years old – architected on a strict budget, with many ad-hoc changes made over the years, and too many single points of failure for me to feel comfortable taking an extended leave of absence. We have some customers who rely on our hosted services for their daily workflow and income, pay us accordingly, and demand continuous uptime.

We moved into the first rack back in 2011, and the contract is coming up for renewal again this June. Since we’re negotiating a new contract, or possibly a move to a new facility, I decided to take a fresh look at our architecture and eliminate every single point of failure. My new goal is to make it so that if failures occur, I don’t have to be around to immediately address them. There are a few things in the racks that belong to customers – we don’t necessarily take responsibility for them, and if the hardware has issues, those issues are theirs to repair. I’ll disregard those in this design if I can’t convince the customers to move them on-premises or onto high-availability solutions.

First up is the switching infrastructure. We currently have a single switch in each rack. I’ve spent the last few years working with HPE Comware equipment for my primary employer and have been really impressed with the hardware. Because we have a mix of 1G and 10G equipment, I ended up buying two 5900AF-48G-4XG-2QSFP+ and two 5820AF-24XG switches. That’s moving from one 48-port switch per rack to four switches (with the hope of condensing both racks into a single one to reduce monthly expenditures). These HPE switches support IRF, Intelligent Resilient Framework, a type of stacking that allows for advanced configurations such as LACP bonding across chassis.
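For the curious, here is a rough sketch of what binding the pair of 1G switches into a single IRF fabric looks like in Comware. This is illustrative only – I’m assuming the IRF links run over the two QSFP+ ports (FortyGigE1/0/53-54 on these models); the actual port choices and member priorities will depend on the final cabling. The second switch first gets renumbered to member 2 (irf member 1 renumber 2, save, reboot) and then a matching irf-port 2/2 pointing at its own uplinks.

    # On the first switch (IRF member 1), from system-view -- illustrative sketch only
    irf member 1 priority 32                        # highest priority wins the IRF master election
    interface range FortyGigE 1/0/53 to FortyGigE 1/0/54
     shutdown                                       # ports must be down before binding them to an IRF port
    irf-port 1/1
     port group interface FortyGigE 1/0/53          # physical links that will carry the IRF connection
     port group interface FortyGigE 1/0/54
    quit
    interface range FortyGigE 1/0/53 to FortyGigE 1/0/54
     undo shutdown
    irf-port-configuration active                   # activate the IRF port bindings
    save

Once both members are cabled together and the second one reboots, the pair shows up as one logical switch, and server-facing LACP bonds (Bridge-Aggregation interfaces) can include a port from each physical chassis.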
The switches themselves have redundant fans and power supplies, and when paired, you can connect servers to both switches with bonding to survive a complete switch failure, cable failure, optic failure, and a few other usually catastrophic scenarios. I have one IRF stack containing the 1G switches, another containing the 10G switches, and a 4x10G bond between them. These switches also support BGP and advanced hardware ACL processing, which lets me retire the redundant routers we’re currently running; the services the switches can’t handle will be replaced with virtual routers. All of the infrastructure described below is exclusively 10G, except for management and out-of-band access.

Next, I wanted to replace the aged VMware infrastructure – several individual servers with some cross-server replication – with a true HA cluster. To start, I ordered three servers, each with dual 8-core CPUs, 384GB of RAM, and four 10Gb NIC ports (two dual-port cards). They’re running Proxmox, configured for high availability. Networking consists of two bonds staggered across both NICs and both switches – each bond uses one port from each card (a rough sketch of this layout appears at the end of this section). The first bond carries data, and the second is dedicated solely to storage. This arrangement gives me N-1 RAM capacity – 768GB to start with (three nodes at 384GB each, with one node’s worth held in reserve) – so the cluster can fully sustain a node failure. I currently have about 500GB in mind to migrate, which leaves a moderate amount of room to grow. One of the great things about this solution is that I can add additional hosts at any time – and even re-task some of the existing hardware in the rack as hosts in this cluster once the guests are migrated to the new infrastructure. Doing so would be a free or low-cost way to add another half terabyte of RAM, or more.

Last, and the thing I am most excited about, is the storage solution. I would have rebuilt the hypervisor solution a long time ago if I’d had a centralized commodity storage solution I was comfortable with, or the ability to invest in an off-the-shelf one that met all my requirements (they start in the tens of thousands and skyrocket from there). Ultimately, I was able to engineer a solution that checks all the boxes, using technologies I’m already familiar with, by accepting that a couple of key technologies I rejected long ago have since become viable. The end result physically consists of two storage heads running Debian, plus dual-controller SAS expander shelves that hold the drives (we can easily add more shelves at any time to increase capacity).

Without getting into a ton of detail that is beyond the scope of this blog, these are a few of the failure scenarios I’ve simulated, without issue, while running intensive stress tests and/or benchmarks:

- Cutting power to each 5900AF-48G-4XG-2QSFP+ switch (one at a time)
- Cutting power to each 5820AF-24XG switch (one at a time)
- Bulk live migration of VMs
- Cutting power to a host with active VMs (they die, but automatically boot up on other hosts)
- Cutting power to the active storage head
- Cutting power to the standby storage head
- Pulling the active SAS controller out of the expander
- Pulling the standby SAS controller out of the expander
- Randomly disconnecting network cables on the A or B side
- Cutting the A- or B-side PDUs

For all these solutions – switches, hypervisors, and storage – another upside is that we can perform upgrades and maintenance without impacting customers.
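To make the bond layout above a bit more concrete, here is roughly what the network configuration looks like on each Proxmox node. The NIC names and addresses are placeholders for this sketch – the real interface names depend on the cards and slots – but the idea is that each bond takes one port from each physical card, and each bond’s two links land on different members of the IRF stack, so any single NIC, cable, optic, or switch can die without dropping either network.

    # /etc/network/interfaces on a Proxmox node (illustrative; NIC names and IPs are placeholders)
    auto bond0
    iface bond0 inet manual
        bond-slaves enp65s0f0 enp66s0f0       # one port from each dual-port 10G card
        bond-mode 802.3ad                     # LACP, terminated on the IRF pair as one logical switch
        bond-miimon 100
        bond-xmit-hash-policy layer3+4

    auto vmbr0
    iface vmbr0 inet static                   # data network: the VM bridge rides on the first bond
        address 192.0.2.11/24
        gateway 192.0.2.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0

    auto bond1
    iface bond1 inet static                   # second bond, dedicated solely to storage traffic
        address 198.51.100.11/24
        bond-slaves enp65s0f1 enp66s0f1
        bond-mode 802.3ad
        bond-miimon 100
        bond-xmit-hash-policy layer3+4

On the switch side, each of these bonds corresponds to a Bridge-Aggregation interface spanning both IRF members, which is what turns losing an entire switch into a non-event.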
I’ve spent a week building this – the last three or four days of which consisted of extensive load and stress testing and failure simulation, all under heavy load and with excellent results. The new stack will hopefully be racked in the next month, if early contract negotiations with our datacenter pan out as I hope. Once this solution is in place, the VMs are migrated over, and the old gear is retired, I’ll be a lot more comfortable being out of reach and away from the computer. I’m sure at least a few of our customers will be happier being on true HA solutions, and performance looks like it will increase nicely. It’s a win-win, really, even though it did cost a pretty penny in new hardware. I’ve been avoiding this for a long time – not wanting to invest, not believing that the technologies needed for the storage solution were mature enough (ZFS on Linux, primarily) – but technology is always changing, as are customer requirements, as are our own needs and wants. This is going to be an awesome upgrade both in technology and, though it might seem unrelated to the uninformed, in quality of life for the two of us.
