Month: January 2019

On-Call Risk

I’m spending an hour or more running on most days now, with a couple hours – and steadily increasing – each Saturday. I want to get back to spending time in nature – backpacking, hiking, and other activities that take us away from civilization. The issue with that is that I’m always on call.

Kelly and I make our living two ways. First, I have a work-from-home position as a Network Security Engineer for a large corporation, which provides our base salary. It’s flexible, but demands 40+ hours a week of my attention just like any other full-time job. I love the company and the team I work for – there have been some rough patches, but the only way I’ll give up this position is if it’s forcefully taken from me. Second, we have our own consulting and hosting company. Between the two, we have enough income to live comfortably – but if we lost one or the other, we’d eventually have to find a replacement.

For my day job, a coworker and I have arranged a rotating series of on-call shifts. The on-call load is relatively light, and it’s usually pretty easy for me to arrange to not be on call if I’m going to be busy, just by giving him a heads-up. He tends to stay close to home and works in the UK, where he gets paid extra for being on call, so he is usually happy to soak up these hours. On the flip side, for the consulting company, all the on-call work is on me. I have a friend who will jump in and take care of things from time to time, but he’s not a full-time employee – rather an occasional contractor – and I can’t pay him to sit around waiting on things to break. Not to mention, he has other things that demand his time.

We have two racks of equipment for our hosting services. The majority of that equipment is several years old – architected on a strict budget, with many ad-hoc changes made over the years, and too many single points of failure for me to feel comfortable taking an extended leave of absence. We have some customers who rely on our hosted services for their daily workflow and income, pay us accordingly, and demand continuous uptime. We moved into the first rack back in 2011, and the contract is coming up for renewal again this June. Since we’re negotiating a new contract, or possibly a move to a new facility, I decided to take a fresh look at our architecture to eliminate all single points of failure. My new goal is to make it so that if failures occur, I don’t have to be around to immediately address them. There are a few things in the racks that belong to customers – we don’t necessarily take responsibility for them, and if the hardware has issues, then those issues are theirs to repair. I’ll disregard those in this design if I can’t convince the customers to move them on-premises or onto high-availability solutions.

First up is the switching infrastructure. We currently have a single switch in each rack. I’ve spent the last few years working with HPE Comware equipment for my primary employer, and have been really impressed with the hardware. Because we have a mix of 1G and 10G equipment, I ended up buying two 5900AF-48G-4XG-2QSFP+ and two 5820AF-24XG switches. That’s moving from one 48-port switch per rack to four switches (hoping to condense both racks into a single one, to reduce monthly expenditures). These HPE switches support IRF (Intelligent Resilient Framework), a type of stacking that allows for advanced configurations such as LACP bonding across chassis.
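For a rough idea of what IRF setup involves, here’s a minimal Comware sketch – member numbers, priorities, and port numbers are placeholders, not our production config:

    # Second chassis only: renumber it to IRF member 2 (a reboot applies it;
    # its ports then become 2/0/x).
    <HPE> system-view
    [HPE] irf member 1 renumber 2

    # On each member: give one chassis the higher priority for master election,
    # bind a physical link into the logical IRF port, and activate.
    [HPE] irf member 1 priority 32
    [HPE] interface FortyGigE 1/0/53
    [HPE-FortyGigE1/0/53] shutdown
    [HPE-FortyGigE1/0/53] quit
    [HPE] irf-port 1/1
    [HPE-irf-port1/1] port group interface FortyGigE 1/0/53
    [HPE-irf-port1/1] quit
    [HPE] irf-port-configuration active

    # Once the stack forms, a dynamic (LACP) aggregation can span both members:
    [HPE] interface Bridge-Aggregation 10
    [HPE-Bridge-Aggregation10] link-aggregation mode dynamic
    [HPE-Bridge-Aggregation10] quit
    [HPE] interface Ten-GigabitEthernet 1/0/49
    [HPE-Ten-GigabitEthernet1/0/49] port link-aggregation group 10
    [HPE] interface Ten-GigabitEthernet 2/0/49
    [HPE-Ten-GigabitEthernet2/0/49] port link-aggregation group 10

With an aggregation spanning both members, a server with one link to each chassis keeps its bond up even if an entire switch dies.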
The switches themselves have redundant fans and power, and when deployed in pairs, you can connect servers to both switches with bonding to survive complete switch failure, cable failure, optic failure, and a few other usually catastrophic scenarios. I have one IRF stack containing the 1G switches, another consisting of the 10G switches, and a 4x10G bond between them. These switches also support BGP and advanced hardware ACL processing, which allows me to retire the redundant routers we’re currently running. I’ll replace the few services the switches can’t handle with virtual routers. All of the infrastructure described below will be exclusively 10G, except for management and out-of-band connections.

Next, I wanted to replace the aged VMware infrastructure – several individual servers with some cross-server replication – with a true HA cluster. I ended up ordering three servers to start with, each containing dual 8-core CPUs, 384GB RAM, and two dual-port 10Gb NICs. They’re running Proxmox, which has been configured for high availability. Networking consists of two bonds staggered across both NICs and both switches (each bond consists of one port from each of the two cards – see the sketch at the end of this section). The first bond is for data, and the second is dedicated solely to storage. Keeping usage to N-1 RAM capacity – 768GB to start with – maintains the cluster’s ability to sustain a full node failure. I currently have about 500GB in mind to migrate, so that leaves a moderate amount of room to grow. One of the great things about this solution is that I can add additional hosts at any time – and even re-task some of the existing hardware in the rack as hosts in this cluster once the guests are migrated to the new infrastructure. Doing so would be a free or low-cost way to add another half terabyte of RAM, or more.

Last, and the thing I am most excited about, is the storage solution. I would have rebuilt the hypervisor solution a long time ago if I had a centralized commodity storage solution I was comfortable with, or the ability to invest in an off-the-shelf one that would meet all my requirements (they start in the tens of thousands and skyrocket from there). Ultimately, I was able to engineer a solution that checks off all the boxes, using technologies I’m already familiar with, by accepting that a couple of key technologies I rejected a long time ago have since become viable. The end result physically consists of two storage heads running Debian, plus dual-controller SAS expanders (which we can easily add at any time to increase capacity) that hold the drives.

Without getting into a ton of detail that is beyond the scope of this blog, these are a few of the failure scenarios I’ve simulated while running intensive stress tests and/or benchmarks, without issue:

- Cutting power to each 5900AF-48G-4XG-2QSFP+ switch (one at a time)
- Cutting power to each 5820AF-24XG switch (one at a time)
- Bulk live migration of VMs
- Cutting power to a host with active VMs (they die, but automatically boot up on other hosts)
- Cutting power to the active storage head
- Cutting power to the standby storage head
- Pulling the active SAS controller out of the expander
- Pulling the standby SAS controller out of the expander
- Randomly disconnecting network cables for A- or B-side connections
- Cutting A- or B-side PDUs

For all these solutions – switches, hypervisors, and storage – another upside is that we can perform upgrades and maintenance without impacting customers.
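As for the staggered bond layout mentioned above, here’s a minimal ifupdown sketch for a single node – interface names and addresses are placeholders rather than our actual config:

    # /etc/network/interfaces (excerpt)
    # Each bond takes one port from each dual-port NIC, and each port
    # lands on a different chassis of the IRF pair.
    auto bond0
    iface bond0 inet manual
        bond-slaves enp65s0f0 enp129s0f0
        bond-mode 802.3ad
        bond-miimon 100
        bond-xmit-hash-policy layer3+4

    # Guest (data) traffic rides a bridge on top of bond0.
    auto vmbr0
    iface vmbr0 inet static
        address 198.51.100.11/24
        gateway 198.51.100.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0

    # Second bond, dedicated solely to storage traffic.
    auto bond1
    iface bond1 inet static
        address 10.0.99.11/24
        bond-slaves enp65s0f1 enp129s0f1
        bond-mode 802.3ad
        bond-miimon 100
        bond-xmit-hash-policy layer3+4

Because each bond’s two members terminate on different chassis of an IRF pair, losing a NIC, cable, optic, or entire switch costs bandwidth rather than connectivity.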
I’ve spent a week building this – the last three or four days of which consisted of extensive load/stress testing and failure simulation. I have tested several failure scenarios, under heavy load, with excellent results. The new stack will hopefully be racked in the next month, if early contract negotiations with our datacenter pan out as I hope. Once this solution is in place, the VMs are migrated over, and the old gear is retired, I’ll be a lot more comfortable being out of reach and away from the computer. I’m sure at least a few of our customers will be happier being on true HA solutions. Performance looks like it will increase nicely. It’s a win-win, really, even though it did cost a pretty penny in new hardware. I’ve been avoiding this for a long time – not wanting to invest, not believing that the technologies needed for the storage solution were mature enough (ZFS on Linux, primarily) – but technology is always changing, as are customer requirements, as are our own needs and wants. This is going to be an awesome upgrade in both technology and – though it might seem unrelated to the uninformed – quality of life for the two of us.

Daily Breakfast

I don’t really consider myself a breakfast person. A lot of people tout it as the most important meal of the day, but historically it is one that I’ve skipped far more than I’ve consumed. Since becoming more active, I’ve started making smoothies when I wake up, unless I’m substituting pancakes or another special dish in their place. My recipe is more or less a basic guideline, based loosely on a No Meat Athlete blog post, that I sub ingredients into for variety. The basic outline, which serves two people, is:

- 3 frozen bananas (we buy them fresh, allow them to ripen, then peel, quarter, and vacuum seal them three per bag)
- 2 cups of unsweetened frozen fruit (usually two varieties)
- 2 tbsp flax seed
- 1/4 cup fresh walnuts OR 4 scoops Orgain Vanilla Bean Protein Powder
- 2 or 3 large handfuls of spinach

My favorite combination is peach, pineapple, and Orgain. Other fruits we keep in the house and mix-n-match are blueberries, blackberries, raspberries, strawberries, rhubarb, mango, cherries, and oranges. We usually reserve the Orgain for Saturday, which is our long workout day.

One final thing to note: several years ago we went through several cheap blenders making smoothies, dips, and creams – burning up the motors or just not being satisfied with the results. Eventually we invested in a Vitamix 5200. That was seven years ago – and it still works flawlessly. We recently upgraded to a refurbished 5300 with the newer pitcher design, and relegated the 5200 to the RV. I highly recommend investing in a Vitamix – they are amazing machines, built to last a lifetime of heavy use.

Foam Rolling & Roll Recovery

We haven’t had much to post about lately, as we’ve been getting through the holidays and settling back into the groove of normal life. Things have been both dull and busy – a lot of monotonous catching up to do, mainly, with a few short memories sprinkled in, like spending New Year’s with some great friends of ours from out of town. Kelly’s ankle is healing up nicely – she’s back to riding regularly and starting to jog again – and I’m still working to tick off at least 24 miles per week.

My running coach has been after me for a while to start foam rolling and focusing on recovery. We have all the gear required for foam rolling, which consists of a few foam cylinders in different sizes and some lacrosse balls. Despite my best attempts, I can’t seem to make a routine of it. You have to get on the floor, or maybe a mat or blanket, and roll around while supporting yourself and manipulating your body to put pressure on sore areas – something that can be a substantial and time-consuming, not to mention painful, workout in itself.

Roll Recovery makes an alternative solution that accomplishes the same thing. Touted as a “self-massage tool [that] takes the extra effort out of an intense foam rolling session”, it is basically a vise that you clamp your limbs into and move along the length of them. It doesn’t address the last complaint I made about foam rolling – the pain, which is possibly even worse – but it does a great job of reducing the effort and time required. The R8 seems to have a cult following and a lot of obsessed users, so I figured I’d try it for myself. So far, I’ve found I pick it up when idle throughout the day and use it while watching TV or even listening to phone calls (meetings). That sure beats blocking out a 45-minute window to toss around on the floor while being able to concentrate on nothing else. Kelly is forming similar habits with it.

We also picked up the R3 and their stretch mat. The R3 is a foot roller specifically designed to address plantar fasciitis, something I’ve recently begun to experience, and the mat is a gimmicky number that gives you a platform you can stretch on without falling off – something that may work for Kelly, but doesn’t seem to work for my 6’0″ frame. Even though it may be a tad too small for me, and I don’t mind sitting on the ground most of the time, it seems perfect for throwing in the truck and taking to the lake or trailhead with me.

To date, as you probably assume, I haven’t been great about massage or rolling or recovery. I don’t have enough previous recovery experience to compare with, but what I can say is that my run yesterday (after two days of using the R8) was one of the best I’ve experienced in a while.
