Migrating – Cleaning Out the Server Closet

Migrating to a new provider or a new architecture too often happens without a rationalization of resources to what the business truly needs – with 8 distinct layers of waste compounding to turn a seemingly well-managed IT budget into a wasteful one. But while we’re on that subject, why wait for a migration?

I moved a couple of months back. I think it was a pretty good deal – 50% more space for 47% more cost, and nicer.

As part of the move, a few bags of clothes went to Goodwill; a few boxes of books sit waiting to be donated or eBay’ed depending on our collective laziness / greed index.

But out of hundreds of CDs and DVDs, fewer than 10 went in the trash – an old Linux distribution from a venture that went defunct in 2008, and a few hardware drivers proudly bearing “Windows 98 only” on the top.

Out of generations of gadgets and gear, the only concession to space constraints was shipping some old floppy drives and RCA cables to the basement.

While my wife may call me a hoarder in training, I’m downright ascetic compared to the practices of many IT infrastructure buyers. In fact, through our work, I’ve seen at least 8 layers of surplus technology spend sloshing around in the unexamined corners of data center, managed hosting, and cloud budgets:

1. Unused square footage
2. Underutilized power / facility density
3. Overprovisioned power in a rack
4. Premium infrastructure for non-premium workloads
5. Dormant servers
6. Oversized and under-virtualized physical devices
7. Oversized virtual instances
8. Zombie apps and repositories no longer actively used by anyone

While each of these has its own pedigree, they have three things in common. First, compute resources are not tracked at the workload level – making it easy for inefficiencies to hide behind mission-critical expenses. Second, risk aversion – it’s hard to let go of compute capacity, because an excess drains your budget slowly, while a shortage creates immediate emergencies. And third, the spectacular efficiency gains from implementing new technologies and/or using new providers make it hard to see just how much more can be done – making the best and brightest IT leaders some of the biggest hoarders of capacity.

We’ll go through each one and identify symptoms, causes, prevention, and cure mechanisms. But just as I’d have less wasted space if someone asked me “do you really need a DVD copy of that movie after watching it twice and having it available on Netflix?” it may be good to have an outside eye take a fresh look.

1. Unused Square Footage

  • Diagnosis: Of all the layers of waste, this is the one our clients are most likely to be aware of. Visit a facility or examine a floor plan and you’ll see the empty space.
  • Causes: An over-optimistic usage forecast from the business can make it a problem right off the bat. Highly successful virtualization efforts have also sometimes reduced the IT footprint of existing compute capacity. Inflexible provider contracts turn it into a long-term issue.
  • Prevention: Question usage forecasts, and ask for a range from conservative to optimistic. Build flexibility into provider contracts. Use IaaS for spillover / peak capacity.
  • Cure: See if your provider is interested in scaling down the contract – they may have other demand for the same capacity. Use a sourcing advisor to identify sub-lessees, apply additional reputational pressure, and find creative solutions.

2. Unused Power Density

  • Problem: Materially lower usage of power than the total allocated by the provider (typically less than 75% of allocated power on average).
  • Diagnosis: Review of power audits vs. contracted capacity (a rough sketch of this check follows this list).
  • Causes: Same as above, plus lack of consolidation as servers are decommissioned – old servers are taken out from many places, but new ones are placed in fresh new racks.
  • Prevention: Identify maximum density in a rack / cabinet, and try to fill half cabinets before starting new ones.
  • Cure: A space / power rationalization. This takes some effort, so make sure you have a plan of disposing of excess capacity afterward – the cure will reduce the problem to the previous one, but it doesn’t do much good if you’re still stuck with the extra space later.
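To make that diagnosis concrete, here’s a minimal sketch of the power-audit comparison in Python. It assumes you can export per-cabinet readings (kW) and contracted allocations from your provider’s portal; all the names and numbers are illustrative.

    # Flag cabinets drawing well under their contracted power.
    CONTRACTED_KW = {"cab-01": 5.0, "cab-02": 5.0, "cab-03": 8.0}  # hypothetical
    AUDIT_READINGS_KW = {                                           # hypothetical
        "cab-01": [4.6, 4.8, 4.7],
        "cab-02": [1.1, 1.3, 1.0],
        "cab-03": [3.9, 4.2, 4.0],
    }
    UTILIZATION_FLOOR = 0.75  # the "less than 75% of allocated power" rule of thumb

    for cab, allocated in CONTRACTED_KW.items():
        readings = AUDIT_READINGS_KW.get(cab, [])
        avg = sum(readings) / len(readings) if readings else 0.0
        if avg / allocated < UTILIZATION_FLOOR:
            print(f"{cab}: {avg / allocated:.0%} of {allocated} kW -- consolidation candidate")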

3. Overprovisioned / Custom Rack Power

  • Problem: Paying for power circuits that, if used fully, would exceed your maximum allowed power consumption
  • Diagnosis: Review maximum cabinet density vs. circuits connected to it. If you have too many circuits and are on metered power, it’s usually not a problem – you’ll only pay for your usage anyway. But if you’re paying per circuit and have ones that you can’t use without breaking the provider’s usage caps, money is being wasted (the sketch after this list shows the arithmetic).
  • Causes: Following architecture designers / manufacturers very literally without questioning discrepancies. Oftentimes the spec for the highest density configuration is followed without realizing that the actual implementation is a mid-range one.
  • Prevention: Buy metered power. If that fails, identify any discrepancies between circuits and allocated power (most suppliers will ask you “are you sure?”).
  • Cure: Convert to a metered power plan (this may take some negotiation). Rationalize circuits, ensuring that the OpEx savings over the contract term is higher than the switching cost. While you’re at it, see if you have odd circuit configurations – sometimes a 220V 50A circuit is more expensive than a 220V 60A one because it’s custom.
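The circuit arithmetic from the diagnosis above, as a rough sketch – the circuit specs and contracted cap are made up, and the 0.8 factor reflects the common practice of loading a breaker to at most 80% of its rating:

    # Compare usable circuit capacity against the provider's usage cap.
    circuits = [(220, 30), (220, 30), (220, 50)]  # (volts, amps) -- hypothetical
    contracted_cap_kw = 10.0                      # cabinet usage cap -- hypothetical

    usable_kw = sum(v * a * 0.8 for v, a in circuits) / 1000.0
    if usable_kw > contracted_cap_kw:
        excess = usable_kw - contracted_cap_kw
        print(f"{usable_kw:.1f} kW of circuits vs. a {contracted_cap_kw:.1f} kW cap "
              f"-- paying for {excess:.1f} kW that can never be drawn")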

4. Premium Infrastructure for Non-Premium Compute

  • Problem: Paying more for a premium service that the actual workload derives no benefit from
  • Diagnosis: Some facilities and compute platforms are more expensive. A Tier IV suite in a top network interconnect facility in proximity to trading exchanges in the NY metro is about the epitome of it in the data center world, but even one of those factors can sometimes double your potential price. If you’re using that space to house your accounting systems, you’re wasting money. Similarly, if you’re using expensive database storage to house user-generated video content, it’s a waste.
  • Causes: It’s easier to put compute projects in the same location, especially one first selected for mission critical use. The platform is proven; the process is well-oiled, and no one wants the hassle of designing separate tiers of compute options.
  • Prevention: Start infrastructure design with a multi-tier strategy. Which servers need to be in a Tier IV facility, and which can afford a few minutes of downtime? Can your staff administer infrastructure remotely in WA or NV rather than drive to it in NYC? What content needs 20ms latency and what can be delivered at a more leisurely pace? What’s the smallest / cheapest infrastructure that can support your app’s peak usage? (A toy tiering sketch follows this list.)
  • Cure: Design a multi-tier infrastructure platform as above. Then prioritize existing projects for migration to the appropriate tier. In the case of uncommitted cloud instances, this may be instant. In the case of long-term co-lo leases, a migration may only reduce the problem to the previously discussed ones (excess square footage / power), which need to be solved first.
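To illustrate those tiering questions, here’s a toy sketch that classifies workloads by the availability and latency they actually need, instead of defaulting everything to the premium tier. The thresholds, tier labels, and workloads are all assumptions for illustration:

    # Map each workload's real requirements to the cheapest adequate tier.
    def assign_tier(max_downtime_min_per_month, max_latency_ms):
        if max_downtime_min_per_month < 1 and max_latency_ms <= 20:
            return "Tier IV suite, premium interconnect"
        if max_downtime_min_per_month < 30:
            return "Tier III, standard colo"
        return "Tier II, low-cost or remote facility"

    workloads = {  # hypothetical requirements gathered from app owners
        "trading-gateway": (0.5, 10),
        "accounting":      (120, 200),
        "video-archive":   (240, 500),
    }

    for name, (downtime, latency) in workloads.items():
        print(f"{name}: {assign_tier(downtime, latency)}")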

5. Dormant “Zombie” Servers

  • Problem: 20%-30% of servers in data centers are not accessed for as long as 6 months at a time, while continuing to draw power.
  • Diagnosis: This is one of the most well-publicized sources of waste, covered by McKinsey in 2008, Uptime Institute in 2014 and Stanford / Anthesis in 2015. It’s also not trivial to diagnose, requiring either specialty software or in-depth examination of usage. Fortunately, as buyers migrate to IaaS, usage dashboards are becoming more commonplace.
  • Causes: Apps are outdated or never adopted, but the facilities team never gets the message. Clustering / virtualization does not dynamically reassign resources. Unused servers are not put in standby mode.
  • Prevention: Tighter communication between facilities and business, especially around the app sunsetting process. Periodic audits / review of usage logs.
  • Cure: An in-depth audit using either specialty software such as TSO Logic or built-in dashboards, then decommissioning or repurposing the servers it flags (a minimal sketch of the log review follows this list).
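A minimal sketch of that log review, assuming you can export a “last meaningful access” timestamp per server from monitoring, jump-host logs, or an IaaS dashboard – the inventory below is made up:

    # Flag servers untouched for longer than the 6-month zombie window.
    from datetime import datetime, timedelta

    inventory = {  # hypothetical: server -> last meaningful access
        "app-prod-01": datetime(2016, 5, 2),
        "etl-legacy":  datetime(2015, 9, 14),
        "report-gen":  datetime(2016, 4, 28),
    }
    ZOMBIE_WINDOW = timedelta(days=180)
    now = datetime(2016, 6, 1)  # or datetime.now()

    for server, last_seen in inventory.items():
        if now - last_seen > ZOMBIE_WINDOW:
            print(f"{server}: idle since {last_seen:%Y-%m-%d} -- zombie candidate")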

6. Oversized / Under-Virtualized Physical Servers

  • Problem: Dedicated servers for apps that use only a fraction of their resources; compute pools in virtualized servers that vastly exceed the collective consumption (e.g. 6 virtual instances per physical device where 10 could fit in comfortably).

  • Diagnosis: Review of system logs for peak CPU, memory, and I/O utilization. If these are consistently below 70%, conduct a more in-depth review of the specific issues (a back-of-envelope version of this check follows this list).
  • Causes: Virtualization is either never done or performed sloppily (e.g. with one customer, users were allowed to allocate an entire dedicated machine from the internal private cloud, so that’s what many of them did by default).
  • Prevention: Virtualization / private cloud efforts that allow for multiple instance sizes and making sure the appropriate ones are actually used.
  • Cure: If not virtualized, start down the road. If virtualized, refine approach as above.
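A back-of-envelope version of the diagnosis above: if peak CPU, memory, and I/O all stay under the 70% line across the review window, the box is a consolidation candidate. The sample peaks are illustrative:

    # Identify physical servers whose peaks never approach capacity.
    PEAK_THRESHOLD = 0.70

    servers = {  # hypothetical peak utilization over a 30-day window
        "db-primary": {"cpu": 0.91, "mem": 0.78, "io": 0.66},
        "web-legacy": {"cpu": 0.22, "mem": 0.31, "io": 0.12},
        "batch-node": {"cpu": 0.48, "mem": 0.40, "io": 0.25},
    }

    for name, peaks in servers.items():
        if all(v < PEAK_THRESHOLD for v in peaks.values()):
            print(f"{name}: all peaks under {PEAK_THRESHOLD:.0%} -- virtualize or consolidate")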

7. Improperly Sized Virtual / Cloud Instances

  • Problem: Instances sized well beyond what their workloads need. The uber-example: a small, rarely used app allocated a reserved XXL memory-intensive instance on AWS.
  • Diagnosis: There’s a host of services to analyze and right-size cloud instances, including ours (a sketch of the underlying data pull follows this list).
  • Causes: Buyers guess at what their requirements may be, and guess conservatively “just in case.”
  • Prevention: Periodic / ongoing review of utilization to scale down oversized instances and move the infrequently used ones to on-demand.
  • Cure: In-depth audit and optimization.
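One way to pull the utilization data behind such an audit is CloudWatch via boto3 – a sketch, not a full tool, assuming valid AWS credentials and an illustrative instance ID. A single-digit CPU average on a reserved XXL instance is exactly the pattern described above:

    # Average CPU over the last two weeks for a given EC2 instance.
    from datetime import datetime, timedelta
    import boto3

    cw = boto3.client("cloudwatch", region_name="us-east-1")

    def avg_cpu(instance_id, days=14):
        end = datetime.utcnow()
        stats = cw.get_metric_statistics(
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
            StartTime=end - timedelta(days=days),
            EndTime=end,
            Period=3600,
            Statistics=["Average"],
        )
        points = [p["Average"] for p in stats["Datapoints"]]
        return sum(points) / len(points) if points else 0.0

    cpu = avg_cpu("i-0123456789abcdef0")  # hypothetical instance ID
    if cpu < 10.0:
        print(f"{cpu:.1f}% avg CPU -- downsize the instance or move it to on-demand")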

8. Unused Apps

  • Problem: An app continues to run, performing regular updates, pulling in data, and outputting reports – yet no one actually consumes any of it.
  • Diagnosis: No users access the system, and no other app that is in use draws on it (a sketch of this cross-check follows this list).
  • Causes: Apps are either superseded by something newer or never adopted by the business, but their structure requires ongoing compute workloads, thus avoiding detection as a “zombie server” described above.
  • Prevention: Periodic audit of running apps and identifying both business users and other app dependencies.
  • Cure: If an audit reveals an app is not used, put it on hold and see if anyone complains.
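The cross-check from the diagnosis, sketched in Python – the app names, login counts, and dependency map are all made up for illustration:

    # An app with no recent users and no downstream consumers is a pause candidate.
    recent_users = {  # hypothetical: app -> distinct logins in the last 90 days
        "crm": 41, "old-reporting": 0, "billing": 12,
    }
    consumed_by = {  # hypothetical: app -> downstream apps that read its output
        "crm": ["billing"], "old-reporting": [], "billing": [],
    }

    for app, users in recent_users.items():
        if users == 0 and not consumed_by.get(app):
            print(f"{app}: no users, no downstream consumers -- pause and watch")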

When to Perform an Optimization

At minimum, a thorough optimization across all of the above layers of waste should be performed when migrating to a new provider / facility, moving to the cloud, or even performing a hardware refresh. Better yet, identify all the waste now, then get some help to work out the right solution with your existing providers or to migrate to a new, better-fitting one.

Any questions? Don't hesitate to ask us.
