Hardening a Live OpenStack Cluster to CIS Level 2 Without Taking It Down

The CIS Benchmarks assume two things that production almost never gives you: a freshly installed host and a maintenance window. The benchmark is written as if you are hardening a machine before it does any real work, with nobody depending on it yet. That is the easy case, and it is not the case anyone running a cloud is actually in.

The interesting problem is the other one. You have a cluster that is serving traffic. It has tenants on it, or workloads, or both. You need it to come out the far side of a Level 2 hardening pass measurably more secure, fully documented, and without a single customer noticing that anything happened. Running the tool is the trivial part. Doing it on something that cannot go down is where the work is.

These are the notes we wish we had before our first pass. They are written against Ubuntu and Canonical's Ubuntu Security Guide (USG), which is the tooling we use, but the lessons translate to OpenSCAP, to STIG remediation, and to any benchmark-driven hardening on a running system.

Start from a tailored profile, not the stock benchmark

The first instinct is to run the Level 2 profile as published and chase the score toward 100. Do not do that on a cluster that has a job. Stock CIS Level 2 contains controls that are correct for a generic server and wrong for a node whose entire purpose is to route packets and run hypervisors.

The obvious example: a benchmark will want IP forwarding disabled. A compute node forwarding tenant traffic cannot have IP forwarding disabled, because forwarding is the point. There is a whole category of controls like this, where the "secure" setting fights the role the machine plays. A host-level firewall manager can collide with the network layer you already run. Disk and mount-layout expectations assume a partitioning scheme you did not use and cannot change without a reinstall. Tooling that duplicates something your monitoring already does better adds noise without adding safety.

The right move is to build a tailored profile: the benchmark, minus the controls that do not apply to this role, with each removal written down next to the compensating control that covers the same risk a different way. That written list of deviations is the actual deliverable. The numeric score is a vanity metric. When a security reviewer or an auditor looks at your posture, the question they ask is not "what did you score," it is "what did you turn off and why." A clean, reasoned set of documented exceptions is worth more than a 100 reached by applying controls that quietly break what the node is for.

Decide the tailoring before you touch a node. Changing your mind about a deviation after a remediation run means another remediation run, which means another reboot, which on a live cluster means another drain cycle. Get it right on paper first.
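One way to keep the paper version honest is to check mechanically that every rule switched off in the tailoring file has a matching entry in the exceptions register. The sketch below assumes the tailoring file is XCCDF tailoring XML with one `<select idref="..." selected="false"/>` element per line, and that the register is a hypothetical tab-separated file (`deviations.tsv`) whose first column is the rule ID. Neither the function name nor the register format comes from USG; adapt both to whatever you actually keep.

```shell
# undocumented_deviations TAILORING_XML DEVIATIONS_TSV
# Prints every rule ID that the tailoring file disables but the register
# does not mention. Empty output means every deviation is written down.
undocumented_deviations() {
  local tailoring="$1" register="$2"
  # Accept both <select ...> and namespace-prefixed <xccdf:select ...>.
  grep -E '<([A-Za-z0-9_-]+:)?select[^>]*selected="false"' "$tailoring" \
    | sed -E 's/.*idref="([^"]+)".*/\1/' \
    | sort -u \
    | while read -r rule; do
        # Column 1 of the register is assumed to hold the rule ID.
        cut -f1 "$register" | grep -qxF -- "$rule" || printf '%s\n' "$rule"
      done
}

# Example (paths are placeholders):
#   undocumented_deviations /etc/usg/your-tailoring.xml ./deviations.tsv
```

Run it as a gate before the first remediation: if it prints anything, the tailoring is not finished.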

Check the package state before you touch anything

The single highest-leverage thing you can do before a remediation run is boring: confirm that the package manager is in a clean state.

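sudo dpkg --audit          # lists packages left half-installed or half-configured; prints nothing on a clean host
sudo dpkg --configure -a   # finishes configuring anything the audit turned up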

A half-configured package is invisible day to day. The service runs, nothing logs an error, and you have no reason to look. But a remediation tool that needs to reconfigure PAM, or flip an AppArmor profile, or rewrite an SSH daemon config, depends on those packages being fully configured. If `libpam-pwquality` or an AppArmor package is sitting half-installed, the hardening step that touches it will half-apply and report success. You will not find out until something authentication-related breaks weeks later, long after you have stopped associating it with the hardening pass.

Run the audit before the fix. Then run it again after the fix, because the remediation itself can leave a package half-configured, and a host that was clean going in is not guaranteed to be clean coming out. Two minutes of checking on each side saves you a debugging session that will not look anything like a hardening problem when it surfaces.
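If you want the before-and-after check to be something a script can compare rather than something you eyeball, a small filter over `dpkg-query` output does it. This is a sketch, not part of USG: the function name is ours, and it assumes the input is produced with the `${db:Status-Abbrev}` and `${Package}` fields separated by a tab.

```shell
# unsettled_packages: reads "STATUS<TAB>PACKAGE" lines on stdin and prints
# the packages whose dpkg state is not settled.
# In the three-character status abbreviation, the second character is the
# current state (U = unpacked, F = half-configured, H = half-installed,
# W = triggers awaited, t = triggers pending) and the third is an error
# flag (R = reinstall required).
unsettled_packages() {
  awk -F'\t' '
    { state = substr($1, 2, 1); err = substr($1, 3, 1) }
    state ~ /[UFHWt]/ || err == "R" { print $2 "\t" substr($1, 1, 2) }
  '
}

# On a real host:
#   dpkg-query -W -f='${db:Status-Abbrev}\t${Package}\n' | unsettled_packages
```

Empty output on both sides of the remediation run is the goal. Anything printed afterward that was not printed before was left behind by the run itself.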

The tool will lie to you if you let it pipe

This one cost us a full drain-and-reboot cycle before we understood it, so it gets its own section.

USG's fix run prints a long stream of progress to stdout. The natural thing to do, watching a long-running command, is to pipe it somewhere readable:

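sudo usg fix --tailoring-file ... | grep -m 1 -i fail   # don't do this

The moment `grep -m 1` has its match, or `head` hits its line count, it exits and closes the read end of the pipe. (A plain `grep` reads to end of input and does not do this; anything that exits early does.) The fix process gets a SIGPIPE on its next write and dies. The shell reports a pipeline's exit status as that of its last command, so unless `pipefail` is set the exit code comes back 0. The remediation is half done. Nothing tells you, because from the shell's point of view the command finished cleanly.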

The tell is subtle and worth memorizing: your post-fix audit looks identical to your pre-fix audit. Same failures, same count. If a remediation run reports success and changes nothing, the first thing to check is whether you piped it into something that closed the pipe early.
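The mechanism is easy to reproduce without touching a real host. In the bash sketch below, `yes` stands in for a long-running fix and `head` for a reader that exits early; neither has anything to do with USG itself.

```shell
#!/usr/bin/env bash
# Without pipefail, the pipeline's status is the reader's status.
yes "remediating rule" | head -n 1 >/dev/null
statuses=("${PIPESTATUS[@]}")   # must be captured immediately after the pipeline
echo "without pipefail: writer=${statuses[0]} reader=${statuses[1]}"

# With pipefail, a writer killed mid-run makes the whole pipeline fail.
set -o pipefail
if yes "remediating rule" | head -n 1 >/dev/null; then
  pipefail_result=ok
else
  pipefail_result=failed
fi
set +o pipefail
echo "with pipefail: ${pipefail_result}"
```

With the default signal handling, the writer's status is normally 141, which is 128 plus SIGPIPE's signal number 13, while the reader reports 0. Setting `pipefail` makes the failure visible, but it does not un-kill the writer. The fix is still to stop feeding the run into a reader that exits early.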

Drain the output to a file and read the file afterward:

sudo usg fix --tailoring-file /etc/usg/your-tailoring.xml --only-failed \
  | sudo tee /root/usg-fix-$(date +%s).log >/dev/null

While you are at it, set the locale explicitly for the run. Remediation scripts that parse their own command output are sensitive to locale, and a fix run launched in a stripped-down or mismatched environment can misbehave in ways that are maddening to reproduce. An explicit, known environment removes a variable you do not want to be debugging at two in the morning.
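Both ideas fit in one small wrapper: pin the locale for the child process, send its output straight to a file with no pipe involved, and hand back the command's real exit status. This is a sketch under assumptions: `run_logged` is a name we made up, and `C.UTF-8` is one reasonable choice of locale, not something the tool requires.

```shell
# run_logged LOGFILE COMMAND [ARGS...]
# Runs COMMAND with a fixed locale, writes stdout and stderr to LOGFILE,
# appends the exit status to the log, and returns that same status.
run_logged() {
  local logfile="$1"
  shift
  # env sets the locale for the child only. Redirecting to a file means
  # there is no reader that can exit early and kill the run.
  env LC_ALL=C.UTF-8 LANG=C.UTF-8 "$@" >"$logfile" 2>&1
  local status=$?
  printf 'exit status: %s\n' "$status" >>"$logfile"
  return "$status"
}

# Example, from a root shell so the log under /root is writable:
#   run_logged "/root/usg-fix-$(date +%s).log" \
#     usg fix --tailoring-file /etc/usg/your-tailoring.xml --only-failed
```

Follow the log from a second terminal with `tail -f` if you want to watch progress. Killing `tail` does not touch the run.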

AppArmor does not fully enforce itself

A hardening pass will move AppArmor toward enforcing mode, and it is easy to assume that means every profile is now enforcing. It does not. Some profiles get loaded in complain mode, some are skipped, and the pass does not always leave you where you think it does. Check the actual enforce count after the run rather than trusting that the box is ticked.
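To see which profiles are not enforcing, rather than just how many are, you can filter the kernel's own list. The sketch assumes the `name (mode)` line format that `/sys/kernel/security/apparmor/profiles` uses. The function name is ours.

```shell
# profiles_not_enforcing: reads "NAME (MODE)" lines on stdin and prints
# every profile whose mode is anything other than enforce.
profiles_not_enforcing() {
  awk '$NF != "(enforce)" { print }'
}

# On a real host, as root:
#   profiles_not_enforcing < /sys/kernel/security/apparmor/profiles
```

For a quick count, `sudo aa-status --enforced` and `sudo aa-status --complaining` print the numbers directly. Record both before and after the run so the change is visible.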

There is also a specific trap worth knowing about. A profile that pulls in a local include file will refuse to enforce if the include does not exist, and on a fresh setup that local include is sometimes just a placeholder that was never created. The profile sits there un-enforced, the tooling reports a problem it words unhelpfully, and the fix is a one-line `touch` to create the empty include the profile expects. Knowing that this is the failure mode turns a confusing afternoon into a thirty-second fix.
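You can find these before the tooling trips over them. The sketch below scans the top level of a profile directory for unconditional `include <local/...>` lines and reports the include files that do not exist. It deliberately skips `include if exists`, which does not fail on a missing file. The function name and the single-directory scan are our simplifications.

```shell
# missing_local_includes [PROFILE_DIR]
# Prints the path of every local include that a profile pulls in
# unconditionally but that is absent on disk.
missing_local_includes() {
  local dir="${1:-/etc/apparmor.d}"
  grep -hoE '^[[:space:]]*#?include[[:space:]]+<local/[^>]+>' "$dir"/* 2>/dev/null \
    | sed -E 's|.*<(local/[^>]+)>.*|\1|' \
    | sort -u \
    | while read -r rel; do
        [ -e "$dir/$rel" ] || printf '%s\n' "$dir/$rel"
      done
}

# Example:
#   missing_local_includes /etc/apparmor.d
```

Review what it prints, `touch` each path, and reload the affected profile. An empty local include changes nothing about the policy; it only lets the profile load.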

Drain first, one node at a time, and respect quorum

Now the part that makes it a live-cluster operation rather than a single-host chore.

A hardening pass ends in a reboot. Some controls only take effect after the kernel comes back up, so there is no skipping it. That means every node you harden is a node you are about to take down and bring back, and the whole game is making that invisible to whatever the node is serving.

Cordon the node out of rotation before you start. Drain its workloads. If it sits behind a load balancer, put that backend into maintenance so health checks stop sending it traffic rather than discovering it is gone the hard way. Move or live-migrate anything stateful off it.
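For a compute node, the cordon-and-drain step might look like the sketch below. It assumes admin credentials are already loaded, a reasonably recent `openstack` client (check `openstack server migrate --help` for the `--live-migration` and `--wait` flags on yours), and instances that can be live-migrated at all. The helper name and the disable reason are made up, and the load balancer step is not covered because it depends entirely on what you run.

```shell
# drain_compute_host HOST
# Disables nova-compute on HOST so nothing new is scheduled there, then
# live-migrates every instance currently on it, one at a time.
drain_compute_host() {
  local host="$1"
  openstack compute service set --disable \
    --disable-reason "CIS hardening" "$host" nova-compute || return 1
  openstack server list --all-projects --host "$host" -f value -c ID \
    | while read -r server; do
        # </dev/null keeps the client from consuming the list of IDs.
        openstack server migrate --live-migration --wait "$server" </dev/null \
          || echo "migrate failed: $server" >&2
      done
}

# Example (hostname is a placeholder):
#   drain_compute_host compute-07
```

Before rebooting, list the host's instances again and confirm the list is empty. Anything that failed to migrate is a decision for a human, not a retry loop.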

Then go one node at a time, and watch your quorum. Anything that runs as a quorum-based cluster across your control nodes (the database layer, the message broker) has a hard rule: with three members, you can lose one and stay healthy; you cannot lose two. Reboot one of three and the cluster rides through it. Reboot two and you have turned a planned maintenance into an outage, and depending on the failure mode of the thing you broke, possibly a messy recovery rather than a clean one. Bring each node fully back, confirm cluster health is green across storage, broker, and database, and only then move to the next. Patience here is not optional; it is the difference between a hardening pass and an incident report.
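The arithmetic behind that rule is worth encoding as a gate in whatever drives the rollout. The sketch assumes simple majority quorum. How you obtain the healthy-member count is specific to each service and is left out.

```shell
# safe_to_reboot_one TOTAL_MEMBERS HEALTHY_MEMBERS
# Succeeds only if the cluster would still hold a majority after one
# more member goes down.
safe_to_reboot_one() {
  local total="$1" healthy="$2"
  local quorum=$(( total / 2 + 1 ))
  [ $(( healthy - 1 )) -ge "$quorum" ]
}

# Example: three members, all healthy, so one reboot is safe.
#   safe_to_reboot_one 3 3 && echo "proceed"
# Example: three members, one still rejoining, so wait.
#   safe_to_reboot_one 3 2 || echo "hold"
```

Run the check once per quorum-based service before each reboot. All of them have to pass, not just the one you were thinking about.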

Budget for slower boots afterward. A freshly hardened node loads more audit rules, may be running FIPS-validated crypto, and on a containerized control plane has to cold-start its whole container fleet. A boot that normally takes five minutes can take ten or more the first time. Know that number going in so you do not start a recovery procedure at minute eight for a node that was going to come back fine at minute eleven.
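Writing the budget into the wait loop keeps anyone from improvising at minute eight. A minimal sketch follows. The function name, the 900-second budget in the example, and the SSH probe are all assumptions to replace with your own.

```shell
# wait_with_budget BUDGET_SECONDS INTERVAL_SECONDS PROBE [ARGS...]
# Re-runs PROBE every INTERVAL_SECONDS until it succeeds or roughly
# BUDGET_SECONDS of sleeping have elapsed. Returns 0 on success, 1 when
# the budget is spent.
wait_with_budget() {
  local budget="$1" interval="$2"
  shift 2
  local waited=0
  while [ "$waited" -lt "$budget" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$interval"
    waited=$(( waited + interval ))
  done
  return 1
}

# Example (hostname is a placeholder):
#   wait_with_budget 900 15 ssh -o BatchMode=yes -o ConnectTimeout=5 compute-07 true \
#     || echo "compute-07 exceeded its boot budget; check the console"
```

The loop counts only the time spent sleeping, so a slow probe stretches the real wall-clock budget a little. That errs on the side of patience, which is the side you want here.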

The reboot will find your physical-layer surprises

The most frustrating way for a careful, drained, quorum-aware reboot to go wrong has nothing to do with hardening. On real hardware, a server can pause its boot at a firmware prompt, waiting for a human to acknowledge something: a changed disk, a configuration warning, whatever the BIOS decided was worth stopping for. A headless `reboot` issued over SSH will appear to hang forever, because the machine is sitting at a prompt no one is watching.

Have the out-of-band console open during every reboot, on every node, even the ones you have rebooted a hundred times. It is the only way to see the prompt, and clearing it is one keypress. Without it you are staring at a node that will not come back and reaching for a recovery plan you do not need.

And before any of this, snapshot the configuration. A tarball of `/etc` taken immediately before the remediation run turns rollback from a reinstall into a five-minute restore. You will probably never use it. Take it anyway.
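A snapshot helper that also proves the archive is readable costs a few lines. This is a sketch: the function name and the timestamped file name are our choices, and plain `tar` flags like these do not capture extended attributes or ACLs.

```shell
# snapshot_tree SRC_DIR DEST_DIR
# Archives SRC_DIR into DEST_DIR as a timestamped tarball, verifies the
# archive can be listed, and prints its path.
snapshot_tree() {
  local src="$1" dest="$2"
  local archive="$dest/$(basename "$src")-$(date +%Y%m%dT%H%M%S).tar.gz"
  # -C to the parent keeps archive paths relative; -p preserves permissions.
  tar -C "$(dirname "$src")" -czpf "$archive" "$(basename "$src")" || return 1
  # A snapshot you cannot read back is not a snapshot.
  tar -tzf "$archive" >/dev/null || return 1
  printf '%s\n' "$archive"
}

# Example, as root:
#   snapshot_tree /etc /root
```

Restoring is the reverse: extract with `tar -C / -xzpf` on the printed path. Copy the archive off the node as well, since a snapshot that lives only on the machine you are about to reboot covers fewer failure modes than it appears to.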

What "done" actually looks like

The output of a good hardening pass is not a screenshot of a high score. It is three things.

A durable baseline: the nodes pass a tailored profile, and they keep passing it, because the configuration is managed and not hand-edited back to broken the next time someone is in a hurry.

A written set of exceptions: every control you chose not to apply, the role-based reason, and the compensating control that covers the underlying risk. This is what survives a security review. This is the artifact.

A trend, not a snapshot: configuration drifts. The pass rate you had on the day you finished is not the pass rate you will have in ninety days unless something is watching. Tie the baseline into whatever monitors host state for you so that a node drifting away from its profile generates a signal, the same way an expiring certificate or a saturating disk would.
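The simplest form of that signal is a scheduled audit whose failure count is compared against the number you accepted on the day you finished. The sketch assumes the audit writes XCCDF results XML containing `<result>fail</result>` elements, as OpenSCAP-based tools generally do. The function names are ours, and where your tool writes its results file is for you to fill in.

```shell
# count_failed_rules RESULTS_XML
# Counts rule results marked "fail", with or without a namespace prefix.
count_failed_rules() {
  grep -oE '<([A-Za-z0-9_-]+:)?result>fail</' "$1" | wc -l | tr -d ' '
}

# drifted BASELINE_FAILS RESULTS_XML
# Succeeds (status 0) when the current failure count exceeds the baseline.
drifted() {
  [ "$(count_failed_rules "$2")" -gt "$1" ]
}

# Example (the results path is a placeholder):
#   drifted 4 /path/to/latest-results.xml && echo "baseline drift on $(hostname)"
```

A count is a blunt instrument, because one control can regress while another gets fixed and the total stays flat. Comparing the set of failing rule IDs is the stricter version of the same idea.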

We took our nodes into the high nineties against a tailored profile this way, hypervisors and control plane both, with every deviation documented and the whole thing rolled across a running cluster without a customer-visible blip. None of it was hard in the sense of requiring deep wizardry. It was hard in the sense of requiring discipline: tailor first, check the package state, never trust a piped fix run, respect quorum, watch the console, and treat the written exceptions as the real deliverable.

This is part of how we run Open Edge Cloud, our managed OpenStack platform, where the hardening baseline and its drift detection are controls we operate continuously rather than a one-time exercise. If your team is staring down a CIS or STIG benchmark on infrastructure that cannot afford a maintenance window, that gap between "ran the tool" and "did it on a live system without breaking anything" is exactly the kind of work Joscor takes on.
