nexus: add failure reasons to `Failed` VMMs using a `VmmState` domain type by hawkw · Pull Request #10425 · oxidecomputer/omicron

hawkw · 2026-05-08T23:14:23Z

Okay, so this is a large and unpleasant change, and I apologize from the bottom of my heart to those of you who are being forced to review it. Nonetheless, I do feel that this is justified and perhaps the best of the many bad solutions to the problem. And, I think that the actual change is not ultimately that complex, despite the number of files which had to be touched.

In summary, this is an attempt to solve the same problem as #10285: when a VMM is moved to the Failed state, it sure would be nice if we could write down why that happened somewhere where it is possible to see that information ever again. Unfortunately, that change ran afoul of the present cyclic dependency between the Nexus internal and sled-agent APIs: we cannot easily add additional data to the SledVmmState type in sled_agent_types, because it is part of both the server-side versioned sled-agent API and the client-side versioned Nexus API. My previous attempt to resolve this stickiness, #10303, was reasonably deemed somewhat too sketchy, and rather than moving forwards with that, I think we are better off waiting for client-side versioning to be a thing which is possible.

Therefore, this change punts on including more specific failure details for failed VMMs reported by sled-agent. Instead, when the sled-agent reports that a VMM has failed, we just synthesize a VmmFailureReason::FromSledAgent. When the failure is detected by Nexus without the sled-agent reporting the failure, we record one of several other failure reasons. This is implemented by adding a new layer of type indirection between the sled-agent API types and the DB model types: a nexus_types::instance::SledVmmState. This way, when Nexus would like to internally record a failed VMM state, it can provide a failure reason, and at the boundary, when a failed state is received from a sled-agent, the From<sled_agent_types::instance::VmmInstanceState> for nexus_types::instance::SledVmmState can just fill in the failure reason.

The other thing that this indirection gives us, which is either good or bad (?), is that we can make the nexus_types::instance::VmmState enum have a VmmFailureReason in its Failed state, so the shape of the domain type enforces that there is a failure reason if the VMM is failed, and there is not a failure reason otherwise. Since these are separate columns in the database, the DB model type can represent a bunch of invalid states, which I didn't love. Unfortunately, having an additional domain type for VmmState is part of why this change is so massive, as it requires an additional layer of type conversions and such, and results in touching way more files. I could be convinced this is not the right thing, but I still kind of loosely think it's probably worth it.

Most of this change is just updating everything to use the new types.

Closes #10285

hawkw

Some notes to hopefully make this loud and awkward PR a bit more comfortable to review. Also, I'm tagging in @sunshowers in the hopes that they will help to defend my decision to have three versions of these types in case anyone else gets mad at me for that :)

hawkw · 2026-05-11T22:16:07Z

            self.outcome = CheckOutcome::Failure(Failure::SledExpunged);
-            return mk_failed();
+            return mk_failed(VmmFailureReason::SledExpunged);


In future, I would like to replace the Failure type which is used internally to the instance_watcher task with just using VmmFailureReason --- I didn't do that in this PR because the diff has gotten big enough already, but will do that in a subsequent change if y'all decide you don't totally hate what I've proposed thus far.

hawkw · 2026-05-11T22:16:36Z

+#[derive(Clone, Debug, Deserialize, Serialize, Eq, PartialEq)]
+pub struct SledVmmState {
+    pub vmm_state: VmmRuntimeState,
+    /// The current state of any inbound migration to this VMM.
+    pub migration_in: Option<sled_agent::MigrationRuntimeState>,
+
+    /// The state of any outbound migration from this VMM.
+    pub migration_out: Option<sled_agent::MigrationRuntimeState>,
+}


this is the "domain type" version of SledVmmState

hawkw · 2026-05-11T22:17:02Z

+impl SledVmmState {
+    pub fn migrations(&self) -> Migrations<'_> {
+        Migrations {
+            migration_in: self.migration_in.as_ref(),
+            migration_out: self.migration_out.as_ref(),
+        }
+    }
+}


This just moved from its previous home in sled_agent_types::instance and has not otherwise changed.

hawkw · 2026-05-11T22:17:36Z

+/// The dynamic runtime properties of an individual VMM process.
+#[derive(Clone, Debug, Deserialize, Serialize, Eq, PartialEq)]
+pub struct VmmRuntimeState {
+    /// The last state reported by this VMM.
+    pub state: VmmState,
+    /// The generation number for this VMM's state.
+    #[serde(rename = "gen")]
+    pub generation: Generation,
+    /// Timestamp for the VMM's state.
+    pub time_updated: DateTime<Utc>,
+}


This is the same as the VmmRuntimeState in sled_agent_types but using the domain type version of VmmState, which has a failure reason...

hawkw · 2026-05-11T22:18:33Z

+    /// Transitions the VMM state from the current state to `new_state`,
+    /// returning a `VmmRuntimeState` representing the VMM after transitioning
+    /// to the new state.
+    ///
+    /// If `new_state` is the same as the current state, returns a copy of
+    /// `self` without any changes.
+    pub fn transition(&self, new_state: VmmState) -> Self {
+        if new_state == self.state {
+            return self.clone();
+        }
+        Self {
+            state: new_state,
+            generation: self.generation.next(),
+            time_updated: Utc::now(),
+        }
+    }


This is a helper for reducing the boilerplate of doing these state transitions, which I thought was nice --- especially because it enforces the property that you bump the generation when doing a state transition, which you previously had to remember to do.

hawkw · 2026-05-11T22:20:48Z

+    /// Returns `true` if this is a terminal VMM state.
+    ///
+    /// A VMM in a terminal state will never transition to another state.
+    pub fn is_terminal(&self) -> bool {
+        matches!(self, VmmState::Destroyed | VmmState::Failed(_))
+    }
+
+    /// Returns `true` if the VMM is in a `Failed` state.
+    pub fn is_failed(&self) -> bool {
+        matches!(self, VmmState::Failed(_))
+    }
+
+    /// Returns `true` if the VMM is in a state where it is safe to
+    /// deallocate its sled resources and mark it as deleted.
+    pub fn is_destroyable(&self) -> bool {
+        matches!(
+            self,
+            VmmState::Destroyed | VmmState::Failed(_) | VmmState::SagaUnwound
+        )
+    }
+
+    /// Returns `true` if the VMM exists on a sled.
+    ///
+    /// This is `false` for VMMs that have not yet been created, have been
+    /// destroyed, or were produced by an unwound saga and will never exist.
+    pub fn exists_on_sled(&self) -> bool {
+        !matches!(
+            self,
+            VmmState::Creating | VmmState::SagaUnwound | VmmState::Destroyed
+        )
+    }


These predicates previously existed in the nexus_db_model version of the type; I've moved them here.

hawkw · 2026-05-11T22:21:23Z

+impl From<sled_agent::VmmState> for VmmState {
+    fn from(state: sled_agent::VmmState) -> Self {
+        match state {
+            sled_agent::VmmState::Starting => VmmState::Starting,
+            sled_agent::VmmState::Running => VmmState::Running,
+            sled_agent::VmmState::Stopping => VmmState::Stopping,
+            sled_agent::VmmState::Stopped => VmmState::Stopped,
+            sled_agent::VmmState::Rebooting => VmmState::Rebooting,
+            sled_agent::VmmState::Migrating => VmmState::Migrating,
+            sled_agent::VmmState::Failed => {
+                // If we are converting a state received from a sled-agent that
+                // indicates that the VMM is failed, the failure reason is
+                // implicitly "from_sled_agent" for now.
+                VmmState::Failed(VmmFailureReason::FromSledAgent)
+            }
+            sled_agent::VmmState::Destroyed => VmmState::Destroyed,
+        }
+    }
+}


This is where we now synthesize the FromSledAgent failure reason when receiv9ing things from the sled-agent.

hawkw · 2026-05-11T22:21:42Z

+impl From<VmmState> for sled_agent::VmmState {
+    fn from(state: VmmState) -> Self {
+        match state {
+            // The `Creating` state is internal to Nexus; the outside world
+            // should treat it as equivalent to `Starting`.
+            VmmState::Starting | VmmState::Creating => {
+                sled_agent::VmmState::Starting
+            }
+            VmmState::Running => sled_agent::VmmState::Running,
+            VmmState::Stopping => sled_agent::VmmState::Stopping,
+            VmmState::Stopped => sled_agent::VmmState::Stopped,
+            VmmState::Rebooting => sled_agent::VmmState::Rebooting,
+            VmmState::Migrating => sled_agent::VmmState::Migrating,
+            VmmState::Failed(_) => sled_agent::VmmState::Failed,
+
+            // The `SagaUnwound` state is internal to Nexus; the outside world
+            // should treat it as equivalent to `Destroyed`.
+            VmmState::Destroyed | VmmState::SagaUnwound => {
+                sled_agent::VmmState::Destroyed
+            }
+        }
+    }
+}


This conversion already existed but was defined on the nexus_db_model type rather than here.

hawkw · 2026-05-11T22:21:58Z

+impl From<VmmState> for omicron_common::api::external::InstanceState {
+    fn from(value: VmmState) -> Self {
+        use omicron_common::api::external::InstanceState as Output;
+
+        match value {
+            // An instance with a VMM which is in the `Creating` state maps to
+            // `InstanceState::Starting`, rather than `InstanceState::Creating`.
+            // If we are still creating the VMM, this is because we are
+            // attempting to *start* the instance; instances may be created
+            // without creating a VMM to run them, and then started later.
+            VmmState::Creating | VmmState::Starting => Output::Starting,
+            VmmState::Running => Output::Running,
+            VmmState::Stopping => Output::Stopping,
+            // `SagaUnwound` should map to `Stopped` so that an `instance_view`
+            // API call that produces an instance with an unwound VMM will appear to
+            // be `Stopped`. This is because instances with unwound VMMs can
+            // be started by a subsequent instance-start saga, just like
+            // instances whose internal state actually is `Stopped`.
+            VmmState::Stopped | VmmState::SagaUnwound => Output::Stopped,
+            VmmState::Rebooting => Output::Rebooting,
+            VmmState::Migrating => Output::Migrating,
+            VmmState::Failed(_) => Output::Failed,
+            VmmState::Destroyed => Output::Destroyed,
+        }
+    }
+}


This conversion already existed but was defined on the nexus_db_model type.

hawkw · 2026-05-11T22:23:50Z

+#[derive(Copy, Clone, Debug, Default)]
+pub struct Migrations<'state> {
+    pub migration_in: Option<&'state sled_agent::MigrationRuntimeState>,
+    pub migration_out: Option<&'state sled_agent::MigrationRuntimeState>,
+}
+
+impl Migrations<'_> {
+    pub fn empty() -> Self {
+        Self { migration_in: None, migration_out: None }
+    }
+}


This used to be defined in sled-agent-types but wasn't actually an API type at all, it just lived in the same module as the API types. It lives here now.

this is slightly sad because i prefered the `NoSuchVmm` naming, but would like the `IntoStaticStr` to remain `no_such_instance` so that a subsequent change to make the reason types in `instance_watcher`'s metrics use the new `VmmFailureReason` type, and it already uses this string, and i don't wanna change that...

smklein · 2026-05-15T16:30:34Z

+        if state == &db::model::VmmState::Failed {
+            println!(
+                "{indent}{FAILURE_NOTE:>width$}: {}",
+                reason.description()


Is this description going to be different from the to_string() impl above (on line 8021)?

description() provides a human-readable sentence describing what this reason means, while the to_string() impl returns the same string used as the database enum value. I wanted to include both here, so that you can see both the actual value in the DB and a more helpful explanation of what that means.

smklein · 2026-05-15T17:14:49Z

+     * A request to the sled-agent received a response indicating that this
+     * VMM is no longer present on the sled.


This is the case which would happen on sled agent rebooting (or the whole sled rebooting), correct?

Is it worth calling that out explicitly? Are we aware of other situations where "no such instance" gets returned? I guess like "slow request got superseded by a different nexus"

I mostly bring this up to distinguish from the "sled off" and "from sled agent" cases

Is it worth calling that out explicitly? Are we aware of other situations where "no such instance" gets returned?

It's returned any time that the sled-agent restarts, whether due to the sled rebooting or a panic.

I guess like "slow request got superseded by a different nexus"

I don't think we would ever see this due to a race with another Nexus trying to stop an instance or anything like that. Because the sled-agent handles all instance requests from nexus serially within that instance, we will have either already written the Destroyed state to the database and therefore never write a Failed state to the db (because the VMM's generation has been incremented), or we will drain all other inflight requests to get the VMM's state when it is destroyed and respond with the destroyed state. This really only happens if sled agent has restarted (or if there's a very werid sled-agent bug I guess).

smklein · 2026-05-15T17:16:58Z

+impl VmmFailureReason {
+    pub fn description(&self) -> &'static str {
+        match self {
+            Self::Prehistoric => "<VMM failed prior to recorded history>",


nit: Are the <> characters actually useful here

probably not!

smklein · 2026-05-15T17:20:01Z

+    Eq,
+    Serialize,
+    Deserialize,
+    strum::Display,


Should we use:

#[strum(to_string = "reason")]

And drop description? idk. Seems a bit weird to have both

I don't really understand this suggestion --- the Display impl is intended to format the concise name of the failure reason, which matches the value used in the db enum and (eventually) the instance_watcher metrics labels, which (as I noted in #10425 (comment)) have not yet been replaced with this enum, but which I would like to replace with the VmmFailureReason type in a subsequent PR. I didn't do that here, because there is a bunch of refactoring I would like to do in that task, and I didn't want to add even more code to this PR.

If what you're suggesting here is that the Display impl could print the human readable sentence and the db-enum-like short label could just be returned by a separate label() or as_str() or something, I'm open to changing it. We do need both, but I don't have a very strong opinion about which one is the Display impl.

eh, it's fine, if you're trying to use them as "labels vs full descriptions" I think it's fine as-is. I brought it up because I thought "we might only care about full descriptions", but it sounds like you have a valid use-case for the labels as distinct things too

smklein · 2026-05-15T17:23:23Z

+                           "error" => ?inner,
+                           "reason" => %reason);
+
+                    if let Err(set_failed_error) = osagactx


Compared to the old code - which only called mark_vmm_failed when inner.vmm_gone() was true - is it intentional that we do this marking for "any vmm failure reason"?

Yes, this is intentional. Presently, the behavior here is identical to the previous behavior: SledAgentInstanceError::vmm_failure_reason only returns Some(VmmFailureReason::NoSuchInstance) if vmm_gone() returns true:

omicron/nexus/src/app/instance.rs

Lines 193 to 201 in dc07899

/// If this error indicates that the VMM should be marked as `Failed`,

/// returns the reason for the failure.

pub fn vmm_failure_reason(&self) -> Option<db::model::VmmFailureReason> {

if self.vmm_gone() {

Some(db::model::VmmFailureReason::NoSuchInstance)

} else {

None

}

}

The reason I changed this to optionally return any failure reason is that I wanted to set up for a subsequent change to make the code which attempts to tell the sled-agent to do a state change follow up on CommunicationErrors by checking if the sled is on, similarly to how instance_watcher now does. If we make that subsequent change, I anticipate changing SledAgentInstanceError to also be able to return the SledOff variant.

smklein · 2026-05-15T17:25:47Z

-                    generation: Generation(vmm2.generation.0.next()),
+                &nexus_types::instance::VmmRuntimeState {
+                    time_updated: Utc::now(),
+                    generation: vmm2.generation.0.next().next(),


Why call next() twice here?

The previous code was not accurately simulating the actual behavior. vmm2 was the original VMM inserted into the db (generation 1):

omicron/nexus/db-queries/src/db/datastore/vmm.rs

Lines 531 to 550 in dc07899

let vmm2 = datastore

.vmm_insert(

&opctx,

Vmm {

id: Uuid::new_v4(),

time_created: Utc::now(),

time_deleted: None,

instance_id: instance_id.into_untyped_uuid(),

sled_id: SledUuid::new_v4().into(),

propolis_ip: "10.1.9.42".parse().unwrap(),

propolis_port: 420.into(),

cpu_platform: VmmCpuPlatform::SledDefault,

time_state_updated: Utc::now(),

generation: Generation::new(),

state: db::model::VmmState::Running,

failure_reason: None,

},

)

.await

.expect("VMM 2 should be inserted successfully!");

Subsequently, we insert a new state to transition it to Running, bumping the generation to 2:

omicron/nexus/db-queries/src/db/datastore/vmm.rs

Line 596 in dc07899

&vmm2.runtime().transition(VmmState::Running),

So here we have to bump the generation twice for the insert to apply the next state.

I think this could be refactored a bit so that we store the updated state when we transition it to running, and just transition it again, which might be neater?

eh, I'll defer that to you! Just figured I'd flag it, because it seemed "mostly unrelated to the rest of this PR" but still like a behavioral change

hawkw added 9 commits May 8, 2026 10:51

start adding vmm failure reasons from nexus only

a7c9ded

start breaking a whole bunch of stuff

bb3bf07

start fixing some of the stuff i broke

c0a719d

put back from_vmm_state

117999e

more unbreakening stuff

458277b

tests for consistency between nexus types and db types

fefa831

migration

48ffdb3

more thing

c349099

you have to SET LOCAL disallow_full_table_scans = off

2368b70

hawkw changed the title ~~[WIP] GIANT HORRIBLE VMM STATE REFACTOR~~ nexus: add failure reasons to Failed VMMs using a VmmState domain type May 11, 2026

hawkw marked this pull request as ready for review May 11, 2026 22:12

hawkw commented May 11, 2026

View reviewed changes

hawkw requested review from iximeow and smklein May 11, 2026 22:26

hawkw self-assigned this May 11, 2026

hawkw added 4 commits May 11, 2026 15:28

Merge branch 'main' into eliza/failure-reason-2

58e21fb

ARGH I FORGOT TO SAVE

8028475

I FORGOT TO SAVE THIS ONE TOO

dc07899

smklein approved these changes May 15, 2026

View reviewed changes

hawkw added 2 commits May 15, 2026 11:59

remove <>

d5f0651

get rid of .next().next()

88d52ec

hawkw enabled auto-merge (squash) May 15, 2026 19:19

Merge branch 'main' into eliza/failure-reason-2

63ede60

hawkw merged commit ebeb706 into main May 16, 2026
16 checks passed

hawkw deleted the eliza/failure-reason-2 branch May 16, 2026 01:14

		* A request to the sled-agent received a response indicating that this
		* VMM is no longer present on the sled.

	/// If this error indicates that the VMM should be marked as `Failed`,
	/// returns the reason for the failure.
	pub fn vmm_failure_reason(&self) -> Option<db::model::VmmFailureReason> {
	if self.vmm_gone() {
	Some(db::model::VmmFailureReason::NoSuchInstance)
	} else {
	None
	}
	}

	let vmm2 = datastore
	.vmm_insert(
	&opctx,
	Vmm {
	id: Uuid::new_v4(),
	time_created: Utc::now(),
	time_deleted: None,
	instance_id: instance_id.into_untyped_uuid(),
	sled_id: SledUuid::new_v4().into(),
	propolis_ip: "10.1.9.42".parse().unwrap(),
	propolis_port: 420.into(),
	cpu_platform: VmmCpuPlatform::SledDefault,
	time_state_updated: Utc::now(),
	generation: Generation::new(),
	state: db::model::VmmState::Running,
	failure_reason: None,
	},
	)
	.await
	.expect("VMM 2 should be inserted successfully!");

Conversation

hawkw commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hawkw left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hawkw May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

hawkw commented May 8, 2026 •

edited

Loading

hawkw May 15, 2026 •

edited

Loading