Skip to content

nexus: add failure reasons to Failed VMMs using a VmmState domain type#10425

Merged
hawkw merged 16 commits into
mainfrom
eliza/failure-reason-2
May 16, 2026
Merged

nexus: add failure reasons to Failed VMMs using a VmmState domain type#10425
hawkw merged 16 commits into
mainfrom
eliza/failure-reason-2

Conversation

@hawkw
Copy link
Copy Markdown
Member

@hawkw hawkw commented May 8, 2026

Okay, so this is a large and unpleasant change, and I apologize from the bottom of my heart to those of you who are being forced to review it. Nonetheless, I do feel that this is justified and perhaps the best of the many bad solutions to the problem. And, I think that the actual change is not ultimately that complex, despite the number of files which had to be touched.

In summary, this is an attempt to solve the same problem as #10285: when a VMM is moved to the Failed state, it sure would be nice if we could write down why that happened somewhere where it is possible to see that information ever again. Unfortunately, that change ran afoul of the present cyclic dependency between the Nexus internal and sled-agent APIs: we cannot easily add additional data to the SledVmmState type in sled_agent_types, because it is part of both the server-side versioned sled-agent API and the client-side versioned Nexus API. My previous attempt to resolve this stickiness, #10303, was reasonably deemed somewhat too sketchy, and rather than moving forwards with that, I think we are better off waiting for client-side versioning to be a thing which is possible.

Therefore, this change punts on including more specific failure details for failed VMMs reported by sled-agent. Instead, when the sled-agent reports that a VMM has failed, we just synthesize a VmmFailureReason::FromSledAgent. When the failure is detected by Nexus without the sled-agent reporting the failure, we record one of several other failure reasons. This is implemented by adding a new layer of type indirection between the sled-agent API types and the DB model types: a nexus_types::instance::SledVmmState. This way, when Nexus would like to internally record a failed VMM state, it can provide a failure reason, and at the boundary, when a failed state is received from a sled-agent, the From<sled_agent_types::instance::VmmInstanceState> for nexus_types::instance::SledVmmState can just fill in the failure reason.

The other thing that this indirection gives us, which is either good or bad (?), is that we can make the nexus_types::instance::VmmState enum have a VmmFailureReason in its Failed state, so the shape of the domain type enforces that there is a failure reason if the VMM is failed, and there is not a failure reason otherwise. Since these are separate columns in the database, the DB model type can represent a bunch of invalid states, which I didn't love. Unfortunately, having an additional domain type for VmmState is part of why this change is so massive, as it requires an additional layer of type conversions and such, and results in touching way more files. I could be convinced this is not the right thing, but I still kind of loosely think it's probably worth it.

Most of this change is just updating everything to use the new types.

Closes #10285

@hawkw hawkw changed the title [WIP] GIANT HORRIBLE VMM STATE REFACTOR nexus: add failure reasons to Failed VMMs using a VmmState domain type May 11, 2026
@hawkw hawkw marked this pull request as ready for review May 11, 2026 22:12
Copy link
Copy Markdown
Member Author

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some notes to hopefully make this loud and awkward PR a bit more comfortable to review. Also, I'm tagging in @sunshowers in the hopes that they will help to defend my decision to have three versions of these types in case anyone else gets mad at me for that :)

Comment on lines 227 to +228
self.outcome = CheckOutcome::Failure(Failure::SledExpunged);
return mk_failed();
return mk_failed(VmmFailureReason::SledExpunged);
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In future, I would like to replace the Failure type which is used internally to the instance_watcher task with just using VmmFailureReason --- I didn't do that in this PR because the diff has gotten big enough already, but will do that in a subsequent change if y'all decide you don't totally hate what I've proposed thus far.

Comment on lines +11 to +19
#[derive(Clone, Debug, Deserialize, Serialize, Eq, PartialEq)]
pub struct SledVmmState {
pub vmm_state: VmmRuntimeState,
/// The current state of any inbound migration to this VMM.
pub migration_in: Option<sled_agent::MigrationRuntimeState>,

/// The state of any outbound migration from this VMM.
pub migration_out: Option<sled_agent::MigrationRuntimeState>,
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the "domain type" version of SledVmmState

Comment on lines +29 to +36
impl SledVmmState {
pub fn migrations(&self) -> Migrations<'_> {
Migrations {
migration_in: self.migration_in.as_ref(),
migration_out: self.migration_out.as_ref(),
}
}
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This just moved from its previous home in sled_agent_types::instance and has not otherwise changed.

Comment on lines +38 to +48
/// The dynamic runtime properties of an individual VMM process.
#[derive(Clone, Debug, Deserialize, Serialize, Eq, PartialEq)]
pub struct VmmRuntimeState {
/// The last state reported by this VMM.
pub state: VmmState,
/// The generation number for this VMM's state.
#[serde(rename = "gen")]
pub generation: Generation,
/// Timestamp for the VMM's state.
pub time_updated: DateTime<Utc>,
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same as the VmmRuntimeState in sled_agent_types but using the domain type version of VmmState, which has a failure reason...

Comment on lines +51 to +66
/// Transitions the VMM state from the current state to `new_state`,
/// returning a `VmmRuntimeState` representing the VMM after transitioning
/// to the new state.
///
/// If `new_state` is the same as the current state, returns a copy of
/// `self` without any changes.
pub fn transition(&self, new_state: VmmState) -> Self {
if new_state == self.state {
return self.clone();
}
Self {
state: new_state,
generation: self.generation.next(),
time_updated: Utc::now(),
}
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a helper for reducing the boilerplate of doing these state transitions, which I thought was nice --- especially because it enforces the property that you bump the generation when doing a state transition, which you previously had to remember to do.

Comment on lines +128 to +158
/// Returns `true` if this is a terminal VMM state.
///
/// A VMM in a terminal state will never transition to another state.
pub fn is_terminal(&self) -> bool {
matches!(self, VmmState::Destroyed | VmmState::Failed(_))
}

/// Returns `true` if the VMM is in a `Failed` state.
pub fn is_failed(&self) -> bool {
matches!(self, VmmState::Failed(_))
}

/// Returns `true` if the VMM is in a state where it is safe to
/// deallocate its sled resources and mark it as deleted.
pub fn is_destroyable(&self) -> bool {
matches!(
self,
VmmState::Destroyed | VmmState::Failed(_) | VmmState::SagaUnwound
)
}

/// Returns `true` if the VMM exists on a sled.
///
/// This is `false` for VMMs that have not yet been created, have been
/// destroyed, or were produced by an unwound saga and will never exist.
pub fn exists_on_sled(&self) -> bool {
!matches!(
self,
VmmState::Creating | VmmState::SagaUnwound | VmmState::Destroyed
)
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These predicates previously existed in the nexus_db_model version of the type; I've moved them here.

Comment on lines +167 to +185
impl From<sled_agent::VmmState> for VmmState {
fn from(state: sled_agent::VmmState) -> Self {
match state {
sled_agent::VmmState::Starting => VmmState::Starting,
sled_agent::VmmState::Running => VmmState::Running,
sled_agent::VmmState::Stopping => VmmState::Stopping,
sled_agent::VmmState::Stopped => VmmState::Stopped,
sled_agent::VmmState::Rebooting => VmmState::Rebooting,
sled_agent::VmmState::Migrating => VmmState::Migrating,
sled_agent::VmmState::Failed => {
// If we are converting a state received from a sled-agent that
// indicates that the VMM is failed, the failure reason is
// implicitly "from_sled_agent" for now.
VmmState::Failed(VmmFailureReason::FromSledAgent)
}
sled_agent::VmmState::Destroyed => VmmState::Destroyed,
}
}
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where we now synthesize the FromSledAgent failure reason when receiv9ing things from the sled-agent.

Comment on lines +187 to +209
impl From<VmmState> for sled_agent::VmmState {
fn from(state: VmmState) -> Self {
match state {
// The `Creating` state is internal to Nexus; the outside world
// should treat it as equivalent to `Starting`.
VmmState::Starting | VmmState::Creating => {
sled_agent::VmmState::Starting
}
VmmState::Running => sled_agent::VmmState::Running,
VmmState::Stopping => sled_agent::VmmState::Stopping,
VmmState::Stopped => sled_agent::VmmState::Stopped,
VmmState::Rebooting => sled_agent::VmmState::Rebooting,
VmmState::Migrating => sled_agent::VmmState::Migrating,
VmmState::Failed(_) => sled_agent::VmmState::Failed,

// The `SagaUnwound` state is internal to Nexus; the outside world
// should treat it as equivalent to `Destroyed`.
VmmState::Destroyed | VmmState::SagaUnwound => {
sled_agent::VmmState::Destroyed
}
}
}
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This conversion already existed but was defined on the nexus_db_model type rather than here.

Comment on lines +211 to +236
impl From<VmmState> for omicron_common::api::external::InstanceState {
fn from(value: VmmState) -> Self {
use omicron_common::api::external::InstanceState as Output;

match value {
// An instance with a VMM which is in the `Creating` state maps to
// `InstanceState::Starting`, rather than `InstanceState::Creating`.
// If we are still creating the VMM, this is because we are
// attempting to *start* the instance; instances may be created
// without creating a VMM to run them, and then started later.
VmmState::Creating | VmmState::Starting => Output::Starting,
VmmState::Running => Output::Running,
VmmState::Stopping => Output::Stopping,
// `SagaUnwound` should map to `Stopped` so that an `instance_view`
// API call that produces an instance with an unwound VMM will appear to
// be `Stopped`. This is because instances with unwound VMMs can
// be started by a subsequent instance-start saga, just like
// instances whose internal state actually is `Stopped`.
VmmState::Stopped | VmmState::SagaUnwound => Output::Stopped,
VmmState::Rebooting => Output::Rebooting,
VmmState::Migrating => Output::Migrating,
VmmState::Failed(_) => Output::Failed,
VmmState::Destroyed => Output::Destroyed,
}
}
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This conversion already existed but was defined on the nexus_db_model type.

Comment on lines +280 to +290
#[derive(Copy, Clone, Debug, Default)]
pub struct Migrations<'state> {
pub migration_in: Option<&'state sled_agent::MigrationRuntimeState>,
pub migration_out: Option<&'state sled_agent::MigrationRuntimeState>,
}

impl Migrations<'_> {
pub fn empty() -> Self {
Self { migration_in: None, migration_out: None }
}
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This used to be defined in sled-agent-types but wasn't actually an API type at all, it just lived in the same module as the API types. It lives here now.

@hawkw hawkw requested review from iximeow and smklein May 11, 2026 22:26
@hawkw hawkw self-assigned this May 11, 2026
hawkw added 4 commits May 11, 2026 15:28
this is slightly sad because i prefered the `NoSuchVmm` naming, but
would like the `IntoStaticStr` to remain `no_such_instance` so that a
subsequent change to make the reason types in `instance_watcher`'s
metrics use the new `VmmFailureReason` type, and it already uses this
string, and i don't wanna change that...
if state == &db::model::VmmState::Failed {
println!(
"{indent}{FAILURE_NOTE:>width$}: {}",
reason.description()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this description going to be different from the to_string() impl above (on line 8021)?

Copy link
Copy Markdown
Member Author

@hawkw hawkw May 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

description() provides a human-readable sentence describing what this reason means, while the to_string() impl returns the same string used as the database enum value. I wanted to include both here, so that you can see both the actual value in the DB and a more helpful explanation of what that means.

Comment thread schema/crdb/dbinit.sql
Comment on lines +6041 to +6042
* A request to the sled-agent received a response indicating that this
* VMM is no longer present on the sled.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the case which would happen on sled agent rebooting (or the whole sled rebooting), correct?

Is it worth calling that out explicitly? Are we aware of other situations where "no such instance" gets returned? I guess like "slow request got superseded by a different nexus"

I mostly bring this up to distinguish from the "sled off" and "from sled agent" cases

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth calling that out explicitly? Are we aware of other situations where "no such instance" gets returned?

It's returned any time that the sled-agent restarts, whether due to the sled rebooting or a panic.

I guess like "slow request got superseded by a different nexus"

I don't think we would ever see this due to a race with another Nexus trying to stop an instance or anything like that. Because the sled-agent handles all instance requests from nexus serially within that instance, we will have either already written the Destroyed state to the database and therefore never write a Failed state to the db (because the VMM's generation has been incremented), or we will drain all other inflight requests to get the VMM's state when it is destroyed and respond with the destroyed state. This really only happens if sled agent has restarted (or if there's a very werid sled-agent bug I guess).

Comment thread nexus/types/src/instance.rs Outdated
impl VmmFailureReason {
pub fn description(&self) -> &'static str {
match self {
Self::Prehistoric => "<VMM failed prior to recorded history>",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Are the <> characters actually useful here

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably not!

Eq,
Serialize,
Deserialize,
strum::Display,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use:

    #[strum(to_string = "reason")]

And drop description? idk. Seems a bit weird to have both

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand this suggestion --- the Display impl is intended to format the concise name of the failure reason, which matches the value used in the db enum and (eventually) the instance_watcher metrics labels, which (as I noted in #10425 (comment)) have not yet been replaced with this enum, but which I would like to replace with the VmmFailureReason type in a subsequent PR. I didn't do that here, because there is a bunch of refactoring I would like to do in that task, and I didn't want to add even more code to this PR.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If what you're suggesting here is that the Display impl could print the human readable sentence and the db-enum-like short label could just be returned by a separate label() or as_str() or something, I'm open to changing it. We do need both, but I don't have a very strong opinion about which one is the Display impl.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eh, it's fine, if you're trying to use them as "labels vs full descriptions" I think it's fine as-is. I brought it up because I thought "we might only care about full descriptions", but it sounds like you have a valid use-case for the labels as distinct things too

"error" => ?inner,
"reason" => %reason);

if let Err(set_failed_error) = osagactx
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compared to the old code - which only called mark_vmm_failed when inner.vmm_gone() was true - is it intentional that we do this marking for "any vmm failure reason"?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is intentional. Presently, the behavior here is identical to the previous behavior: SledAgentInstanceError::vmm_failure_reason only returns Some(VmmFailureReason::NoSuchInstance) if vmm_gone() returns true:

/// If this error indicates that the VMM should be marked as `Failed`,
/// returns the reason for the failure.
pub fn vmm_failure_reason(&self) -> Option<db::model::VmmFailureReason> {
if self.vmm_gone() {
Some(db::model::VmmFailureReason::NoSuchInstance)
} else {
None
}
}

The reason I changed this to optionally return any failure reason is that I wanted to set up for a subsequent change to make the code which attempts to tell the sled-agent to do a state change follow up on CommunicationErrors by checking if the sled is on, similarly to how instance_watcher now does. If we make that subsequent change, I anticipate changing SledAgentInstanceError to also be able to return the SledOff variant.

generation: Generation(vmm2.generation.0.next()),
&nexus_types::instance::VmmRuntimeState {
time_updated: Utc::now(),
generation: vmm2.generation.0.next().next(),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why call next() twice here?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous code was not accurately simulating the actual behavior. vmm2 was the original VMM inserted into the db (generation 1):

let vmm2 = datastore
.vmm_insert(
&opctx,
Vmm {
id: Uuid::new_v4(),
time_created: Utc::now(),
time_deleted: None,
instance_id: instance_id.into_untyped_uuid(),
sled_id: SledUuid::new_v4().into(),
propolis_ip: "10.1.9.42".parse().unwrap(),
propolis_port: 420.into(),
cpu_platform: VmmCpuPlatform::SledDefault,
time_state_updated: Utc::now(),
generation: Generation::new(),
state: db::model::VmmState::Running,
failure_reason: None,
},
)
.await
.expect("VMM 2 should be inserted successfully!");

Subsequently, we insert a new state to transition it to Running, bumping the generation to 2:

&vmm2.runtime().transition(VmmState::Running),

So here we have to bump the generation twice for the insert to apply the next state.

I think this could be refactored a bit so that we store the updated state when we transition it to running, and just transition it again, which might be neater?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eh, I'll defer that to you! Just figured I'd flag it, because it seemed "mostly unrelated to the rest of this PR" but still like a behavioral change

@hawkw hawkw enabled auto-merge (squash) May 15, 2026 19:19
@hawkw hawkw merged commit ebeb706 into main May 16, 2026
16 checks passed
@hawkw hawkw deleted the eliza/failure-reason-2 branch May 16, 2026 01:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants