nexus: add failure reasons to Failed VMMs using a VmmState domain type#10425
Conversation
Failed VMMs using a VmmState domain type
hawkw
left a comment
There was a problem hiding this comment.
Some notes to hopefully make this loud and awkward PR a bit more comfortable to review. Also, I'm tagging in @sunshowers in the hopes that they will help to defend my decision to have three versions of these types in case anyone else gets mad at me for that :)
| self.outcome = CheckOutcome::Failure(Failure::SledExpunged); | ||
| return mk_failed(); | ||
| return mk_failed(VmmFailureReason::SledExpunged); |
There was a problem hiding this comment.
In future, I would like to replace the Failure type which is used internally to the instance_watcher task with just using VmmFailureReason --- I didn't do that in this PR because the diff has gotten big enough already, but will do that in a subsequent change if y'all decide you don't totally hate what I've proposed thus far.
| #[derive(Clone, Debug, Deserialize, Serialize, Eq, PartialEq)] | ||
| pub struct SledVmmState { | ||
| pub vmm_state: VmmRuntimeState, | ||
| /// The current state of any inbound migration to this VMM. | ||
| pub migration_in: Option<sled_agent::MigrationRuntimeState>, | ||
|
|
||
| /// The state of any outbound migration from this VMM. | ||
| pub migration_out: Option<sled_agent::MigrationRuntimeState>, | ||
| } |
There was a problem hiding this comment.
this is the "domain type" version of SledVmmState
| impl SledVmmState { | ||
| pub fn migrations(&self) -> Migrations<'_> { | ||
| Migrations { | ||
| migration_in: self.migration_in.as_ref(), | ||
| migration_out: self.migration_out.as_ref(), | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
This just moved from its previous home in sled_agent_types::instance and has not otherwise changed.
| /// The dynamic runtime properties of an individual VMM process. | ||
| #[derive(Clone, Debug, Deserialize, Serialize, Eq, PartialEq)] | ||
| pub struct VmmRuntimeState { | ||
| /// The last state reported by this VMM. | ||
| pub state: VmmState, | ||
| /// The generation number for this VMM's state. | ||
| #[serde(rename = "gen")] | ||
| pub generation: Generation, | ||
| /// Timestamp for the VMM's state. | ||
| pub time_updated: DateTime<Utc>, | ||
| } |
There was a problem hiding this comment.
This is the same as the VmmRuntimeState in sled_agent_types but using the domain type version of VmmState, which has a failure reason...
| /// Transitions the VMM state from the current state to `new_state`, | ||
| /// returning a `VmmRuntimeState` representing the VMM after transitioning | ||
| /// to the new state. | ||
| /// | ||
| /// If `new_state` is the same as the current state, returns a copy of | ||
| /// `self` without any changes. | ||
| pub fn transition(&self, new_state: VmmState) -> Self { | ||
| if new_state == self.state { | ||
| return self.clone(); | ||
| } | ||
| Self { | ||
| state: new_state, | ||
| generation: self.generation.next(), | ||
| time_updated: Utc::now(), | ||
| } | ||
| } |
There was a problem hiding this comment.
This is a helper for reducing the boilerplate of doing these state transitions, which I thought was nice --- especially because it enforces the property that you bump the generation when doing a state transition, which you previously had to remember to do.
| /// Returns `true` if this is a terminal VMM state. | ||
| /// | ||
| /// A VMM in a terminal state will never transition to another state. | ||
| pub fn is_terminal(&self) -> bool { | ||
| matches!(self, VmmState::Destroyed | VmmState::Failed(_)) | ||
| } | ||
|
|
||
| /// Returns `true` if the VMM is in a `Failed` state. | ||
| pub fn is_failed(&self) -> bool { | ||
| matches!(self, VmmState::Failed(_)) | ||
| } | ||
|
|
||
| /// Returns `true` if the VMM is in a state where it is safe to | ||
| /// deallocate its sled resources and mark it as deleted. | ||
| pub fn is_destroyable(&self) -> bool { | ||
| matches!( | ||
| self, | ||
| VmmState::Destroyed | VmmState::Failed(_) | VmmState::SagaUnwound | ||
| ) | ||
| } | ||
|
|
||
| /// Returns `true` if the VMM exists on a sled. | ||
| /// | ||
| /// This is `false` for VMMs that have not yet been created, have been | ||
| /// destroyed, or were produced by an unwound saga and will never exist. | ||
| pub fn exists_on_sled(&self) -> bool { | ||
| !matches!( | ||
| self, | ||
| VmmState::Creating | VmmState::SagaUnwound | VmmState::Destroyed | ||
| ) | ||
| } |
There was a problem hiding this comment.
These predicates previously existed in the nexus_db_model version of the type; I've moved them here.
| impl From<sled_agent::VmmState> for VmmState { | ||
| fn from(state: sled_agent::VmmState) -> Self { | ||
| match state { | ||
| sled_agent::VmmState::Starting => VmmState::Starting, | ||
| sled_agent::VmmState::Running => VmmState::Running, | ||
| sled_agent::VmmState::Stopping => VmmState::Stopping, | ||
| sled_agent::VmmState::Stopped => VmmState::Stopped, | ||
| sled_agent::VmmState::Rebooting => VmmState::Rebooting, | ||
| sled_agent::VmmState::Migrating => VmmState::Migrating, | ||
| sled_agent::VmmState::Failed => { | ||
| // If we are converting a state received from a sled-agent that | ||
| // indicates that the VMM is failed, the failure reason is | ||
| // implicitly "from_sled_agent" for now. | ||
| VmmState::Failed(VmmFailureReason::FromSledAgent) | ||
| } | ||
| sled_agent::VmmState::Destroyed => VmmState::Destroyed, | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
This is where we now synthesize the FromSledAgent failure reason when receiv9ing things from the sled-agent.
| impl From<VmmState> for sled_agent::VmmState { | ||
| fn from(state: VmmState) -> Self { | ||
| match state { | ||
| // The `Creating` state is internal to Nexus; the outside world | ||
| // should treat it as equivalent to `Starting`. | ||
| VmmState::Starting | VmmState::Creating => { | ||
| sled_agent::VmmState::Starting | ||
| } | ||
| VmmState::Running => sled_agent::VmmState::Running, | ||
| VmmState::Stopping => sled_agent::VmmState::Stopping, | ||
| VmmState::Stopped => sled_agent::VmmState::Stopped, | ||
| VmmState::Rebooting => sled_agent::VmmState::Rebooting, | ||
| VmmState::Migrating => sled_agent::VmmState::Migrating, | ||
| VmmState::Failed(_) => sled_agent::VmmState::Failed, | ||
|
|
||
| // The `SagaUnwound` state is internal to Nexus; the outside world | ||
| // should treat it as equivalent to `Destroyed`. | ||
| VmmState::Destroyed | VmmState::SagaUnwound => { | ||
| sled_agent::VmmState::Destroyed | ||
| } | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
This conversion already existed but was defined on the nexus_db_model type rather than here.
| impl From<VmmState> for omicron_common::api::external::InstanceState { | ||
| fn from(value: VmmState) -> Self { | ||
| use omicron_common::api::external::InstanceState as Output; | ||
|
|
||
| match value { | ||
| // An instance with a VMM which is in the `Creating` state maps to | ||
| // `InstanceState::Starting`, rather than `InstanceState::Creating`. | ||
| // If we are still creating the VMM, this is because we are | ||
| // attempting to *start* the instance; instances may be created | ||
| // without creating a VMM to run them, and then started later. | ||
| VmmState::Creating | VmmState::Starting => Output::Starting, | ||
| VmmState::Running => Output::Running, | ||
| VmmState::Stopping => Output::Stopping, | ||
| // `SagaUnwound` should map to `Stopped` so that an `instance_view` | ||
| // API call that produces an instance with an unwound VMM will appear to | ||
| // be `Stopped`. This is because instances with unwound VMMs can | ||
| // be started by a subsequent instance-start saga, just like | ||
| // instances whose internal state actually is `Stopped`. | ||
| VmmState::Stopped | VmmState::SagaUnwound => Output::Stopped, | ||
| VmmState::Rebooting => Output::Rebooting, | ||
| VmmState::Migrating => Output::Migrating, | ||
| VmmState::Failed(_) => Output::Failed, | ||
| VmmState::Destroyed => Output::Destroyed, | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
This conversion already existed but was defined on the nexus_db_model type.
| #[derive(Copy, Clone, Debug, Default)] | ||
| pub struct Migrations<'state> { | ||
| pub migration_in: Option<&'state sled_agent::MigrationRuntimeState>, | ||
| pub migration_out: Option<&'state sled_agent::MigrationRuntimeState>, | ||
| } | ||
|
|
||
| impl Migrations<'_> { | ||
| pub fn empty() -> Self { | ||
| Self { migration_in: None, migration_out: None } | ||
| } | ||
| } |
There was a problem hiding this comment.
This used to be defined in sled-agent-types but wasn't actually an API type at all, it just lived in the same module as the API types. It lives here now.
this is slightly sad because i prefered the `NoSuchVmm` naming, but would like the `IntoStaticStr` to remain `no_such_instance` so that a subsequent change to make the reason types in `instance_watcher`'s metrics use the new `VmmFailureReason` type, and it already uses this string, and i don't wanna change that...
| if state == &db::model::VmmState::Failed { | ||
| println!( | ||
| "{indent}{FAILURE_NOTE:>width$}: {}", | ||
| reason.description() |
There was a problem hiding this comment.
Is this description going to be different from the to_string() impl above (on line 8021)?
There was a problem hiding this comment.
description() provides a human-readable sentence describing what this reason means, while the to_string() impl returns the same string used as the database enum value. I wanted to include both here, so that you can see both the actual value in the DB and a more helpful explanation of what that means.
| * A request to the sled-agent received a response indicating that this | ||
| * VMM is no longer present on the sled. |
There was a problem hiding this comment.
This is the case which would happen on sled agent rebooting (or the whole sled rebooting), correct?
Is it worth calling that out explicitly? Are we aware of other situations where "no such instance" gets returned? I guess like "slow request got superseded by a different nexus"
I mostly bring this up to distinguish from the "sled off" and "from sled agent" cases
There was a problem hiding this comment.
Is it worth calling that out explicitly? Are we aware of other situations where "no such instance" gets returned?
It's returned any time that the sled-agent restarts, whether due to the sled rebooting or a panic.
I guess like "slow request got superseded by a different nexus"
I don't think we would ever see this due to a race with another Nexus trying to stop an instance or anything like that. Because the sled-agent handles all instance requests from nexus serially within that instance, we will have either already written the Destroyed state to the database and therefore never write a Failed state to the db (because the VMM's generation has been incremented), or we will drain all other inflight requests to get the VMM's state when it is destroyed and respond with the destroyed state. This really only happens if sled agent has restarted (or if there's a very werid sled-agent bug I guess).
| impl VmmFailureReason { | ||
| pub fn description(&self) -> &'static str { | ||
| match self { | ||
| Self::Prehistoric => "<VMM failed prior to recorded history>", |
There was a problem hiding this comment.
nit: Are the <> characters actually useful here
| Eq, | ||
| Serialize, | ||
| Deserialize, | ||
| strum::Display, |
There was a problem hiding this comment.
Should we use:
#[strum(to_string = "reason")]And drop description? idk. Seems a bit weird to have both
There was a problem hiding this comment.
I don't really understand this suggestion --- the Display impl is intended to format the concise name of the failure reason, which matches the value used in the db enum and (eventually) the instance_watcher metrics labels, which (as I noted in #10425 (comment)) have not yet been replaced with this enum, but which I would like to replace with the VmmFailureReason type in a subsequent PR. I didn't do that here, because there is a bunch of refactoring I would like to do in that task, and I didn't want to add even more code to this PR.
There was a problem hiding this comment.
If what you're suggesting here is that the Display impl could print the human readable sentence and the db-enum-like short label could just be returned by a separate label() or as_str() or something, I'm open to changing it. We do need both, but I don't have a very strong opinion about which one is the Display impl.
There was a problem hiding this comment.
eh, it's fine, if you're trying to use them as "labels vs full descriptions" I think it's fine as-is. I brought it up because I thought "we might only care about full descriptions", but it sounds like you have a valid use-case for the labels as distinct things too
| "error" => ?inner, | ||
| "reason" => %reason); | ||
|
|
||
| if let Err(set_failed_error) = osagactx |
There was a problem hiding this comment.
Compared to the old code - which only called mark_vmm_failed when inner.vmm_gone() was true - is it intentional that we do this marking for "any vmm failure reason"?
There was a problem hiding this comment.
Yes, this is intentional. Presently, the behavior here is identical to the previous behavior: SledAgentInstanceError::vmm_failure_reason only returns Some(VmmFailureReason::NoSuchInstance) if vmm_gone() returns true:
omicron/nexus/src/app/instance.rs
Lines 193 to 201 in dc07899
The reason I changed this to optionally return any failure reason is that I wanted to set up for a subsequent change to make the code which attempts to tell the sled-agent to do a state change follow up on CommunicationErrors by checking if the sled is on, similarly to how instance_watcher now does. If we make that subsequent change, I anticipate changing SledAgentInstanceError to also be able to return the SledOff variant.
| generation: Generation(vmm2.generation.0.next()), | ||
| &nexus_types::instance::VmmRuntimeState { | ||
| time_updated: Utc::now(), | ||
| generation: vmm2.generation.0.next().next(), |
There was a problem hiding this comment.
Why call next() twice here?
There was a problem hiding this comment.
The previous code was not accurately simulating the actual behavior. vmm2 was the original VMM inserted into the db (generation 1):
omicron/nexus/db-queries/src/db/datastore/vmm.rs
Lines 531 to 550 in dc07899
Subsequently, we insert a new state to transition it to Running, bumping the generation to 2:
So here we have to bump the generation twice for the insert to apply the next state.
I think this could be refactored a bit so that we store the updated state when we transition it to running, and just transition it again, which might be neater?
There was a problem hiding this comment.
eh, I'll defer that to you! Just figured I'd flag it, because it seemed "mostly unrelated to the rest of this PR" but still like a behavioral change
Okay, so this is a large and unpleasant change, and I apologize from the bottom of my heart to those of you who are being forced to review it. Nonetheless, I do feel that this is justified and perhaps the best of the many bad solutions to the problem. And, I think that the actual change is not ultimately that complex, despite the number of files which had to be touched.
In summary, this is an attempt to solve the same problem as #10285: when a VMM is moved to the
Failedstate, it sure would be nice if we could write down why that happened somewhere where it is possible to see that information ever again. Unfortunately, that change ran afoul of the present cyclic dependency between the Nexus internal and sled-agent APIs: we cannot easily add additional data to theSledVmmStatetype insled_agent_types, because it is part of both the server-side versioned sled-agent API and the client-side versioned Nexus API. My previous attempt to resolve this stickiness, #10303, was reasonably deemed somewhat too sketchy, and rather than moving forwards with that, I think we are better off waiting for client-side versioning to be a thing which is possible.Therefore, this change punts on including more specific failure details for failed VMMs reported by sled-agent. Instead, when the sled-agent reports that a VMM has failed, we just synthesize a
VmmFailureReason::FromSledAgent. When the failure is detected by Nexus without the sled-agent reporting the failure, we record one of several other failure reasons. This is implemented by adding a new layer of type indirection between the sled-agent API types and the DB model types: anexus_types::instance::SledVmmState. This way, when Nexus would like to internally record a failed VMM state, it can provide a failure reason, and at the boundary, when a failed state is received from a sled-agent, theFrom<sled_agent_types::instance::VmmInstanceState> for nexus_types::instance::SledVmmStatecan just fill in the failure reason.The other thing that this indirection gives us, which is either good or bad (?), is that we can make the
nexus_types::instance::VmmStateenum have aVmmFailureReasonin itsFailedstate, so the shape of the domain type enforces that there is a failure reason if the VMM is failed, and there is not a failure reason otherwise. Since these are separate columns in the database, the DB model type can represent a bunch of invalid states, which I didn't love. Unfortunately, having an additional domain type forVmmStateis part of why this change is so massive, as it requires an additional layer of type conversions and such, and results in touching way more files. I could be convinced this is not the right thing, but I still kind of loosely think it's probably worth it.Most of this change is just updating everything to use the new types.
Closes #10285