diff --git a/src/current/v23.2/monitoring-and-alerting.md b/src/current/v23.2/monitoring-and-alerting.md index f80cc618eb3..cad260001e5 100644 --- a/src/current/v23.2/monitoring-and-alerting.md +++ b/src/current/v23.2/monitoring-and-alerting.md @@ -110,14 +110,20 @@ The `http://:/health?ready=1` endpoint returns an HTTP `50 If you find that your load balancer's health check is not always recognizing a node as unready before the node shuts down, you can increase the `server.shutdown.initial_wait` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}) (previously named `server.shutdown.drain_wait`) to cause a node to return `503 Service Unavailable` even before it has started shutting down. {{site.data.alerts.end}} +- The node is [decommissioning or decommissioned]({% link {{ page.version.version }}/node-shutdown.md %}?filters=decommission#decommissioning). This causes load balancers and connection managers to reroute traffic to other nodes while replicas are rebalanced away from the node. + - The node is unable to communicate with a majority of the other nodes in the cluster, likely because the cluster is unavailable due to too many nodes being down. {% include_cached copy-clipboard.html %} ~~~ shell -$ curl http://localhost:8080/health?ready=1 +$ curl -i http://localhost:8080/health?ready=1 ~~~ +The `-i` flag includes the HTTP response status in the `curl` output. Without `-i`, `curl` prints only the response body by default. 
+ ~~~ +HTTP/1.1 503 Service Unavailable + { "error": "node is not healthy", "code": 14, diff --git a/src/current/v23.2/node-shutdown.md b/src/current/v23.2/node-shutdown.md index 1571c2da5d5..676ca5ab426 100644 --- a/src/current/v23.2/node-shutdown.md +++ b/src/current/v23.2/node-shutdown.md @@ -51,7 +51,7 @@ An operator [initiates the decommissioning process](#decommission-the-node) on t The node's [`is_decommissioning`]({% link {{ page.version.version }}/cockroach-node.md %}#node-status) field is set to `true` and its `membership` status is set to `decommissioning`, which causes its replicas to be rebalanced to other nodes. If the rebalancing stalls during decommissioning, replicas that have yet to move are printed to the [SQL shell]({% link {{ page.version.version }}/cockroach-sql.md %}) and written to the [`OPS` logging channel]({% link {{ page.version.version }}/logging-overview.md %}#logging-channels). [By default]({% link {{ page.version.version }}/configure-logs.md %}#default-logging-configuration), the `OPS` channel logs output to a `cockroach.log` file. -The node's [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) continues to consider the node "ready" so that the node can function as a gateway to route SQL client connections to relevant data. +The node's [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) returns an HTTP `503 Service Unavailable` response code with a JSON error response so that load balancers and connection managers stop directing new SQL client connections to the node while replicas are rebalanced. {{site.data.alerts.callout_info}} After this stage, the node is automatically drained. However, to avoid possible disruptions in query performance, you can manually drain the node before decommissioning. For more information, see [Perform node shutdown](#perform-node-shutdown). 
@@ -160,7 +160,7 @@ Before you [perform node shutdown](#perform-node-shutdown), review the following ### Load balancing -Your [load balancer]({% link {{ page.version.version }}/recommended-production-settings.md %}#load-balancing) should use the [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) to actively monitor node health and direct SQL client connections away from draining nodes. +Your [load balancer]({% link {{ page.version.version }}/recommended-production-settings.md %}#load-balancing) should use the [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) to actively monitor node health and direct SQL client connections away from nodes that are not ready to receive requests. To handle node shutdown effectively, the load balancer must be given enough time by the [`server.shutdown.initial_wait` duration](#server-shutdown-initial_wait). diff --git a/src/current/v24.1/monitoring-and-alerting.md b/src/current/v24.1/monitoring-and-alerting.md index 53224596aec..f6be5ec5be2 100644 --- a/src/current/v24.1/monitoring-and-alerting.md +++ b/src/current/v24.1/monitoring-and-alerting.md @@ -108,14 +108,20 @@ The `http://:/health?ready=1` endpoint returns an HTTP `50 If you find that your load balancer's health check is not always recognizing a node as unready before the node shuts down, you can increase the `server.shutdown.initial_wait` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}) (previously named `server.shutdown.drain_wait`) to cause a node to return `503 Service Unavailable` even before it has started shutting down. {{site.data.alerts.end}} +- The node is [decommissioning or decommissioned]({% link {{ page.version.version }}/node-shutdown.md %}?filters=decommission#decommissioning). This causes load balancers and connection managers to reroute traffic to other nodes while replicas are rebalanced away from the node. 
+ - The node is unable to communicate with a majority of the other nodes in the cluster, likely because the cluster is unavailable due to too many nodes being down. {% include_cached copy-clipboard.html %} ~~~ shell -$ curl http://localhost:8080/health?ready=1 +$ curl -i http://localhost:8080/health?ready=1 ~~~ +The `-i` flag includes the HTTP response status in the `curl` output. Without `-i`, `curl` prints only the response body by default. + ~~~ +HTTP/1.1 503 Service Unavailable + { "error": "node is not healthy", "code": 14, diff --git a/src/current/v24.1/node-shutdown.md b/src/current/v24.1/node-shutdown.md index 65567993f27..f65f8aeef26 100644 --- a/src/current/v24.1/node-shutdown.md +++ b/src/current/v24.1/node-shutdown.md @@ -51,7 +51,7 @@ An operator [initiates the decommissioning process](#decommission-the-node) on t The node's [`is_decommissioning`]({% link {{ page.version.version }}/cockroach-node.md %}#node-status) field is set to `true` and its `membership` status is set to `decommissioning`, which causes its replicas to be rebalanced to other nodes. If the rebalancing stalls during decommissioning, replicas that have yet to move are printed to the [SQL shell]({% link {{ page.version.version }}/cockroach-sql.md %}) and written to the [`OPS` logging channel]({% link {{ page.version.version }}/logging-overview.md %}#logging-channels). [By default]({% link {{ page.version.version }}/configure-logs.md %}#default-logging-configuration), the `OPS` channel logs output to a `cockroach.log` file. -The node's [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) continues to consider the node "ready" so that the node can function as a gateway to route SQL client connections to relevant data. 
+The node's [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) returns an HTTP `503 Service Unavailable` response code with a JSON error response so that load balancers and connection managers stop directing new SQL client connections to the node while replicas are rebalanced. {{site.data.alerts.callout_info}} After this stage, the node is automatically drained. However, to avoid possible disruptions in query performance, you can manually drain the node before decommissioning. For more information, see [Perform node shutdown](#perform-node-shutdown). @@ -160,7 +160,7 @@ Before you [perform node shutdown](#perform-node-shutdown), review the following ### Load balancing -Your [load balancer]({% link {{ page.version.version }}/recommended-production-settings.md %}#load-balancing) should use the [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) to actively monitor node health and direct SQL client connections away from draining nodes. +Your [load balancer]({% link {{ page.version.version }}/recommended-production-settings.md %}#load-balancing) should use the [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) to actively monitor node health and direct SQL client connections away from nodes that are not ready to receive requests. To handle node shutdown effectively, the load balancer must be given enough time by the [`server.shutdown.initial_wait` duration](#server-shutdown-initial_wait). 
diff --git a/src/current/v24.3/monitoring-and-alerting.md b/src/current/v24.3/monitoring-and-alerting.md index 126ea495e7e..25436967414 100644 --- a/src/current/v24.3/monitoring-and-alerting.md +++ b/src/current/v24.3/monitoring-and-alerting.md @@ -108,14 +108,20 @@ The `http://:/health?ready=1` endpoint returns an HTTP `50 If you find that your load balancer's health check is not always recognizing a node as unready before the node shuts down, you can increase the `server.shutdown.initial_wait` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}) (previously named `server.shutdown.drain_wait`) to cause a node to return `503 Service Unavailable` even before it has started shutting down. {{site.data.alerts.end}} +- The node is [decommissioning or decommissioned]({% link {{ page.version.version }}/node-shutdown.md %}?filters=decommission#decommissioning). This causes load balancers and connection managers to reroute traffic to other nodes while replicas are rebalanced away from the node. + - The node is unable to communicate with a majority of the other nodes in the cluster, likely because the cluster is unavailable due to too many nodes being down. {% include_cached copy-clipboard.html %} ~~~ shell -$ curl http://localhost:8080/health?ready=1 +$ curl -i http://localhost:8080/health?ready=1 ~~~ +The `-i` flag includes the HTTP response status in the `curl` output. Without `-i`, `curl` prints only the response body by default. 
+ ~~~ +HTTP/1.1 503 Service Unavailable + { "error": "node is not healthy", "code": 14, diff --git a/src/current/v24.3/node-shutdown.md b/src/current/v24.3/node-shutdown.md index fd1c056e3eb..42b8930f805 100644 --- a/src/current/v24.3/node-shutdown.md +++ b/src/current/v24.3/node-shutdown.md @@ -51,7 +51,7 @@ An operator [initiates the decommissioning process](#decommission-the-node) on t The node's [`is_decommissioning`]({% link {{ page.version.version }}/cockroach-node.md %}#node-status) field is set to `true` and its `membership` status is set to `decommissioning`, which causes its replicas to be rebalanced to other nodes. If the rebalancing stalls during decommissioning, replicas that have yet to move are printed to the [SQL shell]({% link {{ page.version.version }}/cockroach-sql.md %}) and written to the [`OPS` logging channel]({% link {{ page.version.version }}/logging-overview.md %}#logging-channels). [By default]({% link {{ page.version.version }}/configure-logs.md %}#default-logging-configuration), the `OPS` channel logs output to a `cockroach.log` file. -The node's [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) continues to consider the node "ready" so that the node can function as a gateway to route SQL client connections to relevant data. +The node's [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) returns an HTTP `503 Service Unavailable` response code with a JSON error response so that load balancers and connection managers stop directing new SQL client connections to the node while replicas are rebalanced. {{site.data.alerts.callout_info}} After this stage, the node is automatically drained. However, to avoid possible disruptions in query performance, you can manually drain the node before decommissioning. For more information, see [Perform node shutdown](#perform-node-shutdown). 
@@ -160,7 +160,7 @@ Before you [perform node shutdown](#perform-node-shutdown), review the following ### Load balancing -Your [load balancer]({% link {{ page.version.version }}/recommended-production-settings.md %}#load-balancing) should use the [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) to actively monitor node health and direct SQL client connections away from draining nodes. +Your [load balancer]({% link {{ page.version.version }}/recommended-production-settings.md %}#load-balancing) should use the [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) to actively monitor node health and direct SQL client connections away from nodes that are not ready to receive requests. To handle node shutdown effectively, the load balancer must be given enough time by the [`server.shutdown.initial_wait` duration](#server-shutdown-initial_wait). diff --git a/src/current/v25.2/monitoring-and-alerting.md b/src/current/v25.2/monitoring-and-alerting.md index 5a5816fb1b3..70d78a47add 100644 --- a/src/current/v25.2/monitoring-and-alerting.md +++ b/src/current/v25.2/monitoring-and-alerting.md @@ -108,14 +108,20 @@ The `http://:/health?ready=1` endpoint returns an HTTP `50 If you find that your load balancer's health check is not always recognizing a node as unready before the node shuts down, you can increase the `server.shutdown.initial_wait` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}) (previously named `server.shutdown.drain_wait`) to cause a node to return `503 Service Unavailable` even before it has started shutting down. {{site.data.alerts.end}} +- The node is [decommissioning or decommissioned]({% link {{ page.version.version }}/node-shutdown.md %}?filters=decommission#decommissioning). This causes load balancers and connection managers to reroute traffic to other nodes while replicas are rebalanced away from the node. 
+ - The node is unable to communicate with a majority of the other nodes in the cluster, likely because the cluster is unavailable due to too many nodes being down. {% include_cached copy-clipboard.html %} ~~~ shell -$ curl http://localhost:8080/health?ready=1 +$ curl -i http://localhost:8080/health?ready=1 ~~~ +The `-i` flag includes the HTTP response status in the `curl` output. Without `-i`, `curl` prints only the response body by default. + ~~~ +HTTP/1.1 503 Service Unavailable + { "error": "node is not healthy", "code": 14, @@ -143,7 +149,7 @@ The `/_status/vars` metrics endpoint is in Prometheus format and is not deprecat Several endpoints return raw status meta information in JSON at `http://:/#/debug`. You can investigate and use these endpoints, but note that they are subject to change. -Raw Status Endpoints +Raw Status Endpoints ### Node status command diff --git a/src/current/v25.2/node-shutdown.md b/src/current/v25.2/node-shutdown.md index ef95a6e9956..d6a4ca515ae 100644 --- a/src/current/v25.2/node-shutdown.md +++ b/src/current/v25.2/node-shutdown.md @@ -51,7 +51,7 @@ An operator [initiates the decommissioning process](#decommission-the-node) on t The node's [`is_decommissioning`]({% link {{ page.version.version }}/cockroach-node.md %}#node-status) field is set to `true` and its `membership` status is set to `decommissioning`, which causes its replicas to be rebalanced to other nodes. If the rebalancing stalls during decommissioning, replicas that have yet to move are printed to the [SQL shell]({% link {{ page.version.version }}/cockroach-sql.md %}) and written to the [`OPS` logging channel]({% link {{ page.version.version }}/logging-overview.md %}#logging-channels). [By default]({% link {{ page.version.version }}/configure-logs.md %}#default-logging-configuration), the `OPS` channel logs output to a `cockroach.log` file. 
-The node's [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) continues to consider the node "ready" so that the node can function as a gateway to route SQL client connections to relevant data. +The node's [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) returns an HTTP `503 Service Unavailable` response code with a JSON error response so that load balancers and connection managers stop directing new SQL client connections to the node while replicas are rebalanced. {{site.data.alerts.callout_info}} After this stage, the node is automatically drained. However, to avoid possible disruptions in query performance, you can manually drain the node before decommissioning. For more information, see [Perform node shutdown](#perform-node-shutdown). @@ -160,7 +160,7 @@ Before you [perform node shutdown](#perform-node-shutdown), review the following ### Load balancing -Your [load balancer]({% link {{ page.version.version }}/recommended-production-settings.md %}#load-balancing) should use the [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) to actively monitor node health and direct SQL client connections away from draining nodes. +Your [load balancer]({% link {{ page.version.version }}/recommended-production-settings.md %}#load-balancing) should use the [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) to actively monitor node health and direct SQL client connections away from nodes that are not ready to receive requests. To handle node shutdown effectively, the load balancer must be given enough time by the [`server.shutdown.initial_wait` duration](#server-shutdown-initial_wait). 
diff --git a/src/current/v25.4/monitoring-and-alerting.md b/src/current/v25.4/monitoring-and-alerting.md index cc29283cd40..254666a9134 100644 --- a/src/current/v25.4/monitoring-and-alerting.md +++ b/src/current/v25.4/monitoring-and-alerting.md @@ -108,14 +108,20 @@ The `http://:/health?ready=1` endpoint returns an HTTP `50 If you find that your load balancer's health check is not always recognizing a node as unready before the node shuts down, you can increase the `server.shutdown.initial_wait` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}) (previously named `server.shutdown.drain_wait`) to cause a node to return `503 Service Unavailable` even before it has started shutting down. {{site.data.alerts.end}} +- The node is [decommissioning or decommissioned]({% link {{ page.version.version }}/node-shutdown.md %}?filters=decommission#decommissioning). This causes load balancers and connection managers to reroute traffic to other nodes while replicas are rebalanced away from the node. + - The node is unable to communicate with a majority of the other nodes in the cluster, likely because the cluster is unavailable due to too many nodes being down. {% include_cached copy-clipboard.html %} ~~~ shell -$ curl http://localhost:8080/health?ready=1 +$ curl -i http://localhost:8080/health?ready=1 ~~~ +The `-i` flag includes the HTTP response status in the `curl` output. Without `-i`, `curl` prints only the response body by default. 
+ ~~~ +HTTP/1.1 503 Service Unavailable + { "error": "node is not healthy", "code": 14, diff --git a/src/current/v25.4/node-shutdown.md b/src/current/v25.4/node-shutdown.md index 8b0067fe052..a3f46a8fe86 100644 --- a/src/current/v25.4/node-shutdown.md +++ b/src/current/v25.4/node-shutdown.md @@ -51,7 +51,7 @@ An operator [initiates the decommissioning process](#decommission-the-node) on t The node's [`is_decommissioning`]({% link {{ page.version.version }}/cockroach-node.md %}#node-status) field is set to `true` and its `membership` status is set to `decommissioning`, which causes its replicas to be rebalanced to other nodes. If the rebalancing stalls during decommissioning, replicas that have yet to move are printed to the [SQL shell]({% link {{ page.version.version }}/cockroach-sql.md %}) and written to the [`OPS` logging channel]({% link {{ page.version.version }}/logging-overview.md %}#logging-channels). [By default]({% link {{ page.version.version }}/configure-logs.md %}#default-logging-configuration), the `OPS` channel logs output to a `cockroach.log` file. -The node's [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) continues to consider the node "ready" so that the node can function as a gateway to route SQL client connections to relevant data. +The node's [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) returns an HTTP `503 Service Unavailable` response code with a JSON error response so that load balancers and connection managers stop directing new SQL client connections to the node while replicas are rebalanced. {{site.data.alerts.callout_info}} After this stage, the node is automatically drained. However, to avoid possible disruptions in query performance, you can manually drain the node before decommissioning. For more information, see [Perform node shutdown](#perform-node-shutdown). 
@@ -160,7 +160,7 @@ Before you [perform node shutdown](#perform-node-shutdown), review the following ### Load balancing -Your [load balancer]({% link {{ page.version.version }}/recommended-production-settings.md %}#load-balancing) should use the [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) to actively monitor node health and direct SQL client connections away from draining nodes. +Your [load balancer]({% link {{ page.version.version }}/recommended-production-settings.md %}#load-balancing) should use the [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) to actively monitor node health and direct SQL client connections away from nodes that are not ready to receive requests. To handle node shutdown effectively, the load balancer must be given enough time by the [`server.shutdown.initial_wait` duration](#server-shutdown-initial_wait). diff --git a/src/current/v26.1/monitoring-and-alerting.md b/src/current/v26.1/monitoring-and-alerting.md index 46dfaf780ea..ac2bf56bbe5 100644 --- a/src/current/v26.1/monitoring-and-alerting.md +++ b/src/current/v26.1/monitoring-and-alerting.md @@ -108,14 +108,20 @@ The `http://:/health?ready=1` endpoint returns an HTTP `50 If you find that your load balancer's health check is not always recognizing a node as unready before the node shuts down, you can increase the `server.shutdown.initial_wait` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}) (previously named `server.shutdown.drain_wait`) to cause a node to return `503 Service Unavailable` even before it has started shutting down. {{site.data.alerts.end}} +- The node is [decommissioning or decommissioned]({% link {{ page.version.version }}/node-shutdown.md %}?filters=decommission#decommissioning). This causes load balancers and connection managers to reroute traffic to other nodes while replicas are rebalanced away from the node. 
+ - The node is unable to communicate with a majority of the other nodes in the cluster, likely because the cluster is unavailable due to too many nodes being down. {% include_cached copy-clipboard.html %} ~~~ shell -$ curl http://localhost:8080/health?ready=1 +$ curl -i http://localhost:8080/health?ready=1 ~~~ +The `-i` flag includes the HTTP response status in the `curl` output. Without `-i`, `curl` prints only the response body by default. + ~~~ +HTTP/1.1 503 Service Unavailable + { "error": "node is not healthy", "code": 14, diff --git a/src/current/v26.1/node-shutdown.md b/src/current/v26.1/node-shutdown.md index d8ef577783f..73c439d8db7 100644 --- a/src/current/v26.1/node-shutdown.md +++ b/src/current/v26.1/node-shutdown.md @@ -51,7 +51,7 @@ An operator [initiates the decommissioning process](#decommission-the-node) on t The node's [`is_decommissioning`]({% link {{ page.version.version }}/cockroach-node.md %}#node-status) field is set to `true` and its `membership` status is set to `decommissioning`, which causes its replicas to be rebalanced to other nodes. If the rebalancing stalls during decommissioning, replicas that have yet to move are printed to the [SQL shell]({% link {{ page.version.version }}/cockroach-sql.md %}) and written to the [`OPS` logging channel]({% link {{ page.version.version }}/logging-overview.md %}#logging-channels). [By default]({% link {{ page.version.version }}/configure-logs.md %}#default-logging-configuration), the `OPS` channel logs output to a `cockroach.log` file. -The node's [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) continues to consider the node "ready" so that the node can function as a gateway to route SQL client connections to relevant data. 
+The node's [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) returns an HTTP `503 Service Unavailable` response code with a JSON error response so that load balancers and connection managers stop directing new SQL client connections to the node while replicas are rebalanced. {{site.data.alerts.callout_info}} After this stage, the node is automatically drained. However, to avoid possible disruptions in query performance, you can manually drain the node before decommissioning. For more information, see [Perform node shutdown](#perform-node-shutdown). @@ -160,7 +160,7 @@ Before you [perform node shutdown](#perform-node-shutdown), review the following ### Load balancing -Your [load balancer]({% link {{ page.version.version }}/recommended-production-settings.md %}#load-balancing) should use the [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) to actively monitor node health and direct SQL client connections away from draining nodes. +Your [load balancer]({% link {{ page.version.version }}/recommended-production-settings.md %}#load-balancing) should use the [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) to actively monitor node health and direct SQL client connections away from nodes that are not ready to receive requests. To handle node shutdown effectively, the load balancer must be given enough time by the [`server.shutdown.initial_wait` duration](#server-shutdown-initial_wait). 
diff --git a/src/current/v26.2/monitoring-and-alerting.md b/src/current/v26.2/monitoring-and-alerting.md index d6bd689cb1a..22b28fe56fa 100644 --- a/src/current/v26.2/monitoring-and-alerting.md +++ b/src/current/v26.2/monitoring-and-alerting.md @@ -120,14 +120,20 @@ The `http://:/health?ready=1` endpoint returns an HTTP `50 If you find that your load balancer's health check is not always recognizing a node as unready before the node shuts down, you can increase the `server.shutdown.initial_wait` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}) (previously named `server.shutdown.drain_wait`) to cause a node to return `503 Service Unavailable` even before it has started shutting down. {{site.data.alerts.end}} +- The node is [decommissioning or decommissioned]({% link {{ page.version.version }}/node-shutdown.md %}?filters=decommission#decommissioning). This causes load balancers and connection managers to reroute traffic to other nodes while replicas are rebalanced away from the node. + - The node is unable to communicate with a majority of the other nodes in the cluster, likely because the cluster is unavailable due to too many nodes being down. {% include_cached copy-clipboard.html %} ~~~ shell -$ curl http://localhost:8080/health?ready=1 +$ curl -i http://localhost:8080/health?ready=1 ~~~ +The `-i` flag includes the HTTP response status in the `curl` output. Without `-i`, `curl` prints only the response body by default. 
+ ~~~ +HTTP/1.1 503 Service Unavailable + { "error": "node is not healthy", "code": 14, diff --git a/src/current/v26.2/node-shutdown.md b/src/current/v26.2/node-shutdown.md index 7f18836c2a9..e14626f7553 100644 --- a/src/current/v26.2/node-shutdown.md +++ b/src/current/v26.2/node-shutdown.md @@ -51,7 +51,7 @@ An operator [initiates the decommissioning process](#decommission-the-node) on t The node's [`is_decommissioning`]({% link {{ page.version.version }}/cockroach-node.md %}#node-status) field is set to `true` and its `membership` status is set to `decommissioning`, which causes its replicas to be rebalanced to other nodes. If the rebalancing stalls during decommissioning, replicas that have yet to move are printed to the [SQL shell]({% link {{ page.version.version }}/cockroach-sql.md %}) and written to the [`OPS` logging channel]({% link {{ page.version.version }}/logging-overview.md %}#logging-channels). [By default]({% link {{ page.version.version }}/configure-logs.md %}#default-logging-configuration), the `OPS` channel logs output to a `cockroach.log` file. -The node's [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) continues to consider the node "ready" so that the node can function as a gateway to route SQL client connections to relevant data. +The node's [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) returns an HTTP `503 Service Unavailable` response code with a JSON error response so that load balancers and connection managers stop directing new SQL client connections to the node while replicas are rebalanced. {{site.data.alerts.callout_info}} After this stage, the node is automatically drained. However, to avoid possible disruptions in query performance, you can manually drain the node before decommissioning. For more information, see [Perform node shutdown](#perform-node-shutdown). 
@@ -160,7 +160,7 @@ Before you [perform node shutdown](#perform-node-shutdown), review the following ### Load balancing -Your [load balancer]({% link {{ page.version.version }}/recommended-production-settings.md %}#load-balancing) should use the [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) to actively monitor node health and direct SQL client connections away from draining nodes. +Your [load balancer]({% link {{ page.version.version }}/recommended-production-settings.md %}#load-balancing) should use the [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) to actively monitor node health and direct SQL client connections away from nodes that are not ready to receive requests. To handle node shutdown effectively, the load balancer must be given enough time by the [`server.shutdown.initial_wait` duration](#server-shutdown-initial_wait). diff --git a/src/current/v26.3/monitoring-and-alerting.md b/src/current/v26.3/monitoring-and-alerting.md index 45bbb44de9d..9d87345fa7a 100644 --- a/src/current/v26.3/monitoring-and-alerting.md +++ b/src/current/v26.3/monitoring-and-alerting.md @@ -120,14 +120,20 @@ The `http://:/health?ready=1` endpoint returns an HTTP `50 If you find that your load balancer's health check is not always recognizing a node as unready before the node shuts down, you can increase the `server.shutdown.initial_wait` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}) (previously named `server.shutdown.drain_wait`) to cause a node to return `503 Service Unavailable` even before it has started shutting down. {{site.data.alerts.end}} +- The node is [decommissioning or decommissioned]({% link {{ page.version.version }}/node-shutdown.md %}?filters=decommission#decommissioning). This causes load balancers and connection managers to reroute traffic to other nodes while replicas are rebalanced away from the node. 
+ - The node is unable to communicate with a majority of the other nodes in the cluster, likely because the cluster is unavailable due to too many nodes being down. {% include_cached copy-clipboard.html %} ~~~ shell -$ curl http://localhost:8080/health?ready=1 +$ curl -i http://localhost:8080/health?ready=1 ~~~ +The `-i` flag includes the HTTP response status in the `curl` output. Without `-i`, `curl` prints only the response body by default. + ~~~ +HTTP/1.1 503 Service Unavailable + { "error": "node is not healthy", "code": 14, diff --git a/src/current/v26.3/node-shutdown.md b/src/current/v26.3/node-shutdown.md index 786090ba766..ce4757aceab 100644 --- a/src/current/v26.3/node-shutdown.md +++ b/src/current/v26.3/node-shutdown.md @@ -51,7 +51,7 @@ An operator [initiates the decommissioning process](#decommission-the-node) on t The node's [`is_decommissioning`]({% link {{ page.version.version }}/cockroach-node.md %}#node-status) field is set to `true` and its `membership` status is set to `decommissioning`, which causes its replicas to be rebalanced to other nodes. If the rebalancing stalls during decommissioning, replicas that have yet to move are printed to the [SQL shell]({% link {{ page.version.version }}/cockroach-sql.md %}) and written to the [`OPS` logging channel]({% link {{ page.version.version }}/logging-overview.md %}#logging-channels). [By default]({% link {{ page.version.version }}/configure-logs.md %}#default-logging-configuration), the `OPS` channel logs output to a `cockroach.log` file. -The node's [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) continues to consider the node "ready" so that the node can function as a gateway to route SQL client connections to relevant data. 
+The node's [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) returns an HTTP `503 Service Unavailable` response code with a JSON error response so that load balancers and connection managers stop directing new SQL client connections to the node while replicas are rebalanced. {{site.data.alerts.callout_info}} After this stage, the node is automatically drained. However, to avoid possible disruptions in query performance, you can manually drain the node before decommissioning. For more information, see [Perform node shutdown](#perform-node-shutdown). @@ -160,7 +160,7 @@ Before you [perform node shutdown](#perform-node-shutdown), review the following ### Load balancing -Your [load balancer]({% link {{ page.version.version }}/recommended-production-settings.md %}#load-balancing) should use the [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) to actively monitor node health and direct SQL client connections away from draining nodes. +Your [load balancer]({% link {{ page.version.version }}/recommended-production-settings.md %}#load-balancing) should use the [`/health?ready=1` endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#health-ready-1) to actively monitor node health and direct SQL client connections away from nodes that are not ready to receive requests. To handle node shutdown effectively, the load balancer must be given enough time by the [`server.shutdown.initial_wait` duration](#server-shutdown-initial_wait).
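
Review note: with this change, a decommissioning node is reported as not ready, so any script that probes `/health?ready=1` should branch on the HTTP status code rather than on the response body. The following is a minimal sketch of such a probe, not part of the patch. The function names `classify_ready_status` and `check_node_ready` are hypothetical, the `http://localhost:8080` default is taken from the docs example, and treating `200` as "ready" is an assumption based on the endpoint's documented success response.

```shell
#!/usr/bin/env bash

# Hypothetical helper: map an HTTP status code from /health?ready=1 to a verdict.
# Assumption: 200 means ready; 503 means draining, decommissioning, or no quorum.
classify_ready_status() {
  case "$1" in
    200) echo "ready" ;;
    503) echo "not ready" ;;
    *)   echo "unknown" ;;
  esac
}

# Hypothetical helper: probe one node and print the verdict.
# -s hides progress output, -o /dev/null discards the body,
# -w '%{http_code}' prints only the numeric status code.
check_node_ready() {
  local base_url="${1:-http://localhost:8080}"
  local code
  # If curl itself fails (for example, connection refused), fall back to "000".
  code="$(curl -s -o /dev/null -w '%{http_code}' "${base_url}/health?ready=1")" || code="000"
  classify_ready_status "$code"
}
```

To try it against a local node, call `check_node_ready http://localhost:8080`. A node that cannot be reached at all falls into the `unknown` branch, which a load balancer would normally also treat as not ready.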