Re-organize high availability cluster-based deployment docs #8812
roberson-io wants to merge 5 commits
Restructure the HA deployment documentation to match admin workflow:

- Add Preparation section with pre-deployment guidance
- Add Deployment guide section with step-by-step instructions
- Add Next steps section for scaling optimizations
- Preserve Operations and maintenance section for advanced topics
- Emphasize PostgreSQL hot_standby and hot_standby_feedback settings
- Guide admins toward database configuration over config.json

Closes #8811

Co-authored-by: Michael Roberson <roberson-io@users.noreply.github.com>
Newest code from mattermost has been published to preview environment for Git SHA 1bb1c44
Newest code from mattermost has been published to preview environment for Git SHA e5b2253
No actionable comments were generated in the recent review.

📝 Walkthrough
Reorganizes the high-availability cluster deployment guide with a new Preparation section, prescriptive deployment steps, expanded proxy/storage/database/upgrade/troubleshooting content, and several cross-reference anchor fixes across related docs.

Changes: High-Availability Documentation Restructure and Cross-Link Fixes

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks: ✅ 5 passed
Actionable comments posted: 1
🧹 Nitpick comments (2)
source/administration-guide/scale/high-availability-cluster-based-deployment.rst (2)
540-540: Defaulting NFS mounts to `soft` is risky for application data consistency.
The examples on lines 540, 548, and 563 use `rw,soft,intr`, while line 573 only notes `hard,intr` as an alternative. For shared app data, `soft` can return I/O errors under transient network issues and may cause partial writes.
Suggested doc adjustment:
- sudo mount -t nfs -o rw,soft,intr NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data
+ sudo mount -t nfs -o rw,hard,intr NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data
...
- NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data nfs rw,soft,intr 0 0
+ NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data nfs rw,hard,intr 0 0
Also applies to: 548-548, 563-563, 573-573
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/administration-guide/scale/high-availability-cluster-based-deployment.rst` at line 540, The NFS mount examples use the risky option "rw,soft,intr" which can cause I/O errors and partial writes; change the example mount options to use "rw,hard,intr" (or show both options and explicitly recommend "hard,intr" for application data) and add a brief explanatory note after the command that for shared application data you should prefer hard mounts to avoid transient-network-induced partial writes and data corruption; update every occurrence of the example mount command (the string "sudo mount -t nfs -o rw,soft,intr NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data") and add a short caution sentence referencing "hard,intr" as the recommended setting.
709-709: Line 709 incorrectly claims RDS does not expose PostgreSQL configuration access.
Both Aurora and RDS PostgreSQL expose configuration through DB parameter groups and cluster parameter groups, allowing administrators to tune settings like memory allocation, replication parameters, and other PostgreSQL options. Reword to clarify that these configuration mechanisms are available, rather than suggesting configuration is unavailable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/administration-guide/scale/high-availability-cluster-based-deployment.rst` at line 709, Replace the incorrect sentence "Amazon RDS does not expose direct PostgreSQL configuration access" with a clarified statement that both Amazon RDS (for PostgreSQL) and Amazon Aurora expose configuration via DB parameter groups and cluster parameter groups; mention that these parameter groups allow tuning of memory, replication, and other PostgreSQL options and that monitoring should still be done via CloudWatch and RDS Performance Insights (locate the sentence containing "Amazon RDS does not expose direct PostgreSQL configuration access" in high-availability-cluster-based-deployment.rst and update it to reference DB parameter groups/cluster parameter groups and retained monitoring guidance).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@source/administration-guide/scale/high-availability-cluster-based-deployment.rst`:
- Around line 360-383: Clarify the health-check paragraph to explain the
difference between passive peer quarantining (controlled by max_fails=0) and
per-request active failover (controlled by proxy_next_upstream and similar
settings): explicitly state that setting max_fails=0 disables passive marking of
an upstream as unavailable, while NGINX can still perform per-request failover
using proxy_next_upstream so requests may be retried to other backends even when
peers are not quarantined; mention the Mattermost API ping endpoint
(http://SERVER_IP:8065/api/v4/system/ping) as a way to monitor server health but
note it does not change the distinction between passive and active failure
handling.
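The active health check that this comment asks the docs to recommend could look roughly like the sketch below. It is an illustration only, not part of the PR: the function names `http_status` and `unhealthy_backends` are hypothetical, and it assumes each Mattermost node serves the ping endpoint over plain HTTP on port 8065, as in the URL quoted above.

```python
import urllib.error
import urllib.request

PING_PATH = "/api/v4/system/ping"


def http_status(url, timeout=3.0):
    """Return the HTTP status code for url, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timeout, DNS failure, and similar


def unhealthy_backends(backends, fetch=http_status):
    """Return the backends whose ping endpoint did not answer with HTTP 200."""
    failed = []
    for host in backends:
        status = fetch(f"http://{host}:8065{PING_PATH}")
        if status != 200:
            failed.append(host)
    return failed
```

A monitoring job could call `unhealthy_backends(["10.0.0.1", "10.0.0.2"])` on a schedule and alert, or take a node out of the upstream block, when the list is non-empty. The `fetch` parameter exists only so the check can be exercised without a network.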
@neillcollie - Would you be able to help give this a technical review when you're able?

neillcollie left a comment:

Massive uplift compared with the previous HA docs.
I recently implemented an HA environment, and my notes would have been to include additional NFS details. This PR now covers NFS comprehensively.
Resolved conflict in source/administration-guide/scale/high-availability-cluster-based-deployment.rst: adopted master's corrected HA sysctl values from #8939 (tcp_rmem/tcp_wmem max 2500000, net.core.{r,w}mem_max 16777216, dropped rmem_default/wmem_default/tcp_mem) while keeping this branch's restructure into a sudo tee heredoc with the added vm.min_free_kbytes line.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Newest code from mattermost has been published to preview environment for Git SHA 8a50cb5
♻️ Duplicate comments (1)
source/administration-guide/scale/high-availability-cluster-based-deployment.rst (1)
385-386: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Clarify health check behavior in relation to the `max_fails=0` setting.
Line 385 states "NGINX automatically stops routing traffic to backend servers that fail to respond," which may mislead readers given the `max_fails=0` setting on lines 286-287. With `max_fails=0`, NGINX does not mark backends as unavailable based on failure count (passive quarantining is disabled), though per-request failover via `proxy_next_upstream` can still retry individual requests on other backends. Consider clarifying that the monitoring endpoint helps detect issues, but with `max_fails=0`, failed backends remain in the rotation unless removed manually or via external health checks.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@source/administration-guide/scale/high-availability-cluster-based-deployment.rst` around lines 385 - 386, The guidance about NGINX removing unhealthy backends is misleading given the existing upstream config using max_fails=0; update the paragraph that mentions the Mattermost health endpoint (http://SERVER_IP:8065/api/v4/system/ping) to clarify that with max_fails=0 the upstream will not be passively marked down based on failure count, that proxy_next_upstream can still retry individual requests, and recommend using external active health checks or manual removal of nodes to take servers out of rotation; reference the max_fails=0 and proxy_next_upstream settings so the reader can correlate behavior with the config.
Pull request overview
This PR reorganizes the Mattermost high availability cluster deployment documentation around the admin workflow, adding preparation, deployment, next steps, and operations/maintenance sections while expanding guidance for file storage, proxy, database replicas, and production PostgreSQL settings.
Changes:
- Adds workflow-oriented HA preparation and deployment guidance.
- Expands NGINX, S3/NFS storage, PostgreSQL replica, and failover documentation.
- Moves operational topics such as cluster discovery, jobs, plugins, upgrades, FAQ, and troubleshooting under maintenance-oriented sections.
1. Back up your Mattermost database and the file storage location. See the :doc:`backup </deployment-guide/backup-disaster-recovery>` documentation for details.
2. Modify your NGINX setup to remove the server. For information about this, see :ref:`proxy server configuration <deployment-guide/server/setup-nginx-proxy:manage the nginx process>` documentation for details.
3. Open **System Console > Environment > High Availability** to verify that all the machines remaining in the cluster are communicating as expected with green status indicators. If not, investigate the log files for any extra information.

- **Non-Kubernetes deployments:** Follow the :doc:`Deploy Mattermost on Linux </deployment-guide/server/deploy-mattermost-on-linux>` instructions to install the same version of Mattermost on each additional server.

   sudo systemctl restart mattermost

9. **Verify cluster communication:** Open **System Console > Environment > High Availability** to verify that each server in the cluster is communicating as expected with green status indicators. If not, investigate the log files for additional information.

.. note::
7. **Verify proxy functionality:** Test access through the proxy using your configured domain name and verify traffic is distributed across backend servers by checking Mattermost server logs.

3. Updating the Mattermost configuration to point to the new storage
4. Verifying that all files are accessible

If you have non-standard (i.e. complex) network configurations, then you may need to use the :ref:`Override Hostname <administration-guide/configure/environment-configuration-settings:override hostname>` setting to help the cluster nodes discover each other. The cluster settings in the config are removed from the config file hash for this reason, meaning you can have slightly different cluster configuration settings in high availability mode. The Override Hostname is intended to be different for each clustered node if you need to force discovery.

If ``UseIpAddress`` is set to ``true``, it attempts to obtain the IP address by searching for the first non-local IP address (non-loop-back, non-localunicast, non-localmulticast network interface). It enumerates the network interfaces using the built-in go function `net.InterfaceAddrs() <https://pkg.go.dev/net#InterfaceAddrs>`_. Otherwise it tries to get the hostname using the `os.Hostname() <https://pkg.go.dev/os#Hostname>`_ built-in go function.
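The address selection described in that paragraph can be sketched as follows. This is an illustration in Python, not the Go code Mattermost actually runs: `first_non_local_ip` and `advertised_name` are hypothetical names, the addresses are passed in as strings rather than read from the interfaces, and the sketch skips every multicast address, which is broader than the link-local multicast filter the paragraph describes.

```python
import ipaddress
import socket


def first_non_local_ip(addresses):
    """Return the first address that is not loopback, link-local, or multicast."""
    for raw in addresses:
        ip = ipaddress.ip_address(raw)
        if ip.is_loopback or ip.is_link_local or ip.is_multicast:
            continue  # not usable for reaching this node from its peers
        return str(ip)
    return None


def advertised_name(addresses, use_ip_address):
    """Pick what a node would advertise: an IP when requested, else the hostname."""
    if use_ip_address:
        return first_non_local_ip(addresses)
    return socket.gethostname()
```

On a node with several interfaces, the order of `addresses` decides which IP wins, which is one reason the Override Hostname setting mentioned above is useful on complex networks.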
.. code-block:: text

   NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data nfs rw,soft,intr 0 0

.. code-block:: text

   192.168.1.100:/mnt/mattermost-data /opt/mattermost/data nfs rw,soft,intr 0 0
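To find entries like the two quoted above on an existing server, an admin could scan `/etc/fstab` for NFS mounts that still use `soft`. The following is a minimal sketch under stated assumptions: `soft_nfs_mounts` is a hypothetical helper, and it only understands the standard whitespace-separated fstab layout (device, mount point, type, options).

```python
def soft_nfs_mounts(fstab_text):
    """Return mount points of NFS entries whose options include 'soft'."""
    flagged = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type in ("nfs", "nfs4") and "soft" in options.split(","):
            flagged.append(mount_point)
    return flagged
```

Calling `soft_nfs_mounts(open("/etc/fstab").read())` on each Mattermost node would list the shared data mounts that the review recommends switching to `hard`.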
5. **Configure TLS:** For production deployments, configure TLS on your NGINX proxy. See :doc:`Set up TLS </deployment-guide/server/setup-tls>` for detailed instructions on configuring TLS with NGINX. You can either use Let's Encrypt for automatic certificate management or provide your own TLS certificates.

Use the :ref:`read replica <administration-guide/configure/environment-configuration-settings:read replicas>` feature to scale the database. The Mattermost server can be set up to use one master database and one or more read replica databases.

6. **Configure health checks:** NGINX automatically stops routing traffic to backend servers that fail to respond. You can monitor server health using the Mattermost API endpoint ``http://SERVER_IP:8065/api/v4/system/ping`` which returns ``Status 200`` for healthy servers.

.. code-block:: bash

   # On the replica server, create base backup from primary
   sudo -u postgres pg_basebackup -h PRIMARY_IP -D /var/lib/postgresql/data -U replication_user -P -v -R -X stream

The ``-R`` flag automatically creates the ``standby.signal`` file and configures replication settings.

   # On the replica server
   sudo -u postgres pg_ctl promote -D /var/lib/postgresql/data
- NFS mounts: switch shared data directory examples from rw,soft,intr to rw,hard
to prevent partial writes / file corruption during transient NFS outages.
Drop deprecated intr option (no-op since kernel 2.6.25). Invert the note to
frame hard as the safe default and soft as a documented availability tradeoff.
- NGINX health checks: rewrite item 6 to accurately describe what max_fails=0
does (disables passive quarantining), note that per-request failover via
proxy_next_upstream still works, and direct admins to external active health
checks against /api/v4/system/ping.
- Fix broken :doc: target deploy-mattermost-on-linux -> deploy-linux.
- Section rename anchor fixes: update the URL fragment in
enterprise-roll-out-checklist.rst (#proxy-server-configuration ->
#proxy-server) and the :ref: targets in backing-storage-benchmarks.rst,
environment-configuration-settings.rst, and manage/statistics.rst to match
the new "File storage" and "Database" headings.
- PostgreSQL replica/promote commands: replace the hardcoded
/var/lib/postgresql/data with a DATA_DIR placeholder plus inline
platform-specific paths (Ubuntu/Debian: /var/lib/postgresql/{version}/main,
RHEL/CentOS: /var/lib/pgsql/{version}/data). Note that pg_basebackup requires
the target data directory to be empty.
- Fix UseIpAddress -> UseIPAddress casing for consistency with the canonical
ClusterSettings.UseIPAddress key used elsewhere in the doc.
- Correct the Amazon RDS / Aurora PostgreSQL configuration claim: both expose
configuration through DB parameter groups (RDS) and DB cluster parameter
groups (Aurora).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Newest code from mattermost has been published to preview environment for Git SHA 23d46a2
Actionable comments posted: 3
🧹 Nitpick comments (2)
source/administration-guide/scale/high-availability-cluster-based-deployment.rst (2)
385-386: ⚡ Quick win
Simplify the health check explanation for novice administrators.
The current explanation uses technical jargon ("passive quarantining", "proxy_next_upstream") without context, which may confuse administrators who are not deeply familiar with NGINX internals. The sentence is also dense, combining multiple concepts.
As per coding guidelines, evaluate documentation through the lens of Novice Nate, a novice IT Administrator with 1-2 years of experience who wants to understand commands before running them. Define technical terms briefly inline on first use and explain the 'why' behind settings when it matters for reader confidence.
📝 Suggested clearer explanation
-6. **Configure health checks:** The upstream block above sets ``max_fails=0``, which disables NGINX's passive quarantining - backends are not marked unavailable based on failed-request counts. Individual failed requests are still retried against other backends via NGINX's default ``proxy_next_upstream`` behavior. To detect failed servers and remove them from rotation, monitor the Mattermost API endpoint ``http://SERVER_IP:8065/api/v4/system/ping`` (which returns ``Status 200`` for healthy servers) using an active health check at your load balancer or monitoring system.
+6. **Configure health checks:** The upstream block above sets ``max_fails=0``, which tells NGINX to keep sending requests to all backend servers even if some requests fail. This is appropriate for high availability because Mattermost servers can handle brief errors without needing to be removed from rotation. If a request to one server fails, NGINX automatically retries it on another server in the cluster. To proactively monitor server health and detect servers that are truly down, configure your load balancer or monitoring system to check the Mattermost API health endpoint ``http://SERVER_IP:8065/api/v4/system/ping`` (which returns ``Status 200`` when healthy).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@source/administration-guide/scale/high-availability-cluster-based-deployment.rst` around lines 385 - 386, Rewrite the paragraph that explains health checks to be simpler and beginner-friendly: briefly define the setting max_fails=0 (disables passive failure-based removal) and proxy_next_upstream (NGINX will try another backend on request failure), state why those defaults matter (they prevent automatic removal of unhealthy backends), and then give clear actionable guidance to use an active health check against the endpoint http://SERVER_IP:8065/api/v4/system/ping (returns HTTP 200 when healthy) so load balancers or monitoring systems can detect and remove failed servers; keep sentences short, avoid jargon, and define each technical term inline on first use.
711-711: ⚡ Quick win
Provide more specific guidance or links for Aurora parameter group tuning.
This paragraph mentions parameter groups and tuning but doesn't provide actionable guidance on when tuning is needed or what to monitor for. For Novice Nate, "tune memory, replication, and other settings" is too vague to act on, and for Veteran Vince, "generally well-tuned" doesn't specify which workload characteristics might require tuning.
As per coding guidelines, explain the 'why' behind settings when it matters for reader confidence, and provide links to relevant documentation when referencing external systems.
📋 Suggested improvement
Consider adding:
- A link to AWS documentation on RDS parameter groups
- Brief guidance on specific metrics to monitor (e.g., CPU, connection count, query latency)
- When to consider tuning (e.g., "if you see high CPU utilization or query timeouts")
Example:
Amazon RDS and Aurora expose PostgreSQL configuration through `DB parameter groups <https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_WorkingWithParamGroups.html>`_. Aurora's defaults are well-tuned for most workloads, but you should monitor CPU utilization, connection count, and query latency using Amazon CloudWatch and RDS Performance Insights. If you consistently see high CPU usage (>80%) or connection saturation, consider tuning ``max_connections`` or memory-related parameters in the parameter group.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@source/administration-guide/scale/high-availability-cluster-based-deployment.rst` at line 711, Update the paragraph about Aurora/RDS parameter groups to include a link to AWS parameter group docs, list specific metrics to monitor (e.g., CPU utilization, connection count, query latency, read/write IOPS, replica lag) and give concrete thresholds/triggers for tuning (e.g., CPU >80%, sustained high query latency, connection saturation or replica lag) and mention common parameters to adjust (e.g., max_connections, work_mem, shared_buffers, wal_level) so readers know what to monitor and when to edit the DB parameter group; reference the existing sentence about "DB parameter groups" / "Aurora's defaults" to replace the vague phrase with these actionable items and the AWS docs link.
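The tuning triggers proposed in this comment can be expressed as a small check over metrics already collected from CloudWatch or Performance Insights. This is a hedged sketch only: `tuning_signals` is a hypothetical helper, the 80% CPU threshold comes from the reviewer's example, and the 0.9 connection ratio is an assumption chosen for illustration, not an AWS or Mattermost recommendation.

```python
def tuning_signals(cpu_percent, connections, max_connections,
                   cpu_threshold=80.0, connection_ratio=0.9):
    """Return the names of the conditions that suggest revisiting the parameter group."""
    signals = []
    if cpu_percent > cpu_threshold:
        signals.append("high_cpu")
    # Treat the pool as saturated once usage reaches the configured share of max_connections.
    if max_connections > 0 and connections / max_connections >= connection_ratio:
        signals.append("connection_saturation")
    return signals
```

A single sample proves little; the reviewer's wording ("consistently see") implies evaluating this over a sustained window before changing `max_connections` or memory parameters.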
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@source/administration-guide/configure/environment-configuration-settings.rst`:
- Line 789: Update the misspelled visible link text "high availablility" to
"high availability" in the ref link(s) where the phrase appears; locate the ref
role usages like ":ref:`high availablility database configuration
<administration-guide/scale/high-availability-cluster-based-deployment:database>`"
and correct the visible label to ":ref:`high availability database configuration
<administration-guide/scale/high-availability-cluster-based-deployment:database>`"
in all three occurrences (around lines referenced) so the visible link text and
searches use the correct spelling.
In
`@source/administration-guide/scale/high-availability-cluster-based-deployment.rst`:
- Around line 990-994: Add a verification step before the promotion command to
ensure the replica is fully caught up: instruct the reader to run a replication
status check on the replica (using pg_last_wal_receive_lsn(),
pg_last_wal_replay_lsn() and comparing them, e.g., checking that
pg_last_wal_replay_lsn() = pg_last_wal_receive_lsn() evaluates to true) and wait
until synced is true before running the promotion command; then keep the
existing promotion instruction that runs sudo -u postgres pg_ctl promote -D
DATA_DIR and add an expected-success note indicating the replica should become
primary and accept writes.
- Around line 761-769: The step that runs pg_basebackup lacks an explicit
warning and prerequisites; update the "Create the replica" section to (1) add an
important notice that the operation will replace the replica's data directory
and that backups should be taken, (2) list prerequisites (stop PostgreSQL on the
replica, verify network connectivity to PRIMARY_IP, ensure DATA_DIR is empty,
and have a backup of any existing data), (3) break the procedure into atomic
numbered steps: stop PostgreSQL (e.g. systemctl stop postgresql), confirm/clear
DATA_DIR, then run the pg_basebackup command shown (referencing pg_basebackup,
PRIMARY_IP, DATA_DIR, replication_user, and the -R flag), and (4) add a short
note explaining that -R creates standby.signal and configures replication.
4. **Create the replica** using ``pg_basebackup``. Stop PostgreSQL on the replica and ensure the data directory is empty before running this command.

   .. code-block:: bash

      # On the replica server, create base backup from primary.
      # Replace DATA_DIR with the platform-specific data directory:
      #   Ubuntu/Debian: /var/lib/postgresql/{version}/main
      #   RHEL/CentOS:   /var/lib/pgsql/{version}/data
      sudo -u postgres pg_basebackup -h PRIMARY_IP -D DATA_DIR -U replication_user -P -v -R -X stream
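The "data directory must be empty" precondition in the step quoted above can be enforced before the command is ever run. The sketch below is illustrative and not part of the docs: `data_dir_is_empty` and `basebackup_command` are hypothetical helpers, and the script is assumed to run as a user that can list the data directory (typically `postgres` or root).

```python
import os


def data_dir_is_empty(data_dir):
    """True if data_dir does not exist or contains no entries."""
    return not os.path.isdir(data_dir) or not os.listdir(data_dir)


def basebackup_command(primary_host, data_dir, user="replication_user"):
    """Build the pg_basebackup argv quoted above, refusing a non-empty target."""
    if not data_dir_is_empty(data_dir):
        raise RuntimeError(f"{data_dir} is not empty; pg_basebackup needs an empty target")
    return ["sudo", "-u", "postgres", "pg_basebackup",
            "-h", primary_host, "-D", data_dir, "-U", user,
            "-P", "-v", "-R", "-X", "stream"]
```

The returned list can be handed to `subprocess.run(cmd, check=True)`. Building the argv as a list rather than a shell string avoids quoting problems if the data directory path contains spaces.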
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Add explicit backup warning and clarify prerequisites for replica creation.
Creating a replica with pg_basebackup is a destructive operation that requires stopping PostgreSQL and ensuring an empty data directory. The current documentation mentions this casually but doesn't emphasize the risk or provide a clear prerequisite checklist.
As per coding guidelines, list prerequisites clearly, use numbered atomic steps for procedures, and explain the 'why' behind commands when it matters for reader confidence. For a destructive operation like this, Novice Nate needs explicit warnings and a clear sequence.
⚠️ Suggested improvement
Consider restructuring step 4 to make prerequisites explicit:

4. **Create the replica** using ``pg_basebackup``.

   .. important::

      This operation will completely replace the replica's data directory. Ensure you have:

      - A backup of any existing data on the replica server
      - Verified network connectivity from the replica to the primary server
      - Stopped PostgreSQL on the replica server

   On the replica server, stop PostgreSQL and clear the data directory:

   .. code-block:: bash

      sudo systemctl stop postgresql
      # Ensure the data directory is empty
      # Ubuntu/Debian: /var/lib/postgresql/{version}/main
      # RHEL/CentOS: /var/lib/pgsql/{version}/data

   Create the base backup from the primary:

   .. code-block:: bash

      # Replace DATA_DIR with your platform-specific data directory
      # Replace PRIMARY_IP with your primary server's IP address
      sudo -u postgres pg_basebackup -h PRIMARY_IP -D DATA_DIR -U replication_user -P -v -R -X stream

   The ``-R`` flag automatically creates the ``standby.signal`` file and configures replication settings.
# On the replica server.
# Replace DATA_DIR with the platform-specific data directory:
#   Ubuntu/Debian: /var/lib/postgresql/{version}/main
#   RHEL/CentOS:   /var/lib/pgsql/{version}/data
sudo -u postgres pg_ctl promote -D DATA_DIR
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Add verification step before promoting replica to primary.
Promoting a replica to primary without verifying that it has fully caught up with replication can result in data loss. The current instructions don't include a verification step to check replication lag before promotion.
As per coding guidelines, evaluate documentation through the lens of Novice Nate and include expected output or success checks after key steps to help readers verify progress. For a critical operation like database failover, Veteran Vince would expect verification guidance.
🔍 Suggested verification step
Add a verification step before the promotion command:

1. **Verify replica is synchronized** before promoting:

   On the replica server, check replication status:

   .. code-block:: sql

      SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(),
             pg_last_wal_replay_lsn() = pg_last_wal_receive_lsn() AS synced;

   Wait until ``synced`` returns ``true``, indicating the replica has applied all received WAL data.

2. **Promote the replica** to primary:

   For self-managed PostgreSQL:

   .. code-block:: bash

      # On the replica server.
      # Replace DATA_DIR with the platform-specific data directory:
      # Ubuntu/Debian: /var/lib/postgresql/{version}/main
      # RHEL/CentOS: /var/lib/pgsql/{version}/data
      sudo -u postgres pg_ctl promote -D DATA_DIR

   For Amazon RDS:

   Use the AWS Console or CLI to promote the read replica to a standalone instance.
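For a script that polls the replica before promotion, the two LSN values returned by the suggested query can be compared numerically, which also shows how many bytes of WAL remain unreplayed. This is a sketch with hypothetical helper names; it relies only on the documented LSN text format, two hexadecimal numbers separated by a slash, where the first is the high 32 bits of the byte position.

```python
def lsn_to_int(lsn):
    """Convert a PostgreSQL LSN such as '0/3000060' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replay_lag_bytes(receive_lsn, replay_lsn):
    """Bytes of received WAL that the replica has not replayed yet."""
    return lsn_to_int(receive_lsn) - lsn_to_int(replay_lsn)


def is_synced(receive_lsn, replay_lsn):
    """True when the replica has replayed everything it has received."""
    return replay_lag_bytes(receive_lsn, replay_lsn) == 0
```

As with the `synced` column in the query, a zero lag only says the replica has applied what it received. If the primary is still reachable, stopping writes there first is what guarantees nothing newer is left behind.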
Pre-existing typo flagged by CodeRabbit while reviewing the heading-rename ref-target updates from the previous commit. Three occurrences in the visible link label of cross-references to the HA deployment docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Newest code from mattermost has been published to preview environment for Git SHA e413b59
Generated with Claude Code