
Re-organize high availability cluster-based deployment docs#8812

Open
roberson-io wants to merge 5 commits into master from claude/issue-8811-20260310-1635

Conversation

@roberson-io
Contributor

@roberson-io roberson-io commented Mar 10, 2026

Restructure the HA deployment documentation to match admin workflow:

  • Add Preparation section with pre-deployment guidance
  • Add Deployment guide section with step-by-step instructions
  • Add Next steps section for scaling optimizations
  • Preserve Operations and maintenance section for advanced topics
  • Emphasize PostgreSQL hot_standby and hot_standby_feedback settings
  • Guide admins toward database configuration over config.json

Closes #8811

Generated with Claude Code

Co-authored-by: Michael Roberson <roberson-io@users.noreply.github.com>
@github-actions
Contributor

Newest code from mattermost has been published to preview environment for Git SHA 1bb1c44

@roberson-io roberson-io requested a review from neillcollie March 10, 2026 17:17
@github-actions
Contributor

Newest code from mattermost has been published to preview environment for Git SHA e5b2253

@coderabbitai
Contributor

coderabbitai Bot commented Mar 13, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0725dbb1-7ca4-400b-8916-7daa5b84d902

📥 Commits

Reviewing files that changed from the base of the PR and between 23d46a2 and e413b59.

📒 Files selected for processing (1)
  • source/administration-guide/configure/environment-configuration-settings.rst
✅ Files skipped from review due to trivial changes (1)
  • source/administration-guide/configure/environment-configuration-settings.rst

📝 Walkthrough


Reorganizes the high-availability cluster deployment guide with a new Preparation section, prescriptive deployment steps, expanded proxy/storage/database/upgrade/troubleshooting content, and several cross-reference anchor fixes across related docs.

Changes

High-Availability Documentation Restructure and Cross-Link Fixes

| Layer / File(s) | Summary |
|---|---|
| **Preparation and full deployment workflow**<br>`source/administration-guide/scale/high-availability-cluster-based-deployment.rst` | Adds Preparation and a prescriptive multi-step HA deployment workflow covering Mattermost server provisioning (Kubernetes vs non-Kubernetes), systemd limits, sysctl/network tuning, time sync, per-node pre-clustering checks, cluster settings via mmctl, shared storage, and DB replica guidance. |
| **Job server and plugins subsection**<br>`source/administration-guide/scale/high-availability-cluster-based-deployment.rst` | Adds Job server section and scheduling guidance requiring JobSettings.RunScheduler=true, with mmctl verification/set commands; introduces Plugins and High Availability subsection header. |
| **CLI, configuration update workflows, and rolling/server updates**<br>`source/administration-guide/scale/high-availability-cluster-based-deployment.rst` | Replaces CLI guidance (CLI runs on a single node; recommend mmctl), separates config update paths (mmctl vs config.json), warns about System Console vs config.json divergence, and adds rolling dot-release update and interruption criteria for upgrades. |
| **NGINX sequencing and continuous-operation constraints**<br>`source/administration-guide/scale/high-availability-cluster-based-deployment.rst` | Adds explicit NGINX stop/upgrade/start/restart sequencing for service-interruption upgrades, documents applying config.json from backups, and adds gossip/protocol single-protocol requirement and continuous-operation constraints. |
| **FAQ and troubleshooting revisions**<br>`source/administration-guide/scale/high-availability-cluster-based-deployment.rst` | Adjusts FAQ heading, adds "Capture high availability troubleshooting data" heading, and rewrites troubleshooting entries for continuous config refresh and message posting/reload behaviours, splitting solutions by DB-backed config vs config.json. |
| **Cross-reference and small doc anchor fixes**<br>`source/administration-guide/configure/environment-configuration-settings.rst`, `source/administration-guide/manage/statistics.rst`, `source/administration-guide/scale/backing-storage-benchmarks.rst`, `source/administration-guide/upgrade/enterprise-roll-out-checklist.rst` | Corrects Sphinx :ref: targets and anchor links to point to the updated high-availability anchors for AWS RDS guidance, Replica DB Conns, backing-storage testing notes, and the enterprise roll-out proxy-server checklist link. |
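
The job scheduler check mentioned in the walkthrough could look roughly like the sketch below. It assumes `mmctl` is installed and already authenticated against the cluster, and that `mmctl config get` prints a value containing the word `true` when the setting is enabled. The function name `ensure_run_scheduler` is hypothetical and not part of the PR.

```shell
#!/bin/sh
# Hypothetical helper: make sure JobSettings.RunScheduler is true.
# Assumes mmctl is installed and already authenticated against the cluster.
ensure_run_scheduler() {
    current="$(mmctl config get JobSettings.RunScheduler)" || return 1
    # Assumption: the output contains the word "true" when the scheduler is enabled.
    if printf '%s' "$current" | grep -qw 'true'; then
        echo "RunScheduler already enabled"
    else
        mmctl config set JobSettings.RunScheduler true
    fi
}

# Usage (run once, from any node that can reach the cluster):
# ensure_run_scheduler
```

Checking before setting keeps the helper safe to rerun, since it only writes when the value is not already `true`.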

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the main change: reorganizing the high-availability cluster-based deployment documentation. |
| Description check | ✅ Passed | The description is directly related to the changeset, outlining the restructured sections and documentation improvements made. |
| Linked Issues check | ✅ Passed | The pull request successfully addresses all key objectives from issue #8811: creates Preparation, Deployment guide, and Next steps sections; preserves Operations and maintenance, FAQ, and Troubleshooting sections; emphasizes PostgreSQL hot_standby settings; and guides toward database configuration. |
| Out of Scope Changes check | ✅ Passed | All changes are within scope. The PR includes the main restructuring of high-availability documentation and necessary cross-reference updates in related documentation files to maintain consistency. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
source/administration-guide/scale/high-availability-cluster-based-deployment.rst (2)

540-540: Defaulting NFS mounts to soft is risky for application data consistency.

The examples on Lines 540/548/563 use rw,soft,intr, while Line 573 only notes hard,intr as an alternative. For shared app data, soft can return I/O errors under transient network issues and may cause partial writes.

Suggested doc adjustment
- sudo mount -t nfs -o rw,soft,intr NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data
+ sudo mount -t nfs -o rw,hard,intr NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data
...
- NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data nfs rw,soft,intr 0 0
+ NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data nfs rw,hard,intr 0 0

Also applies to: 548-548, 563-563, 573-573

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@source/administration-guide/scale/high-availability-cluster-based-deployment.rst`
at line 540, The NFS mount examples use the risky option "rw,soft,intr" which
can cause I/O errors and partial writes; change the example mount options to use
"rw,hard,intr" (or show both options and explicitly recommend "hard,intr" for
application data) and add a brief explanatory note after the command that for
shared application data you should prefer hard mounts to avoid
transient-network-induced partial writes and data corruption; update every
occurrence of the example mount command (the string "sudo mount -t nfs -o
rw,soft,intr NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data") and add a
short caution sentence referencing "hard,intr" as the recommended setting.
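
To make the soft-versus-hard concern easy to check on an existing host, here is a minimal sketch that flags NFS option strings containing `soft`. The function name `nfs_opts_safe` is hypothetical, and the check only inspects the option text; it does not test the mount itself.

```shell
#!/bin/sh
# Hypothetical helper: flag NFS mount option strings that use "soft".
# Returns 0 when the options look safe for shared application data.
nfs_opts_safe() {
    case ",$1," in
        *,soft,*)
            echo "warning: 'soft' can surface I/O errors on transient outages; prefer 'hard'" >&2
            return 1 ;;
        *)
            return 0 ;;
    esac
}

# Usage: check the options field (4th column) of every NFS entry in /etc/fstab
# awk '$3 ~ /^nfs/ {print $4}' /etc/fstab | while read -r opts; do
#     nfs_opts_safe "$opts"
# done
```

Wrapping the option string in commas before matching avoids false positives on options that merely start with the same letters.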

709-709: Line 709 incorrectly claims RDS does not expose PostgreSQL configuration access.

Both Aurora and RDS PostgreSQL expose configuration through DB parameter groups and cluster parameter groups, allowing administrators to tune settings like memory allocation, replication parameters, and other PostgreSQL options. Reword to clarify these configuration mechanisms are available, rather than suggesting configuration is unavailable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@source/administration-guide/scale/high-availability-cluster-based-deployment.rst`
at line 709, Replace the incorrect sentence "Amazon RDS does not expose direct
PostgreSQL configuration access" with a clarified statement that both Amazon RDS
(for PostgreSQL) and Amazon Aurora expose configuration via DB parameter groups
and cluster parameter groups; mention that these parameter groups allow tuning
of memory, replication, and other PostgreSQL options and that monitoring should
still be done via CloudWatch and RDS Performance Insights (locate the sentence
containing "Amazon RDS does not expose direct PostgreSQL configuration access"
in high-availability-cluster-based-deployment.rst and update it to reference DB
parameter groups/cluster parameter groups and retained monitoring guidance).
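
As an illustration of the parameter-group point, the sketch below reads a single setting from an RDS DB parameter group with the AWS CLI. It assumes the AWS CLI is installed and configured; the function name `show_rds_param` and the group name in the usage line are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print one parameter from an RDS DB parameter group.
# Assumes the AWS CLI is installed and configured with credentials.
show_rds_param() {
    group="$1"
    param="$2"
    aws rds describe-db-parameters \
        --db-parameter-group-name "$group" \
        --query "Parameters[?ParameterName=='$param'].[ParameterName,ParameterValue]" \
        --output text
}

# Usage (group name is a placeholder):
# show_rds_param my-mattermost-pg hot_standby_feedback
#
# For Aurora cluster-level settings, the analogous command is
# `aws rds describe-db-cluster-parameters --db-cluster-parameter-group-name NAME`.
```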
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@source/administration-guide/scale/high-availability-cluster-based-deployment.rst`:
- Around line 360-383: Clarify the health-check paragraph to explain the
difference between passive peer quarantining (controlled by max_fails=0) and
per-request active failover (controlled by proxy_next_upstream and similar
settings): explicitly state that setting max_fails=0 disables passive marking of
an upstream as unavailable, while NGINX can still perform per-request failover
using proxy_next_upstream so requests may be retried to other backends even when
peers are not quarantined; mention the Mattermost API ping endpoint
(http://SERVER_IP:8065/api/v4/system/ping) as a way to monitor server health but
note it does not change the distinction between passive and active failure
handling.
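
Because `max_fails=0` leaves failed backends in rotation, an external active check is what actually detects a dead node. A minimal sketch of such a probe against the ping endpoint follows; the function names and the example IP addresses are hypothetical, and port 8065 is taken from the endpoint quoted above.

```shell
#!/bin/sh
# Return the HTTP status code of a backend's ping endpoint.
# curl prints "000" when the server cannot be reached at all.
ping_status() {
    curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
        "http://$1:8065/api/v4/system/ping"
}

# Map a status code to a health verdict; only 200 counts as healthy.
classify_status() {
    if [ "$1" = "200" ]; then
        echo healthy
    else
        echo unhealthy
    fi
}

# Usage (IP addresses are placeholders):
# for ip in 10.0.0.11 10.0.0.12; do
#     echo "$ip $(classify_status "$(ping_status "$ip")")"
# done
```

This only reports health. Removing an unhealthy node from the upstream block is still a manual step or a job for the load balancer's own health-check feature.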


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 85b65ef1-a743-47fb-ae48-49f1d75d59fd

📥 Commits

Reviewing files that changed from the base of the PR and between 741ca93 and e5b2253.

📒 Files selected for processing (1)
  • source/administration-guide/scale/high-availability-cluster-based-deployment.rst

Comment thread source/administration-guide/scale/high-availability-cluster-based-deployment.rst Outdated
@Combs7th
Contributor

@neillcollie - Would you be able to help give this a technical review when you're able?


@neillcollie neillcollie left a comment


Massive uplift compared with previous HA docs.
I recently implemented an HA environment, and my notes would have been to include additional NFS details. This PR now covers NFS comprehensively.

Resolved conflict in source/administration-guide/scale/high-availability-cluster-based-deployment.rst:
adopted master's corrected HA sysctl values from #8939 (tcp_rmem/tcp_wmem max
2500000, net.core.{r,w}mem_max 16777216, dropped rmem_default/wmem_default/
tcp_mem) while keeping this branch's restructure into a sudo tee heredoc with
the added vm.min_free_kbytes line.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 26, 2026 21:23
@github-actions
Copy link
Copy Markdown
Contributor

Newest code from mattermost has been published to preview environment for Git SHA 8a50cb5

Contributor

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
source/administration-guide/scale/high-availability-cluster-based-deployment.rst (1)

385-386: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify health check behavior in relation to max_fails=0 setting.

Line 385 states "NGINX automatically stops routing traffic to backend servers that fail to respond," which may mislead readers given the max_fails=0 setting on lines 286-287. With max_fails=0, NGINX does not mark backends as unavailable based on failure count (passive quarantining is disabled), though per-request failover via proxy_next_upstream can still retry individual requests to other backends. Consider clarifying that the monitoring endpoint helps detect issues, but with max_fails=0, failed backends remain in the rotation unless removed manually or via external health checks.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@source/administration-guide/scale/high-availability-cluster-based-deployment.rst`
around lines 385 - 386, The guidance about NGINX removing unhealthy backends is
misleading given the existing upstream config using max_fails=0; update the
paragraph that mentions the Mattermost health endpoint
(http://SERVER_IP:8065/api/v4/system/ping) to clarify that with max_fails=0 the
upstream will not be passively marked down based on failure count, that
proxy_next_upstream can still retry individual requests, and recommend using
external active health checks or manual removal of nodes to take servers out of
rotation; reference the max_fails=0 and proxy_next_upstream settings so the
reader can correlate behavior with the config.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0836be12-b550-4155-b2ae-a22d478236ba

📥 Commits

Reviewing files that changed from the base of the PR and between e5b2253 and 8a50cb5.

📒 Files selected for processing (1)
  • source/administration-guide/scale/high-availability-cluster-based-deployment.rst

Contributor

Copilot AI left a comment


Pull request overview

This PR reorganizes the Mattermost high availability cluster deployment documentation around the admin workflow, adding preparation, deployment, next steps, and operations/maintenance sections while expanding guidance for file storage, proxy, database replicas, and production PostgreSQL settings.

Changes:

  • Adds workflow-oriented HA preparation and deployment guidance.
  • Expands NGINX, S3/NFS storage, PostgreSQL replica, and failover documentation.
  • Moves operational topics such as cluster discovery, jobs, plugins, upgrades, FAQ, and troubleshooting under maintenance-oriented sections.

1. Back up your Mattermost database and the file storage location. See the :doc:`backup </deployment-guide/backup-disaster-recovery>` documentation for details.
2. Modify your NGINX setup to remove the server. See the :ref:`proxy server configuration <deployment-guide/server/setup-nginx-proxy:manage the nginx process>` documentation for details.
3. Open **System Console > Environment > High Availability** to verify that all the machines remaining in the cluster are communicating as expected with green status indicators. If not, investigate the log files for any extra information.
- **Non-Kubernetes deployments:** Follow the :doc:`Deploy Mattermost on Linux </deployment-guide/server/deploy-mattermost-on-linux>` instructions to install the same version of Mattermost on each additional server.
sudo systemctl restart mattermost

9. **Verify cluster communication:** Open **System Console > Environment > High Availability** to verify that each server in the cluster is communicating as expected with green status indicators. If not, investigate the log files for additional information.


.. note::
7. **Verify proxy functionality:** Test access through the proxy using your configured domain name and verify traffic is distributed across backend servers by checking Mattermost server logs.

3. Updating the Mattermost configuration to point to the new storage
4. Verifying that all files are accessible



If you have non-standard (i.e. complex) network configurations, then you may need to use the :ref:`Override Hostname <administration-guide/configure/environment-configuration-settings:override hostname>` setting to help the cluster nodes discover each other. The cluster settings in the config are removed from the config file hash for this reason, meaning you can have slightly different cluster configuration settings in high availability mode. The Override Hostname is intended to be different for each clustered node if you need to force discovery.

If ``UseIpAddress`` is set to ``true``, the server attempts to obtain the IP address by searching for the first non-local IP address (non-loopback, non-link-local-unicast, non-link-local-multicast network interface). It enumerates the network interfaces using the built-in Go function `net.InterfaceAddrs() <https://pkg.go.dev/net#InterfaceAddrs>`_. Otherwise, it tries to get the hostname using the built-in Go function `os.Hostname() <https://pkg.go.dev/os#Hostname>`_.
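
To preview which address a node is likely to advertise, the selection rule above can be approximated in shell. This is a rough sketch, not Mattermost's implementation: it skips only loopback and link-local addresses (multicast is not handled), and the function name `first_non_local_ip` is hypothetical.

```shell
#!/bin/sh
# Rough approximation of the selection rule described above: print the first
# address that is neither loopback nor link-local. Not Mattermost's own code.
first_non_local_ip() {
    for addr in "$@"; do
        case "$addr" in
            127.*|::1|169.254.*|fe80:*) continue ;;
            *) echo "$addr"; return 0 ;;
        esac
    done
    return 1
}

# Usage on Linux (hostname -I lists the host's configured addresses):
# first_non_local_ip $(hostname -I)
```

If the printed address is not the one other nodes can reach, that is a hint that the Override Hostname setting may be needed.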

.. code-block:: text

NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data nfs rw,soft,intr 0 0

.. code-block:: text

192.168.1.100:/mnt/mattermost-data /opt/mattermost/data nfs rw,soft,intr 0 0
5. **Configure TLS:** For production deployments, configure TLS on your NGINX proxy. See :doc:`Set up TLS </deployment-guide/server/setup-tls>` for detailed instructions on configuring TLS with NGINX. You can either use Let's Encrypt for automatic certificate management or provide your own TLS certificates.

Use the :ref:`read replica <administration-guide/configure/environment-configuration-settings:read replicas>` feature to scale the database. The Mattermost server can be set up to use one master database and one or more read replica databases.
6. **Configure health checks:** NGINX automatically stops routing traffic to backend servers that fail to respond. You can monitor server health using the Mattermost API endpoint ``http://SERVER_IP:8065/api/v4/system/ping`` which returns ``Status 200`` for healthy servers.
Comment on lines +763 to +769
.. code-block:: bash

# On the replica server, create base backup from primary
sudo -u postgres pg_basebackup -h PRIMARY_IP -D /var/lib/postgresql/data -U replication_user -P -v -R -X stream

The ``-R`` flag automatically creates the ``standby.signal`` file and configures replication settings.
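
Since `pg_basebackup` needs an empty target directory, a small pre-flight check can prevent an accidental run against a populated one. The sketch below is an assumption-laden example: `dir_is_empty` is a hypothetical helper, `DATA_DIR` is a placeholder for your PostgreSQL data directory, and the check must run as a user that can read that directory.

```shell
#!/bin/sh
# Hypothetical pre-flight check: succeed only if the directory exists,
# is readable, and contains no entries (including dotfiles).
dir_is_empty() {
    [ -d "$1" ] || return 1
    # If the directory cannot be listed, treat it as not verified.
    contents="$(ls -A "$1")" || return 1
    [ -z "$contents" ]
}

# Usage on the replica, before running pg_basebackup:
# sudo systemctl stop postgresql
# dir_is_empty "$DATA_DIR" || { echo "refusing: $DATA_DIR is not empty" >&2; exit 1; }
```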

Comment on lines +987 to +989
# On the replica server
sudo -u postgres pg_ctl promote -D /var/lib/postgresql/data
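
Before promoting, it can help to confirm that the replica has replayed every WAL record it has received. The sketch below uses the PostgreSQL functions `pg_last_wal_receive_lsn()` and `pg_last_wal_replay_lsn()`; the shell function names are hypothetical, and the check assumes `psql` can connect locally as the `postgres` user.

```shell
#!/bin/sh
# Ask the replica whether every WAL record it has received has been replayed.
# With -At, psql prints a bare "t" or "f" for a boolean result.
replica_caught_up_query() {
    sudo -u postgres psql -At -c \
        "SELECT pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn();"
}

# Interpret the query output; anything other than "t" (including an empty
# result) is treated as not ready.
is_caught_up() {
    [ "$1" = "t" ]
}

# Usage on the replica:
# is_caught_up "$(replica_caught_up_query)" && echo "replica has replayed all received WAL"
```

A true result only means the replica has applied what it received. It does not prove the primary sent everything, so treat it as one signal alongside the primary's own state.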

- NFS mounts: switch shared data directory examples from rw,soft,intr to rw,hard
  to prevent partial writes / file corruption during transient NFS outages.
  Drop deprecated intr option (no-op since kernel 2.6.25). Invert the note to
  frame hard as the safe default and soft as a documented availability tradeoff.
- NGINX health checks: rewrite item 6 to accurately describe what max_fails=0
  does (disables passive quarantining), note that per-request failover via
  proxy_next_upstream still works, and direct admins to external active health
  checks against /api/v4/system/ping.
- Fix broken :doc: target deploy-mattermost-on-linux -> deploy-linux.
- Section rename anchor fixes: update the URL fragment in
  enterprise-roll-out-checklist.rst (#proxy-server-configuration ->
  #proxy-server) and the :ref: targets in backing-storage-benchmarks.rst,
  environment-configuration-settings.rst, and manage/statistics.rst to match
  the new "File storage" and "Database" headings.
- PostgreSQL replica/promote commands: replace the hardcoded
  /var/lib/postgresql/data with a DATA_DIR placeholder plus inline
  platform-specific paths (Ubuntu/Debian: /var/lib/postgresql/{version}/main,
  RHEL/CentOS: /var/lib/pgsql/{version}/data). Note that pg_basebackup requires
  the target data directory to be empty.
- Fix UseIpAddress -> UseIPAddress casing for consistency with the canonical
  ClusterSettings.UseIPAddress key used elsewhere in the doc.
- Correct the Amazon RDS / Aurora PostgreSQL configuration claim: both expose
  configuration through DB parameter groups (RDS) and DB cluster parameter
  groups (Aurora).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

Newest code from mattermost has been published to preview environment for Git SHA 23d46a2

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
source/administration-guide/scale/high-availability-cluster-based-deployment.rst (2)

385-386: ⚡ Quick win

Simplify the health check explanation for novice administrators.

The current explanation uses technical jargon ("passive quarantining", "proxy_next_upstream") without context, which may confuse administrators who are not deeply familiar with NGINX internals. The sentence is also dense, combining multiple concepts.

As per coding guidelines, evaluate documentation through the lens of Novice Nate—a novice IT Administrator with 1-2 years of experience who wants to understand commands before running them. Define technical terms briefly inline on first use and explain the 'why' behind settings when it matters for reader confidence.

📝 Suggested clearer explanation
-6. **Configure health checks:** The upstream block above sets ``max_fails=0``, which disables NGINX's passive quarantining - backends are not marked unavailable based on failed-request counts. Individual failed requests are still retried against other backends via NGINX's default ``proxy_next_upstream`` behavior. To detect failed servers and remove them from rotation, monitor the Mattermost API endpoint ``http://SERVER_IP:8065/api/v4/system/ping`` (which returns ``Status 200`` for healthy servers) using an active health check at your load balancer or monitoring system.
+6. **Configure health checks:** The upstream block above sets ``max_fails=0``, which tells NGINX to keep sending requests to all backend servers even if some requests fail. This is appropriate for high availability because Mattermost servers can handle brief errors without needing to be removed from rotation. If a request to one server fails, NGINX automatically retries it on another server in the cluster. To proactively monitor server health and detect servers that are truly down, configure your load balancer or monitoring system to check the Mattermost API health endpoint ``http://SERVER_IP:8065/api/v4/system/ping`` (which returns ``Status 200`` when healthy).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@source/administration-guide/scale/high-availability-cluster-based-deployment.rst`
around lines 385 - 386, Rewrite the paragraph that explains health checks to be
simpler and beginner-friendly: briefly define the setting max_fails=0 (disables
passive failure-based removal) and proxy_next_upstream (NGINX will try another
backend on request failure), state why those defaults matter (they prevent
automatic removal of unhealthy backends), and then give clear actionable
guidance to use an active health check against the endpoint
http://SERVER_IP:8065/api/v4/system/ping (returns HTTP 200 when healthy) so load
balancers or monitoring systems can detect and remove failed servers; keep
sentences short, avoid jargon, and define each technical term inline on first
use.

711-711: ⚡ Quick win

Provide more specific guidance or links for Aurora parameter group tuning.

This paragraph mentions parameter groups and tuning but doesn't provide actionable guidance on when tuning is needed or what to monitor for. For Novice Nate, "tune memory, replication, and other settings" is too vague to act on, and for Veteran Vince, "generally well-tuned" doesn't specify which workload characteristics might require tuning.

As per coding guidelines, explain the 'why' behind settings when it matters for reader confidence, and provide links to relevant documentation when referencing external systems.

📋 Suggested improvement

Consider adding:

  • A link to AWS documentation on RDS parameter groups
  • Brief guidance on specific metrics to monitor (e.g., CPU, connection count, query latency)
  • When to consider tuning (e.g., "if you see high CPU utilization or query timeouts")

Example:

Amazon RDS and Aurora expose PostgreSQL configuration through `DB parameter groups <https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_WorkingWithParamGroups.html>`_. Aurora's defaults are well-tuned for most workloads, but you should monitor CPU utilization, connection count, and query latency using Amazon CloudWatch and RDS Performance Insights. If you consistently see high CPU usage (>80%) or connection saturation, consider tuning ``max_connections`` or memory-related parameters in the parameter group.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@source/administration-guide/scale/high-availability-cluster-based-deployment.rst`
at line 711, Update the paragraph about Aurora/RDS parameter groups to include a
link to AWS parameter group docs, list specific metrics to monitor (e.g., CPU
utilization, connection count, query latency, read/write IOPS, replica lag) and
give concrete thresholds/triggers for tuning (e.g., CPU >80%, sustained high
query latency, connection saturation or replica lag) and mention common
parameters to adjust (e.g., max_connections, work_mem, shared_buffers,
wal_level) so readers know what to monitor and when to edit the DB parameter
group; reference the existing sentence about "DB parameter groups" / "Aurora's
defaults" to replace the vague phrase with these actionable items and the AWS
docs link.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@source/administration-guide/configure/environment-configuration-settings.rst`:
- Line 789: Update the misspelled visible link text "high availablility" to
"high availability" in the ref link(s) where the phrase appears; locate the ref
role usages like ":ref:`high availablility database configuration
<administration-guide/scale/high-availability-cluster-based-deployment:database>`"
and correct the visible label to ":ref:`high availability database configuration
<administration-guide/scale/high-availability-cluster-based-deployment:database>`"
in all three occurrences (around lines referenced) so the visible link text and
searches use the correct spelling.

In
`@source/administration-guide/scale/high-availability-cluster-based-deployment.rst`:
- Around line 990-994: Add a verification step before the promotion command to
ensure the replica is fully caught up: instruct the reader to run a replication
status check on the replica (using pg_last_wal_receive_lsn(),
pg_last_wal_replay_lsn() and comparing them, e.g., checking that
pg_last_wal_replay_lsn() = pg_last_wal_receive_lsn() evaluates to true) and wait
until synced is true before running the promotion command; then keep the
existing promotion instruction that runs sudo -u postgres pg_ctl promote -D
DATA_DIR and add an expected-success note indicating the replica should become
primary and accept writes.
- Around line 761-769: The step that runs pg_basebackup lacks an explicit
warning and prerequisites; update the "Create the replica" section to (1) add an
important notice that the operation will replace the replica's data directory
and that backups should be taken, (2) list prerequisites (stop PostgreSQL on the
replica, verify network connectivity to PRIMARY_IP, ensure DATA_DIR is empty,
and have a backup of any existing data), (3) break the procedure into atomic
numbered steps: stop PostgreSQL (e.g. systemctl stop postgresql), confirm/clear
DATA_DIR, then run the pg_basebackup command shown (referencing pg_basebackup,
PRIMARY_IP, DATA_DIR, replication_user, and the -R flag), and (4) add a short
note explaining that -R creates standby.signal and configures replication.

---

Nitpick comments:
In
`@source/administration-guide/scale/high-availability-cluster-based-deployment.rst`:
- Around line 385-386: Rewrite the paragraph that explains health checks to be
simpler and beginner-friendly: briefly define the setting max_fails=0 (disables
passive failure-based removal) and proxy_next_upstream (NGINX will try another
backend on request failure), state why those defaults matter (they prevent
automatic removal of unhealthy backends), and then give clear actionable
guidance to use an active health check against the endpoint
http://SERVER_IP:8065/api/v4/system/ping (returns HTTP 200 when healthy) so load
balancers or monitoring systems can detect and remove failed servers; keep
sentences short, avoid jargon, and define each technical term inline on first
use.
- Line 711: Update the paragraph about Aurora/RDS parameter groups to include a
link to AWS parameter group docs, list specific metrics to monitor (e.g., CPU
utilization, connection count, query latency, read/write IOPS, replica lag) and
give concrete thresholds/triggers for tuning (e.g., CPU >80%, sustained high
query latency, connection saturation or replica lag) and mention common
parameters to adjust (e.g., max_connections, work_mem, shared_buffers,
wal_level) so readers know what to monitor and when to edit the DB parameter
group; reference the existing sentence about "DB parameter groups" / "Aurora's
defaults" to replace the vague phrase with these actionable items and the AWS
docs link.
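
For the health-check nitpick above (lines 385-386), an active check could be sketched roughly as follows. This is a minimal illustration, not the documented procedure: the function names `status_is_healthy`, `probe_ping`, and `check_server` are hypothetical, the 5-second timeout is an assumed value, and the only detail taken from the review is that the ping endpoint returns HTTP 200 when healthy.

```shell
# Hypothetical helper: only HTTP 200 counts as healthy, per the ping endpoint above.
status_is_healthy() {
  [ "$1" = "200" ]
}

# Probe one Mattermost server and print the HTTP status code.
# $1 is the server address (SERVER_IP in the docs). curl prints 000 if it cannot connect.
# The 5-second timeout (-m 5) is an assumed value, not one taken from the docs.
probe_ping() {
  curl -s -o /dev/null -m 5 -w '%{http_code}' "http://$1:8065/api/v4/system/ping"
}

# Report whether a server should stay in the load balancer pool.
check_server() {
  local code
  code="$(probe_ping "$1")"
  if status_is_healthy "$code"; then
    echo "healthy"
  else
    echo "unhealthy (HTTP $code)"
    return 1
  fi
}
```

Calling `check_server SERVER_IP` from a monitoring job would then give a non-zero exit status for any server that does not answer the ping with HTTP 200.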
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: df6df8a1-1040-44af-8e1d-0c3ba18d61ce

📥 Commits

Reviewing files that changed from the base of the PR and between 8a50cb5 and 23d46a2.

📒 Files selected for processing (5)
  • source/administration-guide/configure/environment-configuration-settings.rst
  • source/administration-guide/manage/statistics.rst
  • source/administration-guide/scale/backing-storage-benchmarks.rst
  • source/administration-guide/scale/high-availability-cluster-based-deployment.rst
  • source/administration-guide/upgrade/enterprise-roll-out-checklist.rst
✅ Files skipped from review due to trivial changes (2)
  • source/administration-guide/upgrade/enterprise-roll-out-checklist.rst
  • source/administration-guide/manage/statistics.rst

Comment thread source/administration-guide/configure/environment-configuration-settings.rst Outdated
Comment on lines +761 to +769
4. **Create the replica** using ``pg_basebackup``. Stop PostgreSQL on the replica and ensure the data directory is empty before running this command.

   .. code-block:: bash

      # On the replica server, create base backup from primary.
      # Replace DATA_DIR with the platform-specific data directory:
      #   Ubuntu/Debian: /var/lib/postgresql/{version}/main
      #   RHEL/CentOS: /var/lib/pgsql/{version}/data
      sudo -u postgres pg_basebackup -h PRIMARY_IP -D DATA_DIR -U replication_user -P -v -R -X stream

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add explicit backup warning and clarify prerequisites for replica creation.

Creating a replica with pg_basebackup is a destructive operation that requires stopping PostgreSQL and ensuring an empty data directory. The current documentation mentions this casually but doesn't emphasize the risk or provide a clear prerequisite checklist.

As per coding guidelines, list prerequisites clearly, use numbered atomic steps for procedures, and explain the 'why' behind commands when it matters for reader confidence. For a destructive operation like this, Novice Nate needs explicit warnings and a clear sequence.

⚠️ Suggested improvement

Consider restructuring step 4 to make prerequisites explicit:

4. **Create the replica** using ``pg_basebackup``.

   .. important::
   
      This operation will completely replace the replica's data directory. Ensure you have:
      
      - A backup of any existing data on the replica server
      - Verified network connectivity from the replica to the primary server
      - Stopped PostgreSQL on the replica server

   On the replica server, stop PostgreSQL and clear the data directory:

   .. code-block:: bash

      sudo systemctl stop postgresql
      # Ensure the data directory is empty
      # Ubuntu/Debian: /var/lib/postgresql/{version}/main
      # RHEL/CentOS: /var/lib/pgsql/{version}/data

   Create the base backup from the primary:

   .. code-block:: bash

      # Replace DATA_DIR with your platform-specific data directory
      # Replace PRIMARY_IP with your primary server's IP address
      sudo -u postgres pg_basebackup -h PRIMARY_IP -D DATA_DIR -U replication_user -P -v -R -X stream

   The ``-R`` flag automatically creates the ``standby.signal`` file and configures replication settings.
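
The "ensure the data directory is empty" comment in the suggestion could be backed by an explicit guard, sketched here under assumptions: `dir_is_empty` and `prepare_replica_dir` are hypothetical helper names, and the check must run as a user who can read the data directory (for example via `sudo -u postgres`).

```shell
# Succeeds only if the directory exists and has no entries, dotfiles included.
dir_is_empty() {
  [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}

# Refuse to continue unless DATA_DIR is present and empty.
# $1 is the platform-specific data directory (DATA_DIR in the docs).
prepare_replica_dir() {
  local data_dir
  data_dir="$1"
  if dir_is_empty "$data_dir"; then
    echo "ok: $data_dir is empty"
  else
    echo "refusing: $data_dir is missing or not empty" >&2
    return 1
  fi
}
```

Running `prepare_replica_dir DATA_DIR` before `pg_basebackup` turns the comment into a step that stops the procedure if existing data is still in place.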

Comment on lines +990 to +994
# On the replica server.
# Replace DATA_DIR with the platform-specific data directory:
# Ubuntu/Debian: /var/lib/postgresql/{version}/main
# RHEL/CentOS: /var/lib/pgsql/{version}/data
sudo -u postgres pg_ctl promote -D DATA_DIR

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add verification step before promoting replica to primary.

Promoting a replica to primary without verifying it has fully caught up with replication can result in data loss. The current instructions don't include a verification step to check replication lag before promotion.

As per coding guidelines, evaluate documentation through the lens of Novice Nate and include expected output or success checks after key steps to help readers verify progress. For a critical operation like database failover, Veteran Vince would expect verification guidance.

🔍 Suggested verification step

Add a verification step before the promotion command:

1. **Verify replica is synchronized** before promoting:

   On the replica server, check replication status:

   .. code-block:: sql

      SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(), 
             pg_last_wal_replay_lsn() = pg_last_wal_receive_lsn() AS synced;

   Wait until ``synced`` returns ``true``, indicating the replica has applied all received WAL data.

2. **Promote the replica** to primary:

   For self-managed PostgreSQL:

   .. code-block:: bash

      # On the replica server.
      # Replace DATA_DIR with the platform-specific data directory:
      #   Ubuntu/Debian: /var/lib/postgresql/{version}/main
      #   RHEL/CentOS: /var/lib/pgsql/{version}/data
      sudo -u postgres pg_ctl promote -D DATA_DIR

   For Amazon RDS:

   Use the AWS Console or CLI to promote the read replica to a standalone instance.
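
For self-managed PostgreSQL, the "wait until synced" instruction could be scripted along these lines. This is a sketch under assumptions: `query_synced` and `wait_until_synced` are hypothetical names, the retry count and delay defaults are arbitrary, and `psql` is assumed to run as the `postgres` OS user on the replica.

```shell
# Run the sync comparison from the suggestion above; psql -At prints t or f.
# If the replica has never received WAL, the comparison is NULL and prints an empty line.
query_synced() {
  sudo -u postgres psql -Atc \
    "SELECT pg_last_wal_replay_lsn() = pg_last_wal_receive_lsn();"
}

# Poll until the replica reports synced.
# $1: maximum number of checks (assumed default 30); $2: seconds between checks (assumed default 2).
wait_until_synced() {
  local max_tries
  local delay
  local i
  max_tries="${1:-30}"
  delay="${2:-2}"
  i=1
  while [ "$i" -le "$max_tries" ]; do
    if [ "$(query_synced)" = "t" ]; then
      echo "synced after $i check(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not synced after $max_tries checks" >&2
  return 1
}
```

A failover runbook could then chain the two steps, for example `wait_until_synced && sudo -u postgres pg_ctl promote -D DATA_DIR`, so promotion only runs once the check passes.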

Pre-existing typo flagged by CodeRabbit while reviewing the heading-rename
ref-target updates from the previous commit. Three occurrences in the visible
link label of cross-references to the HA deployment docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Newest code from mattermost has been published to preview environment for Git SHA e413b59

Development

Successfully merging this pull request may close these issues.

Re-organize "High availability cluster-based deployment" docs page

6 participants