Re-organize high availability cluster-based deployment docs by roberson-io · Pull Request #8812 · mattermost/docs

roberson-io · 2026-03-10T16:47:37Z

Restructure the HA deployment documentation to match admin workflow:

Add Preparation section with pre-deployment guidance
Add Deployment guide section with step-by-step instructions
Add Next steps section for scaling optimizations
Preserve Operations and maintenance section for advanced topics
Emphasize PostgreSQL hot_standby and hot_standby_feedback settings
Guide admins toward database configuration over config.json

Closes #8811

Co-authored-by: Michael Roberson <roberson-io@users.noreply.github.com>

github-actions · 2026-03-10T16:51:10Z

Newest code from mattermost has been published to preview environment for Git SHA 1bb1c44

github-actions · 2026-03-13T14:34:22Z

Newest code from mattermost has been published to preview environment for Git SHA e5b2253

coderabbitai · 2026-03-13T14:39:34Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0725dbb1-7ca4-400b-8916-7daa5b84d902

📥 Commits

Reviewing files that changed from the base of the PR and between 23d46a2 and e413b59.

📒 Files selected for processing (1)

source/administration-guide/configure/environment-configuration-settings.rst

✅ Files skipped from review due to trivial changes (1)

source/administration-guide/configure/environment-configuration-settings.rst

📝 Walkthrough

Reorganizes the high-availability cluster deployment guide with a new Preparation section, prescriptive deployment steps, expanded proxy/storage/database/upgrade/troubleshooting content, and several cross-reference anchor fixes across related docs.

Changes

High-Availability Documentation Restructure and Cross-Link Fixes

| Layer / File(s) | Summary |
|---|---|
| Preparation and full deployment workflow `source/administration-guide/scale/high-availability-cluster-based-deployment.rst` | Adds Preparation and a prescriptive multi-step HA deployment workflow covering Mattermost server provisioning (Kubernetes vs non-Kubernetes), systemd limits, sysctl/network tuning, time sync, per-node pre-clustering checks, cluster settings via `mmctl`, shared storage, and DB replica guidance. |
| Job server and plugins subsection `source/administration-guide/scale/high-availability-cluster-based-deployment.rst` | Adds Job server section and scheduling guidance requiring `JobSettings.RunScheduler=true`, with `mmctl` verification/set commands; introduces Plugins and High Availability subsection header. |
| CLI, configuration update workflows, and rolling/server updates `source/administration-guide/scale/high-availability-cluster-based-deployment.rst` | Replaces CLI guidance (CLI runs on a single node; recommend `mmctl`), separates config update paths (`mmctl` vs `config.json`), warns about System Console vs `config.json` divergence, and adds rolling dot-release update and interruption criteria for upgrades. |
| NGINX sequencing and continuous-operation constraints `source/administration-guide/scale/high-availability-cluster-based-deployment.rst` | Adds explicit NGINX stop/upgrade/start/restart sequencing for service-interruption upgrades, documents applying `config.json` from backups, and adds gossip/protocol single-protocol requirement and continuous-operation constraints. |
| FAQ and troubleshooting revisions `source/administration-guide/scale/high-availability-cluster-based-deployment.rst` | Adjusts FAQ heading, adds "Capture high availability troubleshooting data" heading, and rewrites troubleshooting entries for continuous config refresh and message posting/reload behaviours, splitting solutions by DB-backed config vs `config.json`. |
| Cross-reference and small doc anchor fixes `source/administration-guide/configure/environment-configuration-settings.rst`, `source/administration-guide/manage/statistics.rst`, `source/administration-guide/scale/backing-storage-benchmarks.rst`, `source/administration-guide/upgrade/enterprise-roll-out-checklist.rst` | Corrects Sphinx `:ref:` targets and anchor links to point to the updated high-availability anchors for AWS RDS guidance, Replica DB Conns, backing-storage testing notes, and the enterprise roll-out proxy-server checklist link. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the main change: reorganizing the high-availability cluster-based deployment documentation. |
| Description check | ✅ Passed | The description is directly related to the changeset, outlining the restructured sections and documentation improvements made. |
| Linked Issues check | ✅ Passed | The pull request successfully addresses all key objectives from issue `#8811`: creates Preparation, Deployment guide, and Next steps sections; preserves Operations and maintenance, FAQ, and Troubleshooting sections; emphasizes PostgreSQL hot_standby settings; and guides toward database configuration. |
| Out of Scope Changes check | ✅ Passed | All changes are within scope. The PR includes the main restructuring of high-availability documentation and necessary cross-reference updates in related documentation files to maintain consistency. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

source/administration-guide/scale/high-availability-cluster-based-deployment.rst (2)
540-540: Defaulting NFS mounts to soft is risky for application data consistency.

The examples on Lines 540/548/563 use rw,soft,intr, while Line 573 only notes hard,intr as an alternative. For shared app data, soft can return I/O errors under transient network issues and may cause partial writes.
Suggested doc adjustment
- sudo mount -t nfs -o rw,soft,intr NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data
+ sudo mount -t nfs -o rw,hard,intr NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data
...
- NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data nfs rw,soft,intr 0 0
+ NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data nfs rw,hard,intr 0 0
Also applies to: 548-548, 563-563, 573-573
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@source/administration-guide/scale/high-availability-cluster-based-deployment.rst`
at line 540, The NFS mount examples use the risky option "rw,soft,intr" which
can cause I/O errors and partial writes; change the example mount options to use
"rw,hard,intr" (or show both options and explicitly recommend "hard,intr" for
application data) and add a brief explanatory note after the command that for
shared application data you should prefer hard mounts to avoid
transient-network-induced partial writes and data corruption; update every
occurrence of the example mount command (the string "sudo mount -t nfs -o
rw,soft,intr NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data") and add a
short caution sentence referencing "hard,intr" as the recommended setting.
709-709: Line 709 incorrectly claims RDS does not expose PostgreSQL configuration access.

Both Aurora and RDS PostgreSQL expose configuration through DB parameter groups and cluster parameter groups, allowing administrators to tune settings like memory allocation, replication parameters, and other PostgreSQL options. Reword to clarify these configuration mechanisms are available, rather than suggesting configuration is unavailable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@source/administration-guide/scale/high-availability-cluster-based-deployment.rst`
at line 709, Replace the incorrect sentence "Amazon RDS does not expose direct
PostgreSQL configuration access" with a clarified statement that both Amazon RDS
(for PostgreSQL) and Amazon Aurora expose configuration via DB parameter groups
and cluster parameter groups; mention that these parameter groups allow tuning
of memory, replication, and other PostgreSQL options and that monitoring should
still be done via CloudWatch and RDS Performance Insights (locate the sentence
containing "Amazon RDS does not expose direct PostgreSQL configuration access"
in high-availability-cluster-based-deployment.rst and update it to reference DB
parameter groups/cluster parameter groups and retained monitoring guidance).
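
For the NFS nitpick above, one way to find existing entries that still use `soft` is to scan `/etc/fstab`-style text. This is a minimal sketch for illustration only: the function name `find_soft_nfs_mounts` and the second sample entry are assumptions and are not part of the PR.

```python
def find_soft_nfs_mounts(fstab_text):
    """Return mount points of NFS entries whose options include 'soft'."""
    risky = []
    for line in fstab_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # fstab fields: device, mount point, fs type, options, dump, pass.
        if len(fields) < 4:
            continue
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type in ("nfs", "nfs4") and "soft" in options.split(","):
            risky.append(mount_point)
    return risky


# Hypothetical sample: the first entry mirrors the example quoted in the review.
sample = """\
# shared Mattermost data directory
NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data nfs rw,soft,intr 0 0
NFS_SERVER_IP:/mnt/other /mnt/other nfs rw,hard 0 0
"""
print(find_soft_nfs_mounts(sample))
```

The check splits the options field on commas, so it matches the `soft` option exactly rather than any option that merely contains that substring.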

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@source/administration-guide/scale/high-availability-cluster-based-deployment.rst`:
- Around line 360-383: Clarify the health-check paragraph to explain the
difference between passive peer quarantining (controlled by max_fails=0) and
per-request active failover (controlled by proxy_next_upstream and similar
settings): explicitly state that setting max_fails=0 disables passive marking of
an upstream as unavailable, while NGINX can still perform per-request failover
using proxy_next_upstream so requests may be retried to other backends even when
peers are not quarantined; mention the Mattermost API ping endpoint
(http://SERVER_IP:8065/api/v4/system/ping) as a way to monitor server health but
note it does not change the distinction between passive and active failure
handling.
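
To illustrate the external active health check that this comment points toward, the sketch below polls the ping endpoint and reports which backends did not answer with HTTP 200. It is a hedged example, not Mattermost or NGINX tooling: the function names, the timeout, and the injectable `fetch` parameter are assumptions, and what to do with an unhealthy backend (for example, removing it from the upstream block) is left to the operator.

```python
import urllib.error
import urllib.request


def ping_status(base_url, timeout=2.0):
    """Return the HTTP status of the ping endpoint, or None if unreachable."""
    url = base_url.rstrip("/") + "/api/v4/system/ping"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 503.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


def unhealthy_backends(backends, fetch=ping_status):
    """Return the backends whose ping did not return HTTP 200."""
    return [backend for backend in backends if fetch(backend) != 200]
```

Passing `fetch` explicitly makes the decision logic testable without a network. In real use, `unhealthy_backends(["http://SERVER_IP:8065"])` would issue the HTTP requests itself.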

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 741ca93 and e5b2253.

📒 Files selected for processing (1)

source/administration-guide/scale/high-availability-cluster-based-deployment.rst

Combs7th · 2026-04-23T23:23:41Z

@neillcollie - Would you be able to help give this a technical review when you're able?

neillcollie

Massive uplift compared with previous HA docs.
Recently implemented an HA env, and my notes would have been to include additional NFS details. This PR now comprehensively covers NFS.

Resolved conflict in source/administration-guide/scale/high-availability-cluster-based-deployment.rst: adopted master's corrected HA sysctl values from #8939 (tcp_rmem/tcp_wmem max 2500000, net.core.{r,w}mem_max 16777216, dropped rmem_default/wmem_default/tcp_mem) while keeping this branch's restructure into a sudo tee heredoc with the added vm.min_free_kbytes line.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-26T21:27:47Z

Newest code from mattermost has been published to preview environment for Git SHA 8a50cb5

coderabbitai

♻️ Duplicate comments (1)

source/administration-guide/scale/high-availability-cluster-based-deployment.rst (1)
385-386: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify health check behavior in relation to max_fails=0 setting.

Line 385 states "NGINX automatically stops routing traffic to backend servers that fail to respond," which may mislead readers given the max_fails=0 setting on lines 286-287. With max_fails=0, NGINX does not mark backends as unavailable based on failure count (passive quarantining is disabled), though per-request failover via proxy_next_upstream can still retry individual requests to other backends. Consider clarifying that the monitoring endpoint helps detect issues, but with max_fails=0, failed backends remain in the rotation unless removed manually or via external health checks.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@source/administration-guide/scale/high-availability-cluster-based-deployment.rst`
around lines 385 - 386, The guidance about NGINX removing unhealthy backends is
misleading given the existing upstream config using max_fails=0; update the
paragraph that mentions the Mattermost health endpoint
(http://SERVER_IP:8065/api/v4/system/ping) to clarify that with max_fails=0 the
upstream will not be passively marked down based on failure count, that
proxy_next_upstream can still retry individual requests, and recommend using
external active health checks or manual removal of nodes to take servers out of
rotation; reference the max_fails=0 and proxy_next_upstream settings so the
reader can correlate behavior with the config.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between e5b2253 and 8a50cb5.

📒 Files selected for processing (1)

source/administration-guide/scale/high-availability-cluster-based-deployment.rst

Copilot

Pull request overview

This PR reorganizes the Mattermost high availability cluster deployment documentation around the admin workflow, adding preparation, deployment, next steps, and operations/maintenance sections while expanding guidance for file storage, proxy, database replicas, and production PostgreSQL settings.

Changes:

Adds workflow-oriented HA preparation and deployment guidance.
Expands NGINX, S3/NFS storage, PostgreSQL replica, and failover documentation.
Moves operational topics such as cluster discovery, jobs, plugins, upgrades, FAQ, and troubleshooting under maintenance-oriented sections.

-1. Back up your Mattermost database and the file storage location. See the :doc:`backup </deployment-guide/backup-disaster-recovery>` documentation for details.
-2. Modify your NGINX setup to remove the server. For information about this, see :ref:`proxy server configuration <deployment-guide/server/setup-nginx-proxy:manage the nginx process>` documentation for details.
-3. Open **System Console > Environment > High Availability** to verify that all the machines remaining in the cluster are communicating as expected with green status indicators. If not, investigate the log files for any extra information.
+   - **Non-Kubernetes deployments:** Follow the :doc:`Deploy Mattermost on Linux </deployment-guide/server/deploy-mattermost-on-linux>` instructions to install the same version of Mattermost on each additional server.


+      sudo systemctl restart mattermost
+
+9. **Verify cluster communication:** Open **System Console > Environment > High Availability** to verify that each server in the cluster is communicating as expected with green status indicators. If not, investigate the log files for additional information.




-.. note::
+7. **Verify proxy functionality:** Test access through the proxy using your configured domain name and verify traffic is distributed across backend servers by checking Mattermost server logs.



+3. Updating the Mattermost configuration to point to the new storage
+4. Verifying that all files are accessible
+
+


+
+If you have non-standard (i.e. complex) network configurations, then you may need to use the :ref:`Override Hostname <administration-guide/configure/environment-configuration-settings:override hostname>` setting to help the cluster nodes discover each other. The cluster settings in the config are removed from the config file hash for this reason, meaning you can have slightly different cluster configuration settings in high availability mode. The Override Hostname is intended to be different for each clustered node if you need to force discovery.
+
+If ``UseIpAddress`` is set to ``true``, it attempts to obtain the IP address by searching for the first non-local IP address (non-loop-back, non-localunicast, non-localmulticast network interface). It enumerates the network interfaces using the built-in go function `net.InterfaceAddrs() <https://pkg.go.dev/net#InterfaceAddrs>`_. Otherwise it tries to get the hostname using the `os.Hostname() <https://pkg.go.dev/os#Hostname>`_ built-in go function.
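
The selection rule described in this paragraph can be sketched as follows. This is an illustration of the documented behavior only, not Mattermost's Go implementation: the function name is hypothetical, the input is a plain list of interface address strings rather than the result of `net.InterfaceAddrs()`, and all multicast addresses are excluded as a simplification.

```python
import ipaddress


def first_non_local_ip(addresses):
    """Return the first address that is not loopback, link-local, or multicast."""
    for raw in addresses:
        # Interface addresses often carry a prefix length, e.g. "10.0.0.5/24".
        ip = ipaddress.ip_interface(raw).ip
        if ip.is_loopback or ip.is_link_local or ip.is_multicast:
            continue
        return str(ip)
    # No suitable address found; the caller would fall back to the hostname.
    return None


print(first_non_local_ip(["127.0.0.1/8", "fe80::1/64", "10.0.0.5/24"]))
```

Because the first match wins, the order in which interfaces are enumerated matters on multi-homed hosts, which is one case where the Override Hostname setting mentioned above is useful.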


+
+   .. code-block:: text
+
+      NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data nfs rw,soft,intr 0 0


+
+   .. code-block:: text
+
+      192.168.1.100:/mnt/mattermost-data /opt/mattermost/data nfs rw,soft,intr 0 0


+5. **Configure TLS:** For production deployments, configure TLS on your NGINX proxy. See :doc:`Set up TLS </deployment-guide/server/setup-tls>` for detailed instructions on configuring TLS with NGINX. You can either use Let's Encrypt for automatic certificate management or provide your own TLS certificates.

-Use the :ref:`read replica <administration-guide/configure/environment-configuration-settings:read replicas>` feature to scale the database. The Mattermost server can be set up to use one master database and one or more read replica databases.
+6. **Configure health checks:** NGINX automatically stops routing traffic to backend servers that fail to respond. You can monitor server health using the Mattermost API endpoint ``http://SERVER_IP:8065/api/v4/system/ping`` which returns ``Status 200`` for healthy servers.


+   .. code-block:: bash
+
+      # On the replica server, create base backup from primary
+      sudo -u postgres pg_basebackup -h PRIMARY_IP -D /var/lib/postgresql/data -U replication_user -P -v -R -X stream
+
+   The ``-R`` flag automatically creates the ``standby.signal`` file and configures replication settings.
+


+      # On the replica server
+      sudo -u postgres pg_ctl promote -D /var/lib/postgresql/data
+


- NFS mounts: switch shared data directory examples from rw,soft,intr to rw,hard to prevent partial writes / file corruption during transient NFS outages. Drop deprecated intr option (no-op since kernel 2.6.25). Invert the note to frame hard as the safe default and soft as a documented availability tradeoff.
- NGINX health checks: rewrite item 6 to accurately describe what max_fails=0 does (disables passive quarantining), note that per-request failover via proxy_next_upstream still works, and direct admins to external active health checks against /api/v4/system/ping.
- Fix broken :doc: target deploy-mattermost-on-linux -> deploy-linux.
- Section rename anchor fixes: update the URL fragment in enterprise-roll-out-checklist.rst (#proxy-server-configuration -> #proxy-server) and the :ref: targets in backing-storage-benchmarks.rst, environment-configuration-settings.rst, and manage/statistics.rst to match the new "File storage" and "Database" headings.
- PostgreSQL replica/promote commands: replace the hardcoded /var/lib/postgresql/data with a DATA_DIR placeholder plus inline platform-specific paths (Ubuntu/Debian: /var/lib/postgresql/{version}/main, RHEL/CentOS: /var/lib/pgsql/{version}/data). Note that pg_basebackup requires the target data directory to be empty.
- Fix UseIpAddress -> UseIPAddress casing for consistency with the canonical ClusterSettings.UseIPAddress key used elsewhere in the doc.
- Correct the Amazon RDS / Aurora PostgreSQL configuration claim: both expose configuration through DB parameter groups (RDS) and DB cluster parameter groups (Aurora).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
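
As an illustration of the DATA_DIR placeholder described in this commit, the sketch below expands the two platform-specific paths and assembles the `pg_basebackup` command quoted in the diff. The platform keys `debian` and `rhel`, the function name, and the sample IP address are assumptions made for this example; only the path templates and command flags come from the PR.

```python
import shlex

# Data directory templates taken from the commit message above.
DATA_DIR_TEMPLATES = {
    "debian": "/var/lib/postgresql/{version}/main",  # Ubuntu/Debian
    "rhel": "/var/lib/pgsql/{version}/data",         # RHEL/CentOS
}


def basebackup_command(platform, version, primary_ip, user="replication_user"):
    """Build the pg_basebackup command line for the given platform."""
    data_dir = DATA_DIR_TEMPLATES[platform].format(version=version)
    argv = [
        "sudo", "-u", "postgres", "pg_basebackup",
        "-h", primary_ip, "-D", data_dir, "-U", user,
        "-P", "-v", "-R", "-X", "stream",
    ]
    # shlex.join quotes any argument that needs it for safe shell use.
    return shlex.join(argv)


print(basebackup_command("debian", 16, "10.0.0.10"))
```

The function only builds the command string. As the commit notes, the target data directory must be empty before the command is actually run.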

github-actions · 2026-05-26T22:01:11Z

Newest code from mattermost has been published to preview environment for Git SHA 23d46a2

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (2)

source/administration-guide/scale/high-availability-cluster-based-deployment.rst (2)
385-386: ⚡ Quick win

Simplify the health check explanation for novice administrators.

The current explanation uses technical jargon ("passive quarantining", "proxy_next_upstream") without context, which may confuse administrators who are not deeply familiar with NGINX internals. The sentence is also dense, combining multiple concepts.

As per coding guidelines, evaluate documentation through the lens of Novice Nate—a novice IT Administrator with 1-2 years of experience who wants to understand commands before running them. Define technical terms briefly inline on first use and explain the 'why' behind settings when it matters for reader confidence.
📝 Suggested clearer explanation
-6. **Configure health checks:** The upstream block above sets ``max_fails=0``, which disables NGINX's passive quarantining - backends are not marked unavailable based on failed-request counts. Individual failed requests are still retried against other backends via NGINX's default ``proxy_next_upstream`` behavior. To detect failed servers and remove them from rotation, monitor the Mattermost API endpoint ``http://SERVER_IP:8065/api/v4/system/ping`` (which returns ``Status 200`` for healthy servers) using an active health check at your load balancer or monitoring system.
+6. **Configure health checks:** The upstream block above sets ``max_fails=0``, which tells NGINX to keep sending requests to all backend servers even if some requests fail. This is appropriate for high availability because Mattermost servers can handle brief errors without needing to be removed from rotation. If a request to one server fails, NGINX automatically retries it on another server in the cluster. To proactively monitor server health and detect servers that are truly down, configure your load balancer or monitoring system to check the Mattermost API health endpoint ``http://SERVER_IP:8065/api/v4/system/ping`` (which returns ``Status 200`` when healthy).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@source/administration-guide/scale/high-availability-cluster-based-deployment.rst`
around lines 385 - 386, Rewrite the paragraph that explains health checks to be
simpler and beginner-friendly: briefly define the setting max_fails=0 (disables
passive failure-based removal) and proxy_next_upstream (NGINX will try another
backend on request failure), state why those defaults matter (they prevent
automatic removal of unhealthy backends), and then give clear actionable
guidance to use an active health check against the endpoint
http://SERVER_IP:8065/api/v4/system/ping (returns HTTP 200 when healthy) so load
balancers or monitoring systems can detect and remove failed servers; keep
sentences short, avoid jargon, and define each technical term inline on first
use.
711-711: ⚡ Quick win

Provide more specific guidance or links for Aurora parameter group tuning.

This paragraph mentions parameter groups and tuning but doesn't provide actionable guidance on when tuning is needed or what to monitor for. For Novice Nate, "tune memory, replication, and other settings" is too vague to act on, and for Veteran Vince, "generally well-tuned" doesn't specify which workload characteristics might require tuning.

As per coding guidelines, explain the 'why' behind settings when it matters for reader confidence, and provide links to relevant documentation when referencing external systems.
📋 Suggested improvement

Consider adding:

A link to AWS documentation on RDS parameter groups

Brief guidance on specific metrics to monitor (e.g., CPU, connection count, query latency)

When to consider tuning (e.g., "if you see high CPU utilization or query timeouts")

Example:
Amazon RDS and Aurora expose PostgreSQL configuration through `DB parameter groups <https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_WorkingWithParamGroups.html>`_. Aurora's defaults are well-tuned for most workloads, but you should monitor CPU utilization, connection count, and query latency using Amazon CloudWatch and RDS Performance Insights. If you consistently see high CPU usage (>80%) or connection saturation, consider tuning ``max_connections`` or memory-related parameters in the parameter group.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@source/administration-guide/scale/high-availability-cluster-based-deployment.rst`
at line 711, Update the paragraph about Aurora/RDS parameter groups to include a
link to AWS parameter group docs, list specific metrics to monitor (e.g., CPU
utilization, connection count, query latency, read/write IOPS, replica lag) and
give concrete thresholds/triggers for tuning (e.g., CPU >80%, sustained high
query latency, connection saturation or replica lag) and mention common
parameters to adjust (e.g., max_connections, work_mem, shared_buffers,
wal_level) so readers know what to monitor and when to edit the DB parameter
group; reference the existing sentence about "DB parameter groups" / "Aurora's
defaults" to replace the vague phrase with these actionable items and the AWS
docs link.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@source/administration-guide/configure/environment-configuration-settings.rst`:
- Line 789: Update the misspelled visible link text "high availablility" to
"high availability" in the ref link(s) where the phrase appears; locate the ref
role usages like ":ref:`high availablility database configuration
<administration-guide/scale/high-availability-cluster-based-deployment:database>`"
and correct the visible label to ":ref:`high availability database configuration
<administration-guide/scale/high-availability-cluster-based-deployment:database>`"
in all three occurrences (around lines referenced) so the visible link text and
searches use the correct spelling.

In
`@source/administration-guide/scale/high-availability-cluster-based-deployment.rst`:
- Around line 990-994: Add a verification step before the promotion command to
ensure the replica is fully caught up: instruct the reader to run a replication
status check on the replica (using pg_last_wal_receive_lsn(),
pg_last_wal_replay_lsn() and comparing them, e.g., checking that
pg_last_wal_replay_lsn() = pg_last_wal_receive_lsn() evaluates to true) and wait
until synced is true before running the promotion command; then keep the
existing promotion instruction that runs sudo -u postgres pg_ctl promote -D
DATA_DIR and add an expected-success note indicating the replica should become
primary and accept writes.
- Around line 761-769: The step that runs pg_basebackup lacks an explicit
warning and prerequisites; update the "Create the replica" section to (1) add an
important notice that the operation will replace the replica's data directory
and that backups should be taken, (2) list prerequisites (stop PostgreSQL on the
replica, verify network connectivity to PRIMARY_IP, ensure DATA_DIR is empty,
and have a backup of any existing data), (3) break the procedure into atomic
numbered steps: stop PostgreSQL (e.g. systemctl stop postgresql), confirm/clear
DATA_DIR, then run the pg_basebackup command shown (referencing pg_basebackup,
PRIMARY_IP, DATA_DIR, replication_user, and the -R flag), and (4) add a short
note explaining that -R creates standby.signal and configures replication.
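
The pre-promotion check requested for lines 990-994 compares two LSN values read on the replica, for example from `SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();`. A minimal sketch of that comparison follows. The helper names are hypothetical, and the LSN strings are assumed to have been fetched already by whatever database client is in use.

```python
def parse_lsn(lsn):
    """Convert a PostgreSQL LSN such as '16/B374D848' to an integer."""
    # An LSN is printed as two hexadecimal numbers: high 32 bits / low 32 bits.
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replica_caught_up(receive_lsn, replay_lsn):
    """True when every WAL record the replica received has been replayed."""
    return parse_lsn(replay_lsn) >= parse_lsn(receive_lsn)


print(replica_caught_up("16/B374D848", "16/B374D848"))
```

This confirms only that the replica has replayed everything it has received. It does not prove that the primary has no WAL left unsent, so it is best run after writes to the primary have stopped.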


ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 8a50cb5 and 23d46a2.

📒 Files selected for processing (5)

source/administration-guide/configure/environment-configuration-settings.rst
source/administration-guide/manage/statistics.rst
source/administration-guide/scale/backing-storage-benchmarks.rst
source/administration-guide/scale/high-availability-cluster-based-deployment.rst
source/administration-guide/upgrade/enterprise-roll-out-checklist.rst

✅ Files skipped from review due to trivial changes (2)

source/administration-guide/upgrade/enterprise-roll-out-checklist.rst
source/administration-guide/manage/statistics.rst

coderabbitai · 2026-05-26T22:05:55Z

+4. **Create the replica** using ``pg_basebackup``. Stop PostgreSQL on the replica and ensure the data directory is empty before running this command.
+
+   .. code-block:: bash
+
+      # On the replica server, create base backup from primary.
+      # Replace DATA_DIR with the platform-specific data directory:
+      #   Ubuntu/Debian: /var/lib/postgresql/{version}/main
+      #   RHEL/CentOS: /var/lib/pgsql/{version}/data
+      sudo -u postgres pg_basebackup -h PRIMARY_IP -D DATA_DIR -U replication_user -P -v -R -X stream


🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add explicit backup warning and clarify prerequisites for replica creation.

Creating a replica with pg_basebackup is a destructive operation that requires stopping PostgreSQL and ensuring an empty data directory. The current documentation mentions this casually but doesn't emphasize the risk or provide a clear prerequisite checklist.

As per coding guidelines, list prerequisites clearly, use numbered atomic steps for procedures, and explain the 'why' behind commands when it matters for reader confidence. For a destructive operation like this, Novice Nate needs explicit warnings and a clear sequence.

⚠️ Suggested improvement

Consider restructuring step 4 to make prerequisites explicit:

4. **Create the replica** using ``pg_basebackup``.

   .. important::

      This operation will completely replace the replica's data directory. Ensure you have:

      - A backup of any existing data on the replica server
      - Verified network connectivity from the replica to the primary server
      - Stopped PostgreSQL on the replica server

   On the replica server, stop PostgreSQL and clear the data directory:

   .. code-block:: bash

      sudo systemctl stop postgresql
      # Ensure the data directory is empty
      #   Ubuntu/Debian: /var/lib/postgresql/{version}/main
      #   RHEL/CentOS: /var/lib/pgsql/{version}/data

   Create the base backup from the primary:

   .. code-block:: bash

      # Replace DATA_DIR with your platform-specific data directory
      # Replace PRIMARY_IP with your primary server's IP address
      sudo -u postgres pg_basebackup -h PRIMARY_IP -D DATA_DIR -U replication_user -P -v -R -X stream

   The ``-R`` flag automatically creates the ``standby.signal`` file and configures replication settings.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@source/administration-guide/scale/high-availability-cluster-based-deployment.rst` around lines 761 - 769, The step that runs pg_basebackup lacks an explicit warning and prerequisites; update the "Create the replica" section to (1) add an important notice that the operation will replace the replica's data directory and that backups should be taken, (2) list prerequisites (stop PostgreSQL on the replica, verify network connectivity to PRIMARY_IP, ensure DATA_DIR is empty, and have a backup of any existing data), (3) break the procedure into atomic numbered steps: stop PostgreSQL (e.g. systemctl stop postgresql), confirm/clear DATA_DIR, then run the pg_basebackup command shown (referencing pg_basebackup, PRIMARY_IP, DATA_DIR, replication_user, and the -R flag), and (4) add a short note explaining that -R creates standby.signal and configures replication.
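
The atomic sequence requested above (stop PostgreSQL, confirm the data directory is empty, then run `pg_basebackup`) might look like the sketch below. The helper names `dir_is_empty` and `create_replica` are hypothetical, and the service name `postgresql` is taken from the suggestion and may differ by platform. The emptiness check deliberately fails when the directory is missing or unreadable, so the script refuses to continue rather than guessing.

```shell
#!/usr/bin/env bash
# Sketch of the replica-creation sequence; run on the replica server.
# Helper names are hypothetical. PRIMARY_IP and DATA_DIR are placeholders.

# Succeeds only when the directory is readable and has no entries,
# including dotfiles. A missing or unreadable directory counts as failure.
dir_is_empty() {
  local entries
  entries="$(ls -A "$1" 2>/dev/null)" || return 1
  [ -z "$entries" ]
}

create_replica() {
  local primary_ip="$1" data_dir="$2"

  # 1. Stop PostgreSQL on the replica.
  sudo systemctl stop postgresql

  # 2. Refuse to continue unless the data directory is verifiably empty.
  if ! dir_is_empty "$data_dir"; then
    echo "Refusing to continue: $data_dir is missing, unreadable, or not empty" >&2
    return 1
  fi

  # 3. Take the base backup; -R writes standby.signal and replication settings.
  sudo -u postgres pg_basebackup -h "$primary_ip" -D "$data_dir" \
    -U replication_user -P -v -R -X stream
}

# Example usage (placeholders, run as a user who can read DATA_DIR):
#   create_replica PRIMARY_IP DATA_DIR
```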

coderabbitai · 2026-05-26T22:05:55Z

+      # On the replica server.
+      # Replace DATA_DIR with the platform-specific data directory:
+      #   Ubuntu/Debian: /var/lib/postgresql/{version}/main
+      #   RHEL/CentOS: /var/lib/pgsql/{version}/data
+      sudo -u postgres pg_ctl promote -D DATA_DIR


🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add verification step before promoting replica to primary.

Promoting a replica to primary without verifying it has fully caught up with replication can result in data loss. The current instructions don't include a verification step to check replication lag before promotion.

As per coding guidelines, evaluate documentation through the lens of Novice Nate and include expected output or success checks after key steps to help readers verify progress. For a critical operation like database failover, Veteran Vince would expect verification guidance.

🔍 Suggested verification step

Add a verification step before the promotion command:

1. **Verify replica is synchronized** before promoting:

   On the replica server, check replication status:

   .. code-block:: sql

      SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(),
             pg_last_wal_replay_lsn() = pg_last_wal_receive_lsn() AS synced;

   Wait until ``synced`` returns ``true``, indicating the replica has applied all received WAL data.

2. **Promote the replica** to primary:

   For self-managed PostgreSQL:

   .. code-block:: bash

      # On the replica server.
      # Replace DATA_DIR with the platform-specific data directory:
      #   Ubuntu/Debian: /var/lib/postgresql/{version}/main
      #   RHEL/CentOS: /var/lib/pgsql/{version}/data
      sudo -u postgres pg_ctl promote -D DATA_DIR

   For Amazon RDS: Use the AWS Console or CLI to promote the read replica to a standalone instance.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@source/administration-guide/scale/high-availability-cluster-based-deployment.rst` around lines 990 - 994, Add a verification step before the promotion command to ensure the replica is fully caught up: instruct the reader to run a replication status check on the replica (using pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn() and comparing them, e.g., checking that pg_last_wal_replay_lsn() = pg_last_wal_receive_lsn() evaluates to true) and wait until synced is true before running the promotion command; then keep the existing promotion instruction that runs sudo -u postgres pg_ctl promote -D DATA_DIR and add an expected-success note indicating the replica should become primary and accept writes.
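
One way to automate the "wait until synced" check before promotion is sketched below. The names `SYNC_QUERY`, `is_synced`, and `wait_until_synced`, the 30-attempt limit, and the 2-second interval are all assumptions made for illustration. Note that the comparison only confirms that received WAL has been replayed. If the replica never streamed WAL, the comparison is NULL, which `psql -At` prints as an empty string and the sketch treats as not synced.

```shell
#!/usr/bin/env bash
# Sketch: wait for the replica to replay all received WAL before promoting.
# Names, attempt limit, and sleep interval are hypothetical.

SYNC_QUERY="SELECT pg_last_wal_replay_lsn() = pg_last_wal_receive_lsn();"

# psql -At prints a boolean as "t" or "f" and NULL as an empty string.
is_synced() {
  [ "$1" = "t" ]
}

# Poll the replica until it reports synced or the attempts run out.
wait_until_synced() {
  local attempts="${1:-30}" result
  while [ "$attempts" -gt 0 ]; do
    result="$(sudo -u postgres psql -Atc "$SYNC_QUERY")" || return 1
    if is_synced "$result"; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 2
  done
  return 1
}

# Example usage on the replica server (DATA_DIR is a placeholder):
#   wait_until_synced && sudo -u postgres pg_ctl promote -D DATA_DIR
```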

Pre-existing typo flagged by CodeRabbit while reviewing the heading-rename ref-target updates from the previous commit. Three occurrences in the visible link label of cross-references to the HA deployment docs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-26T22:19:25Z

Newest code from mattermost has been published to preview environment for Git SHA e413b59

roberson-io requested a review from neillcollie March 10, 2026 17:17

Merge branch 'master' into claude/issue-8811-20260310-1635

e5b2253

coderabbitai Bot requested changes Mar 13, 2026

View reviewed changes

Comment thread source/administration-guide/scale/high-availability-cluster-based-deployment.rst Outdated

neillcollie approved these changes May 6, 2026

View reviewed changes

esethna added the Guidance label May 6, 2026

Copilot AI review requested due to automatic review settings May 26, 2026 21:23

Copilot started reviewing on behalf of roberson-io May 26, 2026 21:24 View session

coderabbitai Bot reviewed May 26, 2026

View reviewed changes

Copilot AI reviewed May 26, 2026

View reviewed changes

coderabbitai Bot requested changes May 26, 2026

View reviewed changes

		sudo systemctl restart mattermost

		9. Verify cluster communication: Open System Console > Environment > High Availability to verify that each server in the cluster is communicating as expected with green status indicators. If not, investigate the log files for additional information.


		.. note::
		7. Verify proxy functionality: Test access through the proxy using your configured domain name and verify traffic is distributed across backend servers by checking Mattermost server logs.

		3. Updating the Mattermost configuration to point to the new storage
		4. Verifying that all files are accessible


		If you have non-standard (i.e. complex) network configurations, then you may need to use the :ref:`Override Hostname <administration-guide/configure/environment-configuration-settings:override hostname>` setting to help the cluster nodes discover each other. The cluster settings in the config are removed from the config file hash for this reason, meaning you can have slightly different cluster configuration settings in high availability mode. The Override Hostname is intended to be different for each clustered node if you need to force discovery.

		If ``UseIpAddress`` is set to ``true``, it attempts to obtain the IP address by searching for the first non-local IP address (non-loop-back, non-localunicast, non-localmulticast network interface). It enumerates the network interfaces using the built-in go function `net.InterfaceAddrs() <https://pkg.go.dev/net#InterfaceAddrs>`_. Otherwise it tries to get the hostname using the `os.Hostname() <https://pkg.go.dev/os#Hostname>`_ built-in go function.


		.. code-block:: text

		NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data nfs rw,soft,intr 0 0


		.. code-block:: text

		192.168.1.100:/mnt/mattermost-data /opt/mattermost/data nfs rw,soft,intr 0 0

		# On the replica server
		sudo -u postgres pg_ctl promote -D /var/lib/postgresql/data

Combs7th commented Apr 23, 2026

6 participants
