Feature/external deployment replication#109
Conversation
this enables the required configuration to use a deployment of this release as a replica source or target. to enable gtid based replication, the server_ids across all nodes nned to be unique to achieve this the templates will render server_id using the bosh provided `spec.index` which should be unique across all instances. In case the rendered instance is a replica source the id will start counting from the lowest possible value ( which is `1` because setting server_id to `0` deactivates replication for the node.. ). in case of rendering a replica target, the indexing will start at the highest possible value, this should ensure that the server_id across each part of the replica setup do not overlap.
- handle initial dump - handle refresh dumps when we find that new tables are added ( is there a better way? just the single tables?) - log replication status and debug info if found that something is wrong
- more explicit logging around the current state - add post start to determine if the bpm process succesfully started up by parsing it's log output fly-by: - fix double `--defaults-file` param in `source_dump` var.
|
Thanks! I will try to take a look a this and provide some feedback by Monday. |
c85066e to
1f0db8a
Compare
fixes:
```
1) db_init template specifying users via various spec properties generates a valid db_init file
Failure/Error: expect(rendered_template).to eq File.read(File.join(dir, "db_init_all_features"))
expected: "SET @@session.sql_log_bin = off;\n\nDROP USER IF EXISTS 'root'@'localhost';\nDROP USER IF EXISTS 'ro... PROXY ON ''@'' TO 'special-admin-user'@'%' WITH GRANT OPTION;\n\nSET @@session.sql_log_bin = on;\n"
got: "SET @@session.sql_log_bin = off;\n\nDROP USER IF EXISTS 'root'@'localhost';\nDROP USER IF EXISTS 'ro... PROXY ON ''@'' TO 'special-admin-user'@'%' WITH GRANT OPTION;\n\nSET @@session.sql_log_bin = on;\n"
(compared using ==)
Diff:
@@ -12,6 +12,7 @@
DROP USER IF EXISTS 'root'@'%';
DROP USER IF EXISTS 'roadmin'@'%';
+
CREATE USER IF NOT EXISTS 'mysql-backup'@'localhost';
ALTER USER 'mysql-backup'@'localhost' IDENTIFIED WITH caching_sha2_password BY 'secret-backup-pw';
GRANT RELOAD, LOCK TABLES, REPLICATION SLAVE, REPLICATION CLIENT, /*!80001 BACKUP_ADMIN,*/ PROCESS ON *.* to 'mysql-backup'@'localhost';
# ./spec/pxc-mysql/db_init_spec.rb:147:in `block (3 levels) in <top (required)>'
```
1f0db8a to
a83d48a
Compare
abg
left a comment
There was a problem hiding this comment.
I reviewed this and left a trail of comments.
Beyond the concerns already noted, I would like to see some answer for Galera <-> Galera replication - even if that's simply "replica must only be a single node for now". I.e. I think it would be fine for a template in the pxc-replicator job to abort if the numbers of MySQL nodes != 1. Robustly handling cluster to cluster async replication could be interesting too, but would be much more work.
I would also encourage pxc-replicator to be written in a Go module that is more robust and easier to test and maintain than the current bash scripting.
| log "resyncing with full backup" | ||
| log $($self_mysql <<< "STOP REPLICA;") | ||
| log $($self_mysql <<< "RESET REPLICA;") | ||
| log $($self_mysql <<< "RESET MASTER;") |
There was a problem hiding this comment.
Just a note: RESET MASTER is deprecated in v8.4 and removed in MySQL v9, replaced with RESET BINARY LOGS AND GTIDS.
I understand this is likely used here since pxc-release still supports MySQL v8.0 which still requires this syntax.
There was a problem hiding this comment.
yeah, what's the preferred way of handling this?
I noticed while reading docs that they're still cleaning up the weirdo naming..
I tried already using the replica/source naming as far as possible but didn't know if it would be idempotent :/
https://dev.mysql.com/doc/refman/8.4/en/replication-howto-repuser.html
it seems that the permissions still does not have an alternative?
There was a problem hiding this comment.
Yeah, v8.0 (and earlier) only support RESET MASTER and v8.4 only supports RESET BINARY LOGS ...
We've typically dealt with this kind of thing with version checks:
if mysql_major_minor() == "8.0": # <- Delete branch when we don't care about old version anymore
do_legacy_sql_stuff()
else:
do_latest_sql_stuff()
endThat could potentially be done at template rendering time by looking at the p('mysql_version') property.
MySQL v8.0 is EOL (although we'll probably get one more patch release from Percona for v8.0.46), so there is an argument to only support v8.4 and onwards. Maybe guard against using this feature on v8.0?
# In some pxc-replicator template
<%- if p('mysql_version') == "8.0" -%>
raise "Unsupported <helpful error message>"
<%- end -%>There was a problem hiding this comment.
hmm. How'd that play with cf-depl?
It seems that 8.4 is still experimental cloudfoundry/cf-deployment@3870c60
I'm not sure if there are any reasons why it's still setup to default to 8.0 outside of being cautious.
I'm going to ask around what the plans for runtime are in regards to switching the default. For now I guess I'll try to find all relevant places and sprinkle some ifs
There was a problem hiding this comment.
My thinking here was that it could be reasonable to not support this replication feature for EOL MySQL versions to simplify things.
However this exposes that, as pxc-release maintainers, we have been bad netizens of cf-d and have not updated the default MySQL version to v8.4. I'll try to follow-up on that.
I think it's okay to add a version check to support both the old and new syntax for now.
|
|
||
| while [[ $SECONDS -lt <%= p('monit_startup_timeout') %> ]]; do | ||
| if grep "$(date +%Y-%m-%dT%H:%M).*replication healthy" /var/vcap/sys/log/pxc-replicator/pxc-replicator.stdout.log > /dev/null; then | ||
| log "replication reported healthy within the laset minute" |
| SECONDS=0 | ||
|
|
||
| while [[ $SECONDS -lt <%= p('monit_startup_timeout') %> ]]; do | ||
| if grep "$(date +%Y-%m-%dT%H:%M).*replication healthy" /var/vcap/sys/log/pxc-replicator/pxc-replicator.stdout.log > /dev/null; then |
There was a problem hiding this comment.
This seems a little brittle. I would probably actually query the replica and validate the IO_Thread and SQL_Thread are in a healthy state.
some old deployments may want to opt to use the replication feature without having native access to it. one example would be cf-deployment. Operators could opt to manually onboard their existing cf deployment (with an old version of the pxc-release) into a replica. but that requires patching the permissions of the roadmin user. providing this source_admin creds will patch the roadmin user to be usable for a replication setup. this also reworks the config to enable providing manual values instead of forcing link usage. rename the link in the pxc-job and the replicator job to be less confusing. this should help disambiguate manual linking in bosh manifests.
the link naming changed, the ops files did not adjust for that.
it seems that was unnecessary as pointed out in the review. the server_id need to be different in between source and target, not different for each node within source or target.
|
first of all, sorry for the horrible PR ( I'm going to excuse myself by having had to learn a lot about pxc and mysql) and second thank you for the in depth review. I marked this as draft for now. I set up ci to deploy the current state of the release and iterate on it until this is ready. As far as: should this be go code? Yes :). I think I misunderstood you when I asked in slack about There's one thing that I already can answer, my use case is definitely running an external single instance replica to offload backups from a cf-deployment cluster. I'd like to avoid trying to create the My suggestion would be this, I'd like to keep the Again, thank you for the feedback. |
this adds the property parsing and link property parsing to optionally enable encryption for the mysql connections
remove it
Yep, apologies for any confusing feedback. Some of my comments were "nice-to-have-for-the-future" and I didn't make that super obvious. I wouldn't expect a perfect implementation in this initial PR. Targeting just cf-deployment use cases, mysqldump is probably fine and a single replica is fine. I think requiring TLS for the MySQL replication channel (and mysqldump / provisioning channel) would be ideal - and at least TLS being a supported configuration. However, mutual TLS and other advanced use cases are not necessary.
Yep, I think that's fine for now. A Go implementation doesn't need to be part of this PR, but I think it's a good end state for testability and alignment with other tooling in the release. |
these are required
the outupt was already cached in the line before.
this files missed adjustments after it got created. it was supposed to set the disk size..
the bpm yml was copied over from pxc-mysql and missed removing some of the mounted paths. - certificates folde is not required at all. the mylogin.cf contains a socket for the local connection, so we still need sys/run for now.
we just need read. - /var/vcap/sys/run/pxc-mysql for socket access - /var/vcap/jobs/pxc-mysql/config mylogin.cnf - /var/vcap/jobs/pxc-replicator/certificates certs for remote / source tls
0fd7129 to
02c0ae7
Compare
as the review pointed out this makes more sense.
the upstream instance could have reset the permissions workaround because it got redeployed. in that scenarion the user should be updated again.
|
@abg I think I addressed all of the comments, let me know if the current setup is any better. |
abg
left a comment
There was a problem hiding this comment.
Apologies for the slow feedback. This is improved, but not deployable yet.
I've left some more feedback.
|
|
||
|
|
||
| <%- if_p('tls.client.ca','tls.client.certificate','tls.client.private_key') do -%> | ||
| SOURCE_SSL_CONFIG="SOURCE_SSL_CA = '/var/vcap/jobs/pxc-replicator/certificates/server-ca.pem', SOURCE_SSL_CERT = '/var/vcap/jobs/pxc-replicator/certificates/client-cert.pem', SOURCE_SSL_KEY = '/var/vcap/jobs/pxc-replicator/certificates/client-key.pem'" |
There was a problem hiding this comment.
I didn't think deeply about this during my initial review pass so up-front apologies for not surfacing this earlier.
These certificates are going to be read by the mysqld process running in the pxc-mysql bpm context. That job currently would not have access to the pxc-replicator job certificates without adjusting the bpm.yml.
Rather than doing this, I think maybe just configure certificates on the pxc-mysql job. This should be the point of the pxc-mysql tls.client.{certificate,key} properties. If configured then you can just set SOURCE_SSL_{CERT,KEY} = /var/vcap/jobs/pxc-mysql/certificates/client-{cert,key}.pem
There was a problem hiding this comment.
SOURCE_SSL_{CERT,KEY} is only needed if you want x509 authentication (mTLS).
Not necessarily bad a security perspective but the replication user would need to be specifically configured for it.
i.e.
CREATE USER $replicationUSER REQUIRE X509 .... ;
-- or
CREATE USER $replicationUSER REQUIRE SUBJECT '/CN=some-client-identity' ...;As I mentioned in an earlier comment, mutual TLS is not a requirement for an MVP but maybe a nice-to-have. If you need it, it would add it as an option to user creation.
| log "resyncing with full backup" | ||
| log $($self_mysql <<< "STOP REPLICA;") | ||
| log $($self_mysql <<< "RESET REPLICA;") | ||
| log $($self_mysql <<< "RESET MASTER;") |
There was a problem hiding this comment.
My thinking here was that it could be reasonable to not support this replication feature for EOL MySQL versions to simplify things.
However this exposes that, as pxc-release maintainers, we have been bad netizens of cf-d and have not updated the default MySQL version to v8.4. I'll try to follow-up on that.
I think it's okay to add a version check to support both the old and new syntax for now.
| args: [] | ||
| env: | ||
| PATH: <%= path.join(":") %> | ||
| <%- if_p('mysql_backup_username', 'mysql_backup_password' ,'host', 'port') do |user,pass,host| -%> |
There was a problem hiding this comment.
Ideally we would apply the principal of least privilege here. Otherwise this will get flagged by security scans.
Replication doesn't need full SELECT access on tables from the primary - it just needs the privilege to stream binary logs.
Similarly the backup user just needs minimal privileges for a backup, but not necessarily binlog streaming privileges.
|
|
||
|
|
||
| <%- if_p('tls.client.ca','tls.client.certificate','tls.client.private_key') do -%> | ||
| SOURCE_SSL_CONFIG="SOURCE_SSL_CA = '/var/vcap/jobs/pxc-replicator/certificates/server-ca.pem', SOURCE_SSL_CERT = '/var/vcap/jobs/pxc-replicator/certificates/client-cert.pem', SOURCE_SSL_KEY = '/var/vcap/jobs/pxc-replicator/certificates/client-key.pem'" |
There was a problem hiding this comment.
Worth noting that best practice would also set SOURCE_SSL_VERIFY_SERVER_CERT (hostname verification).
This would require the primary's TLS certificate to contain the bosh dns server name (provided by the link here) Bosh docs.
In the context of cf-d this is usually sql-db.service.cf.internal but would require explicit configuration.
| ] | ||
| flags = "" | ||
| if_p('tls.client.ca','tls.client.certificate','tls.client.private_key') do | ||
| flags = "--ssl-ca=/var/vcap/jobs/pxc-replicator/certificates/server-ca.pem --ssl-cert=/var/vcap/jobs/pxc-replicator/certificates/client-cert.pem --ssl-key=/var/vcap/jobs/pxc-replicator/certificates/client-key.pem" |
There was a problem hiding this comment.
Any reason not to set these flags via the my.cnf? (i.e. source{,.admin}.mysql.cnf)
| optional: true | ||
|
|
||
| provides: | ||
| - name: source |
There was a problem hiding this comment.
Could we reuse the mysql link and just expose more properties on it? It already has port and mysql_version
If we are pulling all this from the "primary" deployments link anyway, I was imagining something like:
# primary deployment manifest
- instance_groups:
- name: mysql
jobs:
- name: pxc-mysql
...
provides: mysql: {as: mysql, shared: true}
# replica deployment manifest
- instance_groups:
- name: mysql
jobs:
- name: pxc-replicator
...
consumes: source: {from: mysql, deployment: ((primary_deployment_name)) }Plus or minus some bosh syntax cleanup here.
Maybe a second link isn't so bad, but it adds some more maintenance overhead where if we expose another mysql job property we have to maintain it in two places going forward.
There was a problem hiding this comment.
My reasoning was mostly about being specific what the link type is used for and not wanting to introduce additional properties to existing consumers to avoid confusion. I have no strong feelings about either direction.
| engine_config.enable_replication_target: | ||
| description: 'When configuring a replica, this will use spec.index to populate server_id on each node starting at the highest possible index. Any value set for server_id will be discarded by enabling this setting' | ||
| default: false | ||
| engine_config.enable_replication_source: | ||
| description: 'When configuring a replica, this will use spec.index to populate server_id on each node starting at the lowest possible index. Any value set for server_id will be discarded by enabling this setting' | ||
| default: false |
There was a problem hiding this comment.
I don't think these are used anymore and should be cleaned up.
| this is here for compatibility with older deployments of the pxc-release. | ||
| technically, the `roadmin` user is viable as a replication user, but it misses `REPLICATION SLAVE` privilege. | ||
| providing the source_admin_username & password will ensure that the roadmin will updated with the missing permission | ||
| monit_startup_timeout: |
There was a problem hiding this comment.
With bpm this is probably not super useful - the pid gets dropped as soon as the job starts and monit will treat the job as healthy.
But this made me wonder if we even really need monit for this use case. See my comments on the monit template.
| <%- end | ||
| allowed_remote_backup_hosts='localhost' | ||
|
|
||
| if p('engine_config.enable_replication_source') |
There was a problem hiding this comment.
Maybe it makes sense to just allow allowed_remote_backup_hosts to be configurable in some way. This seems similar to remote_admin_access and doesn't necessarily need to be tied to replication.
Hi there, I'd like to be able to do replica setups with the release. Here's my idea, let me know what you think of the approach.
Feature or Bug Description
it adds a jobs and a few config options to enable cross deployment replication for the pxc release.
Motivation
I want continuous backups of my running pxc cluster
Related Issue
If this PR was first opened as an issue, please provide the link to that issue here.