[06:49:30] jynus: let me know when I can stop m2 on db1217
[07:08:23] any time now
[07:08:34] \o/
[07:08:35] thanks
[07:11:43] So I got the error: pymysql.err.OperationalError: (2003, "Can't connect to MySQL server on 'db1164.eqiad.wmnet' ([SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1123))")
[07:11:54] at dbprov2001
[07:12:25] Uh???
[07:13:09] this was the backup test after migrating dbprov2001 to puppet 7
[07:15:00] the backup itself worked, it just couldn't gather metadata
[07:19:18] either that or the primary db switch, of course
[07:22:50] I'd guess it was the primary switch
[07:22:54] But can you retry?
[07:23:41] I don't think that will work - I prefer to retry with different tls options
[07:23:54] yeah, whatever you think is best
[07:24:23] even if it is connecting to the proxy with ssl-verify=no
[07:25:33] or using a different ca
[07:25:49] I will update the package to allow me to update that with puppet
[07:25:57] and see which of the 2 works
[08:05:01] marostegui: I am ccing you as I will be out on monday https://phabricator.wikimedia.org/T351491
[08:10:56] I need to restart, back in a sec
[08:46:15] what's the mysql config for the ca?
[08:46:40] ssl-ca ?
[08:56:04] hey sorry jynus I was focusing on something and didn't check irc
[08:56:45] jynus: yeah, ssl-ca, which was recently changed by jbond btw, but I am not sure if that could affect you
[08:56:57] it was done a few days ago, but it is true that mariadb wasn't restarted till yesterday
[08:57:12] we did some restarts as testing and nothing broke, so I wonder if it did break for backups
[08:57:39] jynus: https://gerrit.wikimedia.org/r/c/operations/puppet/+/968667
[08:58:04] yeah, it is not updated for backups
[08:58:43] but maybe if the host was reimaged, it is not compatible with the old ca and only with the new?
[08:58:51] yeah the host was reimaged
[09:13:17] yes, I confirm it was that
[09:13:48] the new package fixes that - and moves the config to puppet, so if something else changes, a new package will not be needed
[09:14:28] with this I think I will also be able to go back to using the proxy
[09:15:12] but I am not touching that for now, this is already a rushed change, will leave that for another time
[09:15:32] I will upload the new package, and update puppet to fix the most important issue, will refine later
[09:17:19] I will still accept a sanity check for the UBN at https://gitlab.wikimedia.org/repos/sre/wmfbackups/-/merge_requests/4 from arnaudb or Amir1
[09:25:08] Thanks jynus
[09:35:40] thanks arnaudb!
[09:35:47] my pleasure!
[09:36:09] I am going to pull with additional changes (manpage, etc)
[09:36:19] feel free to ping me back, I've subscribed
[09:36:35] jynus: just to confirm, that does sound like it's caused by cert changes.
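For context on the TLS options being weighed above (connecting via the proxy with ssl-verify=no, or pointing the client at a different CA), here is a minimal illustrative sketch with PyMySQL. It is not the wmfbackups metadata-gathering code: the host name comes from the error above and the CA bundle path from the fix confirmed further down; the function names and the exact no-verify knob are assumptions.

```python
# Illustrative sketch only -- not the wmfbackups implementation.
# Shows the two TLS options discussed above for a PyMySQL connection:
# verifying against an explicit CA bundle, or keeping TLS but skipping
# certificate verification (the ssl-verify=no option mentioned for the proxy).
import pymysql

HOST = "db1164.eqiad.wmnet"                           # host from the error above
CA_BUNDLE = "/etc/ssl/certs/wmf-ca-certificates.crt"  # CA bundle confirmed below


def connect_with_ca(user: str, password: str) -> pymysql.connections.Connection:
    """Verify the server certificate against an explicit CA bundle."""
    return pymysql.connect(
        host=HOST,
        user=user,
        password=password,
        ssl={"ca": CA_BUNDLE},
    )


def connect_without_verify(user: str, password: str) -> pymysql.connections.Connection:
    """Encrypt the connection but skip server certificate verification.

    The exact option varies by PyMySQL version; recent releases accept a
    boolean 'verify_mode' key in the ssl dict.
    """
    return pymysql.connect(
        host=HOST,
        user=user,
        password=password,
        ssl={"verify_mode": False},
    )
```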
[09:36:48] if you update to use /etc/ssl/certs/wmf-ca-certificates.crt for the ca it should fix things
[09:37:02] yes, I tested it and it worked
[09:37:20] on future changes mysql config will be on puppet so it will be easier to change
[09:37:29] good good, sorry I missed it, cool
[09:39:16] I will be deploying this so backups can continue during the weekend and until I come back on tuesday
[09:39:53] but please don't update backupmon1001 to puppet7 yet, as I will have to do a similar patch for backup checking
[09:40:07] I will do that next week
[09:40:21] jynus: ack, I'll leave everything until you give me the go-ahead
[10:27:35] marostegui: with some additional work, we will transition the hiera key at https://gerrit.wikimedia.org/r/c/operations/puppet/+/975231/1/hieradata/role/common/dbbackups/content.yaml to point to m1-master (so there is no longer special treatment); but I don't want to touch that now
[10:28:03] I will be waiting for the ongoing s1 backup test to succeed to merge and upload
[10:28:14] right yeah, we can merge once you are back
[10:28:31] I will review the whole patch though
[10:28:36] yeah, focusing for now on keeping them running as they were before
[10:54:13] I will be deploying the schema changes to backup1, expect some lag there (will downtime, but for orchestrator)
[11:10:52] alter tables fly when the host is dedicated and I can stop writes :-D
[12:14:43] It's working again: http://localhost:8000/dbbackups/jobs/25705/
[15:47:59] (PuppetFailure) firing: Puppet has failed on dbprov2004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:48:14] (PuppetFailure) firing: Puppet has failed on dbprov2004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[21:35:08] PROBLEM - MariaDB sustained replica lag on s4 on db1238 is CRITICAL: 13.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1238&var-port=9104
[21:35:24] PROBLEM - MariaDB sustained replica lag on s4 on db1221 is CRITICAL: 8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1221&var-port=9104
[21:35:42] PROBLEM - MariaDB sustained replica lag on s4 on db1155 is CRITICAL: 7.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1155&var-port=13314
[21:39:50] RECOVERY - MariaDB sustained replica lag on s4 on db1155 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1155&var-port=13314
[21:40:38] RECOVERY - MariaDB sustained replica lag on s4 on db1238 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1238&var-port=9104
[21:40:54] RECOVERY - MariaDB sustained replica lag on s4 on db1221 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1221&var-port=9104
[23:48:14] (PuppetFailure) firing: Puppet has failed on dbprov2004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
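To illustrate the "mysql config will be on puppet" plan mentioned earlier in the log (so a future CA change only needs a config update instead of a new package), here is a hedged sketch of reading TLS options from a puppet-managed, my.cnf-style option file and turning them into PyMySQL connection arguments. The file path, section and option names are assumptions for illustration; only the /etc/ssl/certs/wmf-ca-certificates.crt bundle comes from the log.

```python
# Hypothetical sketch: derive PyMySQL TLS arguments from a puppet-managed,
# my.cnf-style option file.  The path, section and option names below are
# illustrative assumptions, not the real wmfbackups configuration.
import configparser

import pymysql

CONFIG_FILE = "/etc/wmfbackups/my.cnf"  # hypothetical puppet-managed file


def tls_options(path: str = CONFIG_FILE, section: str = "client") -> dict:
    """Build the PyMySQL `ssl` argument from an option file.

    Assumed option names:
      ssl-ca      path to the CA bundle (e.g. /etc/ssl/certs/wmf-ca-certificates.crt)
      ssl-verify  0 or 1, whether to verify the server certificate
    """
    parser = configparser.ConfigParser()
    parser.read(path)
    if not parser.has_section(section):
        return {}
    opts = parser[section]
    ssl_opts = {}
    if opts.get("ssl-ca"):
        ssl_opts["ca"] = opts["ssl-ca"]
    if opts.get("ssl-verify", "1") == "0":
        # equivalent of the ssl-verify=no option discussed for the proxy;
        # recent PyMySQL accepts a boolean 'verify_mode' key in the ssl dict
        ssl_opts["verify_mode"] = False
    return ssl_opts


def connect(host: str, user: str, password: str) -> pymysql.connections.Connection:
    """Connect using whatever TLS options the config file provides."""
    return pymysql.connect(host=host, user=user, password=password,
                           ssl=tls_options() or None)
```

With the TLS settings in a file that puppet templates, a CA rotation like the one above would only require a puppet change, not a new package build.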