[07:24:13] hi DBAs, when checking the decom for the old IDP nodes I realised we probably have a stale mysql grant: https://github.com/wikimedia/operations-puppet/blob/production/modules/role/templates/mariadb/grants/clouddb.sql.erb#L36
[07:24:24] that IP is currently held by idp2002
[07:24:41] I'd just make a patch to drop it from the grants file, or am I missing something here?
[07:26:01] that looks outdated in general
[07:36:28] we don't have grant automation yet so you'll have to drop those yourself if they are still in place. but the patch is more than welcome!
[07:40:33] erratum: we* will drop those :D sorry for this
[07:45:08] after some discussion with Simon this is now clarified: this was for the IDPs to be able to retrieve U2F records (which are now obsoleted by webauthn), so these can in fact legitimately go away
[07:46:35] so this was probably just a stale comment which still referred to labspuppet via copy&pasta
[09:28:54] I've accidentally depooled db1165, it's repooling
[09:34:52] db2101 crashed?
[09:36:15] looks like it rebooted
[09:36:31] has*
[09:37:04] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-mysqld-exporter.service on db2202:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:42:11] arnaudb: "The server could not be powered on or a server critical error occurred"
[09:43:41] Uncorrectable Machine Check Exception + DIMM Failure - Uncorrectable Memory Error (Processor 2, DIMM 9)
[09:43:55] oof indeed, seems like a good reason to crash
[09:44:49] we will decommission it
[09:45:22] I just set up db2201
[09:46:22] https://phabricator.wikimedia.org/T362311
[10:00:29] I forgot to upgrade xtrabackup (wmf-mariadb106) on dbprov2004 to 10.6.17, so the prepare failed
[11:51:04] moritzm: I've just discussed with taavi and they're telling me that this can probably be dropped on galera, can you confirm it's obsolete and OK to be discarded?
[11:55:35] no objections!
[11:56:00] thanks :)
[12:45:45] PROBLEM - MariaDB sustained replica lag on s6 on db2129 is CRITICAL: 51.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2129&var-port=9104
[12:46:45] RECOVERY - MariaDB sustained replica lag on s6 on db2129 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2129&var-port=9104
[13:20:48] PROBLEM - MariaDB sustained replica lag on m3 on db2134 is CRITICAL: 13.5 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2134&var-port=9104
[13:22:48] RECOVERY - MariaDB sustained replica lag on m3 on db2134 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2134&var-port=9104
[15:12:29] urandom: o/
[15:12:37] aqs1010's instances are running with PKI!
[15:13:21] \o/
[15:14:16] So... by "running", we mean that they are using the new bundle, which also contains the old self-signed rootCa, yes?
[15:16:39] running the new truststore with the bundle (self-signed + root PKI) and the TLS cert from PKI (keystore)
[15:17:00] right, sorry, that's what I meant
[15:19:13] so yes, \o/
[15:23:29] nodetool statuses look good, now the only issue could come from clients
[15:23:43] but there shouldn't be any, hopefully :D
[15:23:55] ah, who cares about the clients? :)
[15:23:55] next week, if nothing catches fire, we can proceed with the rest, gradually of course
[15:23:59] ahahhaahh yes yes
[15:27:35] (I mean we don't do TLS verification in any client so far, so it's an easy win)
[17:04:11] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-mysqld-exporter@x1.service on db2101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:09:11] (SystemdUnitFailed) resolved: wmf_auto_restart_prometheus-mysqld-exporter@x1.service on db2101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:30:33] FYI, the upcoming etcd maintenance I mentioned on 4/9 is unlikely to be scheduled for next week. Feel free to plan schema changes touching dbctl. I'll follow up here when we have a more concrete window.
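A minimal sketch of the manual cleanup discussed at 07:36 (there is no grant automation yet, so a stale grant has to be dropped by hand on the affected hosts); the user name and host IP below are placeholders for illustration only and are not taken from clouddb.sql.erb or from this log:

    -- check whether the stale grant for the old IDP host is still in place
    SELECT user, host FROM mysql.user WHERE host = '<old-idp-ip>';
    -- if it is, revoke and drop it manually (placeholders, not real values)
    REVOKE ALL PRIVILEGES, GRANT OPTION FROM '<user>'@'<old-idp-ip>';
    DROP USER '<user>'@'<old-idp-ip>';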