[01:12:36] PROBLEM - MariaDB sustained replica lag on s4 on db2110 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2110&var-port=9104
[01:14:06] RECOVERY - MariaDB sustained replica lag on s4 on db2110 is OK: (C)2 ge (W)1 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2110&var-port=9104
[01:21:30] PROBLEM - MariaDB sustained replica lag on s4 on db2110 is CRITICAL: 2.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2110&var-port=9104
[01:23:00] RECOVERY - MariaDB sustained replica lag on s4 on db2110 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2110&var-port=9104
[06:37:09] I have switched the es1 master in eqiad (which is a NOOP)
[09:03:34] it seems to me that the tooling issues with Puppet 7 and the DBs are sorted by now. ok to proceed with moving the rest of the mariadb::core_test role to Puppet 7?
[09:03:41] 2 out of 4 servers are already using it
[09:10:55] moritzm: yep
[09:19:09] ok, I'm switching these now
[09:27:18] I am deploying a schema change on the old s4 eqiad master
[09:27:29] I have depooled it
[09:32:53] mariadb::core_test has been moved to Puppet 7, i.e. db2102 and db1125 are now also using it
[10:19:22] I'd be grateful for a review/+1 of https://gerrit.wikimedia.org/r/c/operations/puppet/+/998305 please? Removing now-drained nodes from the eqiad rings
[10:24:57] I just did it, Emperor
[10:26:00] Thanks :)
[10:40:02] looking at other mariadb roles and Puppet 7: pc1014 has been on Puppet 7 since Nov 20, is the mariadb::parsercache role also good to migrate or would you prefer additional canary/canaries first?
[10:40:31] yeah, parsercache can be migrated
[10:43:46] ok, I'll switch it in ~10m
[10:56:16] godog: I have merged your heartbeat change - ran puppet manually on a master and all is looking good. Stopped and started the service manually just to verify and all is fine!
[10:57:21] marostegui: neat, thank you! appreciate it
[11:13:43] mariadb::parsercache is switched to Puppet 7 now
[11:38:45] btullis: I know you are generally quite busy, having to handle so many different things, but I would like to thank you personally for taking care of T316655 so promptly. That makes me happy :-D
[11:38:47] T316655: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655
[11:49:46] jynus: You're welcome, and thanks for the kind words.
[11:53:04] no action for backups today (re: network maintenance). I think the only thing is the swift depool
[12:46:44] the backups using db2097.codfw.wmnet failed tonight
[12:49:10] ERROR - xtrabackup version mismatch - xtrabackup version: {'major': '10.6', 'minor': 15, 'vendor': 'MariaDB'}, backup version: {'major': '10.6', 'minor': 16, 'vendor': 'MariaDB'}
[12:49:39] It is as if having good logs makes debugging better (!)
[12:50:35] I just need to upgrade mariadb on dbprov2004 and continue the existing backups
[13:49:16] I am going to restart zarcillo, so orchestrator won't be available for a few minutes
[13:53:10] back
[15:12:58] db2194 is down?
[15:13:34] ah yes, it is arnaud's testing
[15:15:04] Emperor: how are we looking for the network maintenance in codfw rack A2?
[15:15:25] did you have a chance to depool the swift nodes and thanos-fe2001?
[15:15:33] topranks: my Cunning Plan was to do the depools at 15:45 UTC if that's OK with you? I'd rather not depool for longer than necessary
[15:18:18] absolutely, that is fine, I'm just going through pre-checks making sure everything is in order
[15:18:45] Emperor: actually, do you know anything about moss-be2001?
[15:20:52] it's in that rack, but I'd not tagged any team as owning it, I see you are mentioned in some related phab tasks
[15:21:11] I can't ssh to it, nor log in as root on the serial console, so unsure what state it's in
[15:22:34] topranks: yeah, that's not currently in service
[15:22:50] Emperor: never mind, its primary link is down as things stand
[15:23:03] ok thanks, we'll go ahead and move it, cheers
[15:23:19] Yeah, please do. I think it's due a reimage in my Copious Free Time
[15:26:11] topranks: confirm https://phabricator.wikimedia.org/T355862#9521413 - we are not doing this rack today but tomorrow, right?
[15:29:06] marostegui: correct, we are doing rack A2 today, no db hosts there
[15:29:12] topranks: thanks
[15:29:21] np
[16:12:21] Emperor: fwiw moss-be2001's cable was moved and it now has a working network connection (previously it was enabled on the switch but down; possibly a bad cable that got replaced)
[16:12:35] I can't ssh though, I think it does need a reimage
[16:13:57] Emperor: as for the depooled swift/thanos nodes, you can repool them now
[16:14:00] thanks
[16:15:03] topranks: doing so, thanks
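
For readers of the 01:12-01:23 alerts above: the "(C)2 ge (W)1 ge 0.2" text encodes a critical threshold of 2 seconds, a warning threshold of 1 second, and the currently measured sustained lag, compared with "ge" (greater or equal). Below is a minimal Python sketch of that classification logic; the thresholds are taken from the alert text, but the function name and the idea of feeding it a single lag reading are assumptions for illustration, not the actual Icinga/Prometheus check.

    def classify_replica_lag(lag_seconds, warn=1.0, crit=2.0):
        """Map a sustained replication lag reading (seconds) to an alert state."""
        if lag_seconds >= crit:   # "2.2 ge 2" -> CRITICAL, as at 01:21:30
            return "CRITICAL"
        if lag_seconds >= warn:
            return "WARNING"
        return "OK"               # "(C)2 ge (W)1 ge 0.2" -> OK, as at 01:14:06

    print(classify_replica_lag(2.2))  # CRITICAL
    print(classify_replica_lag(0.2))  # OK (recovery)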
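
The 12:49:10 failure above comes from a check that the backup tool on the provisioning host matches the MariaDB version that wrote the backup; the remediation noted at 12:50:35 was upgrading mariadb on dbprov2004. A hedged sketch of that kind of comparison follows: the dict shape is copied from the logged error, but the function name and the exact match rule (identical vendor, major and minor) are assumptions, not the real backup tooling.

    def versions_match(tool, backup):
        """True if the local backup tool matches the version that created the backup."""
        return all(tool[k] == backup[k] for k in ("vendor", "major", "minor"))

    tool_version = {"major": "10.6", "minor": 15, "vendor": "MariaDB"}    # on dbprov2004, pre-upgrade
    backup_version = {"major": "10.6", "minor": 16, "vendor": "MariaDB"}  # version that wrote the backup

    if not versions_match(tool_version, backup_version):
        print("ERROR - xtrabackup version mismatch - "
              f"xtrabackup version: {tool_version}, backup version: {backup_version}")
        # remediation per 12:50:35: upgrade mariadb on the provisioning host
        # so the minor versions line up, then continue the pending backups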