[01:12:36] PROBLEM - MariaDB sustained replica lag on s4 on db2110 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2110&var-port=9104
[01:14:06] RECOVERY - MariaDB sustained replica lag on s4 on db2110 is OK: (C)2 ge (W)1 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2110&var-port=9104
[01:21:30] PROBLEM - MariaDB sustained replica lag on s4 on db2110 is CRITICAL: 2.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2110&var-port=9104
[01:23:00] RECOVERY - MariaDB sustained replica lag on s4 on db2110 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2110&var-port=9104
[06:37:09] I have switched the es1 master in eqiad (which is a NOOP)
[09:03:34] it seems to me that the tooling issues with Puppet 7 and the DBs are sorted by now. ok to proceed with moving the rest of the mariadb::core_test role to Puppet 7?
[09:03:41] 2 out of 4 servers are already using it
[09:10:55] moritzm: yep
[09:19:09] ok, I'm switching these now
[09:27:18] I am deploying a schema change on the old s4 eqiad master
[09:27:29] I have depooled it
[09:32:53] mariadb::core_test has been moved to Puppet 7, i.e. db2102 and db1125 are now also using it
[10:19:22] I'd be grateful for a review/+1 of https://gerrit.wikimedia.org/r/c/operations/puppet/+/998305 please? Removing now-drained nodes from the eqiad rings
[10:24:57] I just did it, Emperor
[10:26:00] Thanks :)
[10:40:02] looking at other mariadb roles and Puppet 7: pc1014 has been on Puppet 7 since Nov 20, is the mariadb::parsercache role also good to migrate or would you prefer additional canary/canaries first?
[10:40:31] yeah, parsercache can be migrated
[10:43:46] ok, I'll switch it in ~10m
[10:56:16] godog: I have merged your heartbeat change - ran puppet manually on a master and all is looking good. Stopped and started the service manually just to verify and all is fine!
[10:57:21] marostegui: neat, thank you! appreciate it
[11:13:43] mariadb::parsercache is switched to Puppet 7 now
[11:38:45] btullis: I know you are generally quite busy, having to handle so many different things, but I would like to thank you personally for taking care of T316655 so promptly. That makes me happy :-D
[11:38:47] T316655: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655
[11:49:46] jynus: You're welcome, and thanks for the kind words.
[11:53:04] no action for backups today (re: network maintenance). I think the only thing is the swift depool
[12:46:44] the backups using db2097.codfw.wmnet failed tonight
[12:49:10] ERROR - xtrabackup version mismatch - xtrabackup version: {'major': '10.6', 'minor': 15, 'vendor': 'MariaDB'}, backup version: {'major': '10.6', 'minor': 16, 'vendor': 'MariaDB'}
[12:49:39] It is as if having good logs makes debugging better (!)
[12:50:35] I just need to upgrade mariadb on dbprov2004 and continue the existing backups
[13:49:16] I am going to restart zarcillo, so orchestrator won't be available for a few minutes
[13:53:10] back
[15:12:58] db2194 is down?
[15:13:34] ah yes, it is arnaud's testing
[15:15:04] Emperor: how are we looking for the network maintenance in codfw rack A2?
[15:15:25] did you have a chance to depool the swift nodes and thanos-fe2001?
[15:15:33] topranks: my Cunning Plan was to do the depools at 15:45 UTC if that's OK with you? I'd rather not depool for longer than necessary
[15:18:18] absolutely, that is fine, I'm just going through pre-checks making sure everything is in order
[15:18:45] Emperor: actually, do you know anything about moss-be2001?
[15:20:52] it's in that rack, but I'd not tagged any team as owning it, I see you are mentioned in some related phab tasks
[15:21:11] I can't ssh to it, nor log in as root on the serial console, so unsure what state it's in
[15:22:34] topranks: yeah, that's not currently in service
[15:22:50] Emperor: never mind, its primary link is down as things stand
[15:23:03] ok thanks, we'll go ahead and move it, cheers
[15:23:19] Yeah, please do. I think it's due a reimage in my Copious Free Time
[15:26:11] topranks: confirm https://phabricator.wikimedia.org/T355862#9521413 - we are not doing this rack today but tomorrow, right?
[15:29:06] marostegui: correct, we are doing rack A2 today, no db hosts there
[15:29:12] topranks: thanks
[15:29:21] np
[16:12:21] Emperor: fwiw moss-be2001's cable was moved and it now has a working network connection (previously it was enabled on the switch but down; possibly a bad cable that got replaced)
[16:12:35] I can't ssh though, I think it does need a reimage
[16:13:57] Emperor: as for the depooled swift/thanos nodes, you can repool them now
[16:14:00] thanks
[16:15:03] topranks: doing so, thanks
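
For readers of the 01:12-01:23 alerts above: the "(C)2 ge (W)1 ge 0.2" text encodes a critical threshold of 2 seconds, a warning threshold of 1 second, and the currently measured sustained lag, compared with "ge" (greater or equal). Below is a minimal Python sketch of that classification logic; the thresholds are taken from the alert text, but the function name and the idea of feeding it a single lag reading are assumptions for illustration, not the actual Icinga/Prometheus check.

    def classify_replica_lag(lag_seconds, warn=1.0, crit=2.0):
        """Map a sustained replication lag reading (seconds) to an alert state."""
        if lag_seconds >= crit:   # "2.2 ge 2" -> CRITICAL, as at 01:21:30
            return "CRITICAL"
        if lag_seconds >= warn:
            return "WARNING"
        return "OK"               # "(C)2 ge (W)1 ge 0.2" -> OK, as at 01:14:06

    print(classify_replica_lag(2.2))  # CRITICAL
    print(classify_replica_lag(0.2))  # OK (recovery)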
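
The 12:49:10 failure above comes from a check that the backup tool on the provisioning host matches the MariaDB version that wrote the backup; the remediation noted at 12:50:35 was upgrading mariadb on dbprov2004. A hedged sketch of that kind of comparison follows: the dict shape is copied from the logged error, but the function name and the exact match rule (identical vendor, major and minor) are assumptions, not the real backup tooling.

    def versions_match(tool, backup):
        """True if the local backup tool matches the version that created the backup."""
        return all(tool[k] == backup[k] for k in ("vendor", "major", "minor"))

    tool_version = {"major": "10.6", "minor": 15, "vendor": "MariaDB"}    # on dbprov2004, pre-upgrade
    backup_version = {"major": "10.6", "minor": 16, "vendor": "MariaDB"}  # version that wrote the backup

    if not versions_match(tool_version, backup_version):
        print("ERROR - xtrabackup version mismatch - "
              f"xtrabackup version: {tool_version}, backup version: {backup_version}")
        # remediation per 12:50:35: upgrade mariadb on the provisioning host
        # so the minor versions line up, then continue the pending backups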