[07:47:31] <jynus>	 db2097's data was recovered yesterday, saying this as a reminder of what I mentioned on our meeting
[08:32:30] <icinga-wm_>	 PROBLEM - MariaDB sustained replica lag on s2 on db2189 is CRITICAL: 52.25 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2189&var-port=9104
[08:32:38] <icinga-wm_>	 PROBLEM - MariaDB sustained replica lag on s3 on db2190 is CRITICAL: 44.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2190&var-port=9104
[08:33:36] <icinga-wm_>	 PROBLEM - MariaDB sustained replica lag on s1 on db2188 is CRITICAL: 36.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2188&var-port=9104
[08:33:36] <icinga-wm_>	 PROBLEM - MariaDB sustained replica lag on s6 on db2180 is CRITICAL: 90.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2180&var-port=9104
[08:33:38] <icinga-wm_>	 RECOVERY - MariaDB sustained replica lag on s3 on db2190 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2190&var-port=9104
[08:34:30] <icinga-wm_>	 RECOVERY - MariaDB sustained replica lag on s2 on db2189 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2189&var-port=9104
[08:34:38] <icinga-wm_>	 RECOVERY - MariaDB sustained replica lag on s1 on db2188 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2188&var-port=9104
[08:35:38] <icinga-wm_>	 RECOVERY - MariaDB sustained replica lag on s6 on db2180 is OK: (C)2 ge (W)1 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2180&var-port=9104
[13:29:59] <topranks>	 marostegui: the issue with the reimage for db2137 and es2026 is due to an issue we've found for servers connected to those new switches 
[13:30:19] <topranks>	 we're currently looking at our options to resolve it so the normal process works as expected 
[13:30:38] <marostegui>	 topranks: I just commented https://phabricator.wikimedia.org/T357951#9563402
[13:30:54] <marostegui>	 topranks: So that means all the hosts being migrated cannot be reimaged?
[13:33:14] <topranks>	 no we can reimage them 
[13:33:35] <topranks>	 we just need to make some adjustments to allow for it 
[13:33:58] <topranks>	 I gather changing the IPs of the hosts is not an option?  The simplest way forward is to reimage with a new IP
[13:33:58] <marostegui>	 topranks: Yeah, but changing the VLAN and changing the IP isn't just a minor adjustment for us
[13:34:45] <topranks>	 alternately we need to temporarily change dhcp on the core router side to allow it work from the lsw, which I can do for you
[13:34:49] <topranks>	 ok 
[13:35:15] <marostegui>	 topranks: My next question would...should we really continue with the migration if this is something we are going to maybe need for more hosts?
[13:35:35] <marostegui>	 Like changing IPs is something we need to plan for
[13:36:17] <topranks>	 once the migration is completed we won't have the issue - the problem is supporting simultaneously reimage from hosts on the old switches (where CR needs to do DHCP), and new switches (where the connected switch needs to do it)
[13:36:33] <topranks>	 noted about the IPs changing 
[13:36:37] <marostegui>	 ah ok I see, that's good
[13:37:02] <topranks>	 we deliberately moved the servers prior to changing their IPs, to give teams time to plan those changes 
[13:37:23] <topranks>	 we should have tested for the transitority state with regards to reimage 
[13:37:34] <topranks>	 anyway bear with me a few mins and we can try those two again 
[13:37:48] <marostegui>	 topranks: thanks I appreciate the help
[13:38:00] <topranks>	 nah sorry for the hassle here this is on us 
[13:38:02] <marostegui>	 topranks: Also with once the migration is completed you mean the racks involved here https://phabricator.wikimedia.org/T355544 or all the rows?
[13:38:35] <topranks>	 yeah all racks in codfw row A and B 
[13:38:43] <marostegui>	 gotcha thanks
[13:38:49] <topranks>	 and ultimately we'll want to re-ip all those hosts, but we can plan that no rush https://phabricator.wikimedia.org/T354869
[13:39:04] <marostegui>	 yeah, re-ip all those hosts is going to be "fun"
[13:39:48] <marostegui>	 I have added our team to https://phabricator.wikimedia.org/T354878
[13:39:55] <marostegui>	 Because this is likely going be a KR 
[13:39:58] <marostegui>	 (for us)
[13:40:06] <marostegui>	 As it is quite some tedious work
[13:40:14] <marostegui>	 I'll discuss with the team
[13:40:43] <topranks>	 ok, well like I say we don't want to make life difficult for people, hence we plan to continue to support the legacy vlans for as long as they are needed 
[13:40:57] <volans>	 something that should probably be automated (the re-numbering)
[13:41:46] <marostegui>	 volans: I agree, especially all the dbctl related changes
[13:41:49] <marostegui>	 which is very error prone
[13:42:13] <marostegui>	 We obviously will need to switch all the masters in both dcs, so we can reimage them
[13:46:58] <volans>	 indeed
[13:47:17] <topranks>	 marostegui: do you mind if I kick off the reimage for db2137 when ready?  just to double check it goes ok/catch any issues at DHCP stage?
[13:47:27] <marostegui>	 you can go for it anytime you want yeah
[13:48:00] <volans>	 idelly it would be nice to have dbctl support in spicerack/cookbooks, I don't recall how usable as a library is dbctl right away or if it needs modifications
[13:49:05] <topranks>	 marostegui: thanks, was there any particular flags you were passing to the cookbook?
[13:49:20] <marostegui>	 nah, just --os bookworm 
[13:49:25] <marostegui>	 but you'll need "--new" now too
[13:49:28] <topranks>	 ok cool thanks 
[13:50:23] <icinga-wm_>	 PROBLEM - MariaDB sustained replica lag on s6 on db1180 is CRITICAL: 92.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1180&var-port=9104
[13:51:14] <arnaudb>	 host was downtimed 🤔 anyway, alert fires after it's caught on
[13:52:23] <icinga-wm_>	 RECOVERY - MariaDB sustained replica lag on s6 on db1180 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1180&var-port=9104
[14:00:45] <marostegui>	 topranks: I see es2026 replies to ping, so I guess the fix worked?
[14:01:26] <topranks>	 no I haven't actually kicked it off, so I guess it just rebooted off the hard disk without reimaging 
[14:01:38] <marostegui>	 Ah yeah, it was already up
[14:02:15] <topranks>	 sry I'm going quite slowly and double-checking a lot of things here as it's a little delicate to add the new lsw interfaces, but that's a once-off 
[14:02:22] <marostegui>	 no worries
[14:02:27] <marostegui>	 no rush
[14:22:55] <topranks>	 marostegui: I kicked off reimage for es2026, but I need to tell it what puppet version to use?
[14:23:19] <marostegui>	 go for 5
[14:24:23] <topranks>	 ok thanks
[14:25:33] <volans>	 for the other, if the host is still up with the old OS and puppet runs,it should be back in puppetdb and you shouldn't need the --new on the reimage
[14:26:32] <marostegui>	 I can kick the other one if you want topranks 
[14:27:18] <topranks>	 I only added the vlan ints on row A for now, db2137 is in row B so not quite yet
[14:27:30] <marostegui>	 ooook!
[15:10:11] <marostegui>	 es2026 looking good!
[15:15:09] <topranks>	 marostegui: yep and network seems happy too :)
[15:15:35] <topranks>	 I'm just about to do db2137, puppet5 for that also?
[15:15:39] <marostegui>	 yes please
[15:15:40] <marostegui>	 thanks
[15:16:12] <topranks>	 cool, also while I have you, db2106 and db2146 are in rack A8 we're moving switch in later today 
[15:16:19] <topranks>	 did you manage to depool trhose?
[15:16:25] <marostegui>	 I believe arnaudb did?
[15:16:44] <arnaudb>	 will do at 16:45
[15:16:53] <arnaudb>	 (my tz)
[15:16:58] <marostegui>	 sweet thanks!
[15:42:25] <arnaudb>	 all good on my end topranks, I anticipated a bit the depooling 
[15:42:47] <topranks>	 great, thanks guys!
[16:08:54] <topranks>	 arnaudb: those hosts moved now and looking ok 
[16:08:59] <marostegui>	 topranks: db2137 looking good!
[16:09:15] <topranks>	 ok great!  sorry for the hassle 
[16:09:47] <marostegui>	 thanks for taking the time to fix it
[16:09:48] <topranks>	 row A is now done so we won't have the issue there 
[16:09:50] <marostegui>	 I'll close the task now
[16:10:45] <topranks>	 row B I'll leave as is - meaning it'll work for migrated hosts - if a host connected to old swithc is reimaged we can temporary re-enable dhcp relay on the CRs
[16:11:03] <arnaudb>	 amazing thanks topranks :) will start repooling
[22:37:48] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on dbproxy1026:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[22:42:48] <jinxer-wm>	 (PuppetZeroResources) firing: (2) Puppet has failed generate resources on db1238:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[22:47:48] <jinxer-wm>	 (PuppetZeroResources) firing: (3) Puppet has failed generate resources on db1238:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[22:52:49] <jinxer-wm>	 (PuppetZeroResources) firing: (3) Puppet has failed generate resources on db1238:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[22:57:49] <jinxer-wm>	 (PuppetZeroResources) firing: (5) Puppet has failed generate resources on db1201:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:02:48] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on db1185:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:02:49] <jinxer-wm>	 (PuppetZeroResources) firing: (6) Puppet has failed generate resources on db1201:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:07:48] <jinxer-wm>	 (PuppetZeroResources) firing: (5) Puppet has failed generate resources on db1201:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:17:49] <jinxer-wm>	 (PuppetZeroResources) firing: (2) Puppet has failed generate resources on db1185:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:22:48] <jinxer-wm>	 (PuppetZeroResources) firing: (4) Puppet has failed generate resources on db1168:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:27:49] <jinxer-wm>	 (PuppetZeroResources) firing: (4) Puppet has failed generate resources on db1168:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:32:48] <jinxer-wm>	 (PuppetZeroResources) firing: (2) Puppet has failed generate resources on db1230:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:32:49] <jinxer-wm>	 (PuppetZeroResources) firing: (4) Puppet has failed generate resources on db1168:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:37:49] <jinxer-wm>	 (PuppetZeroResources) firing: (3) Puppet has failed generate resources on db1230:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:37:49] <jinxer-wm>	 (PuppetZeroResources) firing: (6) Puppet has failed generate resources on db1168:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:42:49] <jinxer-wm>	 (PuppetZeroResources) firing: (3) Puppet has failed generate resources on db1230:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:47:48] <jinxer-wm>	 (PuppetZeroResources) resolved: (3) Puppet has failed generate resources on db1230:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:47:53] <jinxer-wm>	 (PuppetZeroResources) firing: (6) Puppet has failed generate resources on db1168:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:52:49] <jinxer-wm>	 (PuppetZeroResources) firing: (4) Puppet has failed generate resources on db1185:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:57:48] <jinxer-wm>	 (PuppetZeroResources) firing: (7) Puppet has failed generate resources on db1161:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:58:04] <jinxer-wm>	 (PuppetZeroResources) firing: (7) Puppet has failed generate resources on db1161:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources