[07:47:31] db2097's data was recovered yesterday, saying this as a reminder of what I mentioned in our meeting
[08:32:30] PROBLEM - MariaDB sustained replica lag on s2 on db2189 is CRITICAL: 52.25 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2189&var-port=9104
[08:32:38] PROBLEM - MariaDB sustained replica lag on s3 on db2190 is CRITICAL: 44.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2190&var-port=9104
[08:33:36] PROBLEM - MariaDB sustained replica lag on s1 on db2188 is CRITICAL: 36.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2188&var-port=9104
[08:33:36] PROBLEM - MariaDB sustained replica lag on s6 on db2180 is CRITICAL: 90.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2180&var-port=9104
[08:33:38] RECOVERY - MariaDB sustained replica lag on s3 on db2190 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2190&var-port=9104
[08:34:30] RECOVERY - MariaDB sustained replica lag on s2 on db2189 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2189&var-port=9104
[08:34:38] RECOVERY - MariaDB sustained replica lag on s1 on db2188 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2188&var-port=9104
[08:35:38] RECOVERY - MariaDB sustained replica lag on s6 on db2180 is OK: (C)2 ge (W)1 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2180&var-port=9104
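The lag checks above fire when the sustained replication delay sits at or above the 2-second critical threshold (the "ge 2" in the alert text). A quick manual check on one of the affected replicas might look like the sketch below; it assumes shell access to the host and local root access to the MariaDB socket, and db2189 is just one of the hosts named in the alerts.

    # Show current replication delay and thread state (run on the replica itself).
    # Relevant fields: Seconds_Behind_Master, Slave_IO_Running, Slave_SQL_Running.
    sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Seconds_Behind_Master|Slave_(IO|SQL)_Running'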
[13:29:59] marostegui: the reimage failures for db2137 and es2026 are due to an issue we've found for servers connected to those new switches
[13:30:19] we're currently looking at our options to resolve it so the normal process works as expected
[13:30:38] topranks: I just commented https://phabricator.wikimedia.org/T357951#9563402
[13:30:54] topranks: So that means all the hosts being migrated cannot be reimaged?
[13:33:14] no we can reimage them
[13:33:35] we just need to make some adjustments to allow for it
[13:33:58] I gather changing the IPs of the hosts is not an option? The simplest way forward is to reimage with a new IP
[13:33:58] topranks: Yeah, but changing the VLAN and changing the IP isn't just a minor adjustment for us
[13:34:45] alternatively we need to temporarily change dhcp on the core router side to allow it to work from the lsw, which I can do for you
[13:34:49] ok
[13:35:15] topranks: My next question would be...should we really continue with the migration if this is something we are maybe going to need for more hosts?
[13:35:35] Like changing IPs is something we need to plan for
[13:36:17] once the migration is completed we won't have the issue - the problem is simultaneously supporting reimages from hosts on the old switches (where the CR needs to do DHCP) and the new switches (where the connected switch needs to do it)
[13:36:33] noted about the IPs changing
[13:36:37] ah ok I see, that's good
[13:37:02] we deliberately moved the servers prior to changing their IPs, to give teams time to plan those changes
[13:37:23] we should have tested for the transitory state with regard to reimaging
[13:37:34] anyway bear with me a few mins and we can try those two again
[13:37:48] topranks: thanks I appreciate the help
[13:38:00] nah sorry for the hassle here, this is on us
[13:38:02] topranks: Also, with "once the migration is completed" do you mean the racks involved here https://phabricator.wikimedia.org/T355544 or all the rows?
[13:38:35] yeah all racks in codfw row A and B
[13:38:43] gotcha thanks
[13:38:49] and ultimately we'll want to re-ip all those hosts, but we can plan that, no rush https://phabricator.wikimedia.org/T354869
[13:39:04] yeah, re-ipping all those hosts is going to be "fun"
[13:39:48] I have added our team to https://phabricator.wikimedia.org/T354878
[13:39:55] Because this is likely going to be a KR
[13:39:58] (for us)
[13:40:06] As it is quite some tedious work
[13:40:14] I'll discuss with the team
[13:40:43] ok, well like I say we don't want to make life difficult for people, hence we plan to continue supporting the legacy vlans for as long as they are needed
[13:40:57] something that should probably be automated (the re-numbering)
[13:41:46] volans: I agree, especially all the dbctl related changes
[13:41:49] which are very error prone
[13:42:13] We obviously will need to switch over all the masters in both dcs, so we can reimage them
[13:46:58] indeed
[13:47:17] marostegui: do you mind if I kick off the reimage for db2137 when ready? just to double check it goes ok/catch any issues at the DHCP stage?
[13:47:27] you can go for it anytime you want yeah
[13:48:00] ideally it would be nice to have dbctl support in spicerack/cookbooks, I don't recall how usable dbctl is as a library right away or whether it needs modifications
[13:49:05] marostegui: thanks, were there any particular flags you were passing to the cookbook?
[13:49:20] nah, just --os bookworm
[13:49:25] but you'll need "--new" now too
[13:49:28] ok cool thanks
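For reference, the reimage invocation being discussed would look roughly like the sketch below. The --os bookworm and --new flags are the ones mentioned just above (--new because the host is not currently known to PuppetDB); the cookbook name sre.hosts.reimage and running it via sudo from a cumin/cookbook host are assumptions, not taken from this log.

    # Rough sketch of the reimage run being discussed (cookbook name assumed).
    sudo cookbook sre.hosts.reimage --os bookworm --new db2137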
[13:50:23] PROBLEM - MariaDB sustained replica lag on s6 on db1180 is CRITICAL: 92.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1180&var-port=9104
[13:51:14] host was downtimed 🤔 anyway, alert fires after it's caught up
[13:52:23] RECOVERY - MariaDB sustained replica lag on s6 on db1180 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1180&var-port=9104
[14:00:45] topranks: I see es2026 replies to ping, so I guess the fix worked?
[14:01:26] no I haven't actually kicked it off, so I guess it just rebooted off the hard disk without reimaging
[14:01:38] Ah yeah, it was already up
[14:02:15] sorry I'm going quite slowly and double-checking a lot of things here as it's a little delicate to add the new lsw interfaces, but that's a once-off
[14:02:22] no worries
[14:02:27] no rush
[14:22:55] marostegui: I kicked off the reimage for es2026, but I need to tell it what puppet version to use?
[14:23:19] go for 5
[14:24:23] ok thanks
[14:25:33] for the other one, if the host is still up with the old OS and puppet runs, it should be back in puppetdb and you shouldn't need the --new on the reimage
[14:26:32] I can kick off the other one if you want topranks
[14:27:18] I only added the vlan ints on row A for now, db2137 is in row B so not quite yet
[14:27:30] ooook!
[15:10:11] es2026 looking good!
[15:15:09] marostegui: yep and the network seems happy too :)
[15:15:35] I'm just about to do db2137, puppet5 for that also?
[15:15:39] yes please
[15:15:40] thanks
[15:16:12] cool, also while I have you, db2106 and db2146 are in rack A8, whose switch we're moving later today
[15:16:19] did you manage to depool those?
[15:16:25] I believe arnaudb did?
[15:16:44] will do at 16:45
[15:16:53] (my tz)
[15:16:58] sweet thanks!
[15:42:25] all good on my end topranks, I did the depooling a bit ahead of time
[15:42:47] great, thanks guys!
[16:08:54] arnaudb: those hosts are moved now and looking ok
[16:08:59] topranks: db2137 looking good!
[16:09:15] ok great! sorry for the hassle
[16:09:47] thanks for taking the time to fix it
[16:09:48] row A is now done so we won't have the issue there
[16:09:50] I'll close the task now
[16:10:45] row B I'll leave as is - meaning it'll work for migrated hosts - if a host connected to an old switch is reimaged we can temporarily re-enable dhcp relay on the CRs
[16:11:03] amazing thanks topranks :) will start repooling
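The depool/repool steps mentioned for db2106 and db2146 around the A8 switch move would typically go through dbctl; a minimal sketch follows. The exact subcommands and the commit messages are assumptions based on dbctl's usual CLI, not commands copied from this log.

    # Take the two hosts out of rotation before the A8 switch move, then commit the change.
    sudo dbctl instance db2106 depool
    sudo dbctl instance db2146 depool
    sudo dbctl config commit -m "Depool db2106 and db2146 ahead of the A8 switch move"
    # After the move, once replication has caught up, put them back in rotation.
    sudo dbctl instance db2106 pool
    sudo dbctl instance db2146 pool
    sudo dbctl config commit -m "Repool db2106 and db2146 after the A8 switch move"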
[22:37:48] (PuppetZeroResources) firing: Puppet has failed generate resources on dbproxy1026:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[22:42:48] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on db1238:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[22:47:48] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on db1238:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[22:52:49] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on db1238:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[22:57:49] (PuppetZeroResources) firing: (5) Puppet has failed generate resources on db1201:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:02:48] (PuppetZeroResources) firing: Puppet has failed generate resources on db1185:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:02:49] (PuppetZeroResources) firing: (6) Puppet has failed generate resources on db1201:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:07:48] (PuppetZeroResources) firing: (5) Puppet has failed generate resources on db1201:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:17:49] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on db1185:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:22:48] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on db1168:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:27:49] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on db1168:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:32:48] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on db1230:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:32:49] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on db1168:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:37:49] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on db1230:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:37:49] (PuppetZeroResources) firing: (6) Puppet has failed generate resources on db1168:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:42:49] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on db1230:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:47:48] (PuppetZeroResources) resolved: (3) Puppet has failed generate resources on db1230:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:47:53] (PuppetZeroResources) firing: (6) Puppet has failed generate resources on db1168:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:52:49] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on db1185:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:57:48] (PuppetZeroResources) firing: (7) Puppet has failed generate resources on db1161:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:58:04] (PuppetZeroResources) firing: (7) Puppet has failed generate resources on db1161:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources