[09:02:04] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[09:03:24] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[09:08:34] (SystemdUnitFailed) firing: user@11984.service on bast1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:33:34] (SystemdUnitFailed) resolved: user@11984.service on bast1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:27:26] volans, XioNoX: What's the status of https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/981472
[10:28:33] It still needs some love, but Ganeti took all my attention
[10:28:36] there are a few outstanding comments from Riccardo on the patch still
[10:28:43] most minor
[10:29:03] XioNoX: would you object if I took it and tried to address the outstanding issues?
[10:29:23] Hugh was reimaging hosts in rack A3 yesterday and had problems
[10:29:36] He was just trying to reimage them and leave them on the old vlan
[10:30:13] which isn't working, and can't unless/until the gateway is moved to the switches. but tbh we don't really want to have people reimaging back onto those old vlans so no loss
[10:32:52] topranks: go for it, I don't see any big outstanding comment, the only one is really minor, clearing the PTR DNS cache
[10:33:01] yeah
[10:34:01] the more major one is if we should issue "reboot" instead of "systemctl restart networking"
[10:34:06] for the re-image mode I think it's all "theoretically" done, but never tested, so it needs testing and fixing whatever shows up and polishing
[10:34:30] you sorted the argparse thing I think
[10:34:46] for the in-place there is probably much more testing needed, I don't think the sed commands will work out of the box
[10:35:29] I'd recommend focusing on the re-image mode, as it will be the cleanest
[10:36:46] oh, now I remember, there is also https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/979121 and https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/979040 as prerequisites
[10:37:23] XioNoX, topranks: following up from a few days ago, I made a patch to create a new role for routed Ganeti which switches the existing hosts over to it: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1003605
[10:38:30] XioNoX: yeah the reimage mode is definitely the main one we want
[10:39:00] that also addresses the current shortcoming - not being able to reimage hosts connected to lsw's on legacy vlan
[10:39:17] moritzm: thanks, looks great
[10:39:59] topranks: I'm seeing the end of the ganeti tunnel, so I might be able to look at those patches later today or tomorrow
[10:40:14] XioNoX: ok to merge/deploy right away or do you prefer to wait until the current set of changes has landed?
[10:40:26] moritzm: anytime is fine
[10:40:35] k, doing that now then
[10:40:40] I don't think they will conflict
[10:47:04] XioNoX: good news on ganeti! don't worry about the host vlan move stuff I can pick it up, it's not super urgent
[10:47:08] merged (except 2033, which has Puppet disabled currently, but there's no rush)
[11:48:56] netops, Infrastructure-Foundations, SRE: BGP peering from LSW to K8s hosts using loopback IP not IRB - https://phabricator.wikimedia.org/T357619 (cmooney) Just a bit more background, I discovered this looking at a tcpdump, this is //lsw1-a4-codfw// trying to establish BGP to //mw2383//: ` 11:10:04.59...
[11:57:31] sorry was in meeting, anything you need from me?
[12:07:34] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[12:11:46] netops, Infrastructure-Foundations, SRE, cloud-services-team, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1031.eqiad.wmnet with OS bookworm
[12:58:34] (DiskSpace) firing: Disk space idp1002:9100:/ 5.11% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=idp1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[13:01:13] netops, Infrastructure-Foundations, SRE, cloud-services-team, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1031.eqiad.wmnet with OS bookworm com...
[13:46:30] netops, Infrastructure-Foundations, SRE, cloud-services-team, User-aborrero: openstack: nova refuses to admit a compute node after a reimage - https://phabricator.wikimedia.org/T357631 (aborrero)
[14:04:21] netops, Infrastructure-Foundations, SRE, cloud-services-team, User-aborrero: openstack: nova refuses to admit a compute node after a reimage - https://phabricator.wikimedia.org/T357631 (aborrero) https://docs.openstack.org/nova/latest/admin/troubleshooting/orphaned-allocations.html
[14:11:29] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (cmooney)
[14:12:12] netops, DBA, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (cmooney) Open→Resolved a:cmooney All looking good, closing task. Thanks everyone for their assistance.
[14:13:01] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1064005 let's see where that goes
[14:24:59] netbox, Infrastructure-Foundations, Patch-For-Review: Netbox custom validator: don't require a cable ID on "planned" cables - https://phabricator.wikimedia.org/T357259 (ayounsi) Tested and works as expected.
[14:25:08] netbox, Infrastructure-Foundations, Patch-For-Review: Netbox custom validator: don't require a cable ID on "planned" cables - https://phabricator.wikimedia.org/T357259 (ayounsi) Open→Resolved
[14:57:39] netops, Ganeti, Infrastructure-Foundations, SRE, Patch-For-Review: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (ayounsi) In theory if all those patches are merged/deployed, the VM will be using /32 IPs from early_command.sh all the way to its final state and...
[15:16:14] SRE-tools, Infrastructure-Foundations: Decommission cookbook: lock per switch - https://phabricator.wikimedia.org/T353513 (ayounsi) Yeah that would work too but might not be worth it as the cookbook main role is to run `configure_switch_interfaces()` and might be refactored in {T344326}
[15:49:48] netops, DBA, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=dc8a2b8d-561d-404c-ac7f-f64637c16dd1) set by cmooney@cumin...
[15:53:17] netops, Infrastructure-Foundations, SRE, cloud-services-team, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 (aborrero)
[15:58:37] netops, DBA, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=23a82a8c-672f-4105-8a05-0b7dbbb4cb97) set by cmooney@cumin...
[16:08:36] netops, Ganeti, Infrastructure-Foundations, SRE, Patch-For-Review: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (ayounsi) Started to write the doc over there : https://wikitech.wikimedia.org/wiki/Ganeti#Routed_Ganeti
[16:11:13] netops, DBA, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 (cmooney) All moves now complete, ports up on new switch and all devices pinging ok!
[16:11:56] netops, DBA, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 (ABran-WMF) amazing, thanks @cmooney! will start repooling
[16:13:47] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (cmooney)
[16:58:49] (DiskSpace) firing: Disk space idp1002:9100:/ 4.817% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=idp1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[17:05:51] looks like we lost compression of log files ^
[17:07:02] I don't see any logrotate rules
[17:24:13] jhathaway: need a hand?
[17:25:35] thanks cdanis but I don't think so, unless you know off hand whether we should have logrotate rules, it appears we stopped rotating and compressing at some point, but I am not sure why, perhaps we intended to keep cas logs forever?
[17:25:44] that doesn't sound right
[17:26:46] seems like log4j2 can do all the things, given the right set of xml incantations
[17:26:47] yeah I don't know offhand but I'd be very surprised if we intended to keep logs forever
[17:27:09] okay, well I try to cut a patch, and see if any other folks know any history
[17:27:14] *I'll
[18:18:34] (DiskSpace) resolved: Disk space idp1002:9100:/ 5.87% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=idp1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[22:23:34] (SystemdUnitFailed) firing: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:48:34] (SystemdUnitFailed) resolved: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed