[00:03:14] SRE-tools, Ganeti, Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (Dzahn) I did another wmf-reimage cookbook run on this host and the installation finished, including the grub install. I can't explain why it wouldn't work...
[00:03:49] (SystemdUnitFailed) resolved: dump_cloud_ip_ranges.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:03:51] SRE-tools, Ganeti, Infrastructure-Foundations: ncmonitor1001 install issues (Ganeti VM fails to reboot after "gnt-instance modify") - https://phabricator.wikimedia.org/T357449 (Dzahn)
[00:04:21] SRE-tools, Ganeti, Infrastructure-Foundations: ncmonitor1001 install issues (Ganeti VM fails to reboot after "gnt-instance modify") - https://phabricator.wikimedia.org/T357449 (Dzahn) Open→In progress
[00:16:48] SRE-tools, Ganeti, Infrastructure-Foundations: ncmonitor1001 install issues (Ganeti VM fails to reboot after "gnt-instance modify") - https://phabricator.wikimedia.org/T357449 (Dzahn) The host should be usable now: ` [ncmonitor1001:~] $ uptime 00:15:21 up 1 min, 1 user, load average: 0.15, 0.04,...
[00:37:00] Mail, Infrastructure-Foundations, Trust-and-Safety: Mail from Bishzilla to emergency@wikimedia.org is possibly getting lost - https://phabricator.wikimedia.org/T338032 (RoySmith) Based on https://meta.wikimedia.org/wiki/User:EBarrios_(WMF), it would appear that Eliza Barrios is in charge of the grou...
[08:33:12] XioNoX: should we use a dedicated role for routed ganeti, like role::ganeti_routed? this makes it simpler to keep them apart in Puppet/Cumin.
The underlying logic is wrapped in the profiles anyway, so it's just an additional role
[08:33:46] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (cmooney)
[08:34:21] like the ganeti-test?
[08:35:27] netops, DBA, Infrastructure-Foundations, SRE, and 2 others: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 (cmooney) Open→Resolved a:cmooney Closing - thanks all for the help!
[08:35:38] yeah. Cathal and myself were both a little confused about 2034 until we realised it's a new routed test server :-)
[08:37:46] how confused?
[08:37:49] but yeah feel free to
[08:41:12] ack, I'll add a role later
[08:41:54] until we realised these were the routed servers, we thought the VLAN provisioning went wrong during installation
[08:43:27] ok
[08:46:34] hey.
[08:46:51] I just got confused - I seen a ganeti host on a per-rack vlan and assumed we might have problems
[08:46:59] without realising at first they were the routed-mode ones
[08:56:40] :D
[09:05:26] netops, DBA, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (ops-monitoring-bot) Draining ganeti2023.codfw.wmnet of running VMs
[09:38:14] netops, DBA, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (ops-monitoring-bot) Draining ganeti2024.codfw.wmnet of running VMs
[10:24:39] netops, DBA, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (klausman)
[10:34:21] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices -
https://phabricator.wikimedia.org/T355544 (cmooney)
[12:17:09] SRE-tools, Cloud-VPS, Infrastructure-Foundations, Spicerack, Patch-For-Review: Extend "test-cookbook" to support wmcs-cookbooks - https://phabricator.wikimedia.org/T345069 (taavi) Open→Resolved a:taavi
[15:31:46] netops, Infrastructure-Foundations, SRE, cloud-services-team, User-aborrero: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 (aborrero)
[15:31:56] netops, Infrastructure-Foundations, SRE, cloud-services-team, User-aborrero: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 (aborrero) Stalled→Open In a 2024-02-14 network sync meeting we decided to continue moving older cloudvirts into the new single NI...
[15:51:30] netops, DBA, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9a43620e-deca-432c-aa1f-5d6e939b51bc) set by cmooney@cumin...
[16:06:45] netops, Infrastructure-Foundations, cloud-services-team, User-aborrero: clouddb: evaluate moving them into cloud-private - https://phabricator.wikimedia.org/T357543 (aborrero)
[16:07:29] netops, Infrastructure-Foundations, cloud-services-team, User-aborrero: clouddb: evaluate moving them into cloud-private - https://phabricator.wikimedia.org/T357543 (aborrero) p:Triage→Medium
[16:07:46] netops, DBA, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ec1ab967-b8f5-4bfd-914e-e76afe369468) set by cmooney@cumin...
[16:08:59] CAS-SSO, Infrastructure-Foundations, Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥): Correct IDP login page Privacy Policy - https://phabricator.wikimedia.org/T350129 (thcipriani)
[16:11:07] SRE-tools, Ganeti, Infrastructure-Foundations: ncmonitor1001 install issues (Ganeti VM fails to reboot after "gnt-instance modify") - https://phabricator.wikimedia.org/T357449 (BCornwall) In progress→Resolved a:BCornwall I'm still not sure where the problem lies and am concerned that this...
[16:14:57] netops, DBA, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (cmooney) All links moved and all devices pinging ok again.
[16:16:09] netops, DBA, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (ABran-WMF) awesome, will start repooling, thanks @cmooney
[16:36:05] netops, DBA, DC-Ops, Infrastructure-Foundations, and 4 others: Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547 (jijiki)
[16:48:21] netops, DBA, DC-Ops, Infrastructure-Foundations, and 4 others: Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547 (jijiki)
[17:08:29] Mail, Infrastructure-Foundations, Trust-and-Safety: Mail from Bishzilla to emergency@wikimedia.org is possibly getting lost - https://phabricator.wikimedia.org/T338032 (jhathaway) @RoySmith I also asked ITS if we could use phabricator to communicate, since it is accessible by volunteers.
[17:39:49] (PuppetZeroResources) firing: Puppet has failed generate resources on puppetserver2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[19:19:48] netops, DBA, DC-Ops, Infrastructure-Foundations, and 5 others: Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547 (lmata)
[19:24:49] (PuppetZeroResources) resolved: Puppet has failed generate resources on puppetserver2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[19:33:34] (SystemdUnitFailed) firing: (2) sync-puppet-ca.service on puppetserver2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:43:34] (SystemdUnitFailed) firing: (2) sync-puppet-ca.service on puppetserver2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:53:34] (SystemdUnitFailed) firing: (2) sync-puppet-ca.service on puppetserver2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:58:34] (SystemdUnitFailed) firing: (3) puppetserver.service on puppetserver2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:03:34] (SystemdUnitFailed) firing: (3) puppetserver.service on puppetserver2003:9100 -
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:09:04] ^ bounced sync-puppet-ca.service (unblocked by https://gerrit.wikimedia.org/r/c/operations/puppet/+/1003486) and following that puppetserver.service, let's see if that fixes it
[20:09:30] 2003 is not in the SVC records, so no prod impact currently
[20:13:34] (SystemdUnitFailed) resolved: (3) puppetserver.service on puppetserver2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:04:34] (DiskSpace) firing: Disk space idp1002:9100:/ 5.971% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=idp1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[22:12:04] Mail, Infrastructure-Foundations, MW-1.42-notes (1.42.0-wmf.18; 2024-02-13), User-notice: Stop sending change notification email if edit is done by a bot - https://phabricator.wikimedia.org/T356984 (IKhitron) Well, I've just tried four most significant scenarios, and it works fine.
[22:23:33] (SystemdUnitFailed) firing: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:48:34] (SystemdUnitFailed) resolved: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:04:34] (DiskSpace) resolved: Disk space idp1002:9100:/ 5.909% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=idp1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace