[08:44:33] elukey: hey hey! about https://phabricator.wikimedia.org/T371400#10315501 I'm wondering if it's not something getting stuck in the late_command.sh script
[08:45:32] XioNoX: o/ I thought the same but didn't find anything..
[08:46:04] also I just noticed that the "box" was titled "Configuring puppet-agent"
[08:47:09] maybe it is me being stupid, lemme check
[08:48:06] yes ok I think I know what's happening, the disks are not in JBOD
[08:48:23] I was convinced I did it
[08:48:27] but I have missed it probably
[08:49:53] and looks like late_command.sh ran properly as the v6 stanza is properly configured in e/n/i
[08:50:41] elukey: but why would it hang?
[08:51:18] I think it expects way more disks and it may end up outside the configured logic if it finds fewer
[08:52:06] going to configure them and kick off another reimage
[09:01:20] all right configured jbods and kicked off again reimage
[09:56:04] XioNoX: with proper JBOD disks all good
[09:56:30] nice!
[09:56:32] late_command may be a little sensitive if it doesn't find what it expects
[09:59:48] late command or partman?
[10:02:38] the error in d-i stated "Configuring puppet-agent"
[10:02:50] seemed more late_command related, but not 100% sure
[10:06:20] elukey: that dialog is what I just saw on wikikube-ctrl1002 as well
[10:07:05] ahahhahah
[10:07:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on idp2004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:07:51] and I did not change disk config or anything...I also just hit enter and the process continued just fine
[10:09:55] iirc it's possible to see where it stopped when switching to the logs tab
[10:12:48] FIRING: [2x] PuppetZeroResources: Puppet has failed generate resources on idp1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:13:19] I'll take a look at the next reimage
[10:13:49] is that an apt message?
[10:14:08] we do apt-install puppet in late_command
[10:15:13] Puppet is failing on the IDP hosts because the secrets for airflow-research have been added, but the patch for the service hasn't been configured yet.
[10:15:17] merged
[10:25:59] slyngs: I had unrelatedly addressed this a bit earlier: https://gerrit.wikimedia.org/r/c/labs/private/+/1090810
[10:26:28] ah, no you meant for the prod service, not PCC
[10:26:34] Yeah, this one: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1090803
[10:26:43] Should run on next puppet run
[10:31:29] topranks: fyi all our BGP sessions in IX.BR flapped, but the link itself stayed up. Noticed it as it triggered a PacketVis alert.
[10:31:40] not much we can do as long as it doesn't happen again
[10:47:48] FIRING: [2x] PuppetZeroResources: Puppet has failed generate resources on idp1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:50:59] XioNoX: yeah, odd alright... hold-timer expired for them?
[10:51:25] I guess that perhaps could be a local issue with our router but given they re-established (and only after 10 mins) more likely to be something on the exchange
[10:51:45] agreed not much we can do if it remains stable
[10:52:48] RESOLVED: [2x] PuppetZeroResources: Puppet has failed generate resources on idp1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[11:05:23] XioNoX | elukey: same "configuring puppet-agent" dialog on the next reimage...but I don't see anything interesting in the log tab
[11:08:48] https://phabricator.wikimedia.org/P71025
[11:10:07] jayme: thx, indeed...
[11:10:21] we might be able to add more verbose logging https://serverfault.com/a/1122050
[11:34:25] FIRING: SystemdUnitFailed: prometheus-ganeti-exporter.service on ganeti1052:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:39:25] RESOLVED: [2x] SystemdUnitFailed: prometheus-ganeti-exporter.service on ganeti1051:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:18:44] FIRING: [2x] NetboxAccounting: Netbox - Accounting job failed - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - TODO - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[14:28:44] RESOLVED: [2x] NetboxAccounting: Netbox - Accounting job failed - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - TODO - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[15:20:15] Okay, so the new Netbox alerting works.
[15:20:38] nice!
[15:21:12] Instead of "TODO" do we want to use: https://netbox.wikimedia.org/extras/scripts/ as a dashboard?
[15:21:30] Or does it make sense to have one in Grafana, I suspect not.
[15:22:27] slyngs: for runbook we should have something on the wiki
[15:23:22] Sure, then move the jobs link to Dashboard, then people should have everything they need.
[15:23:30] slyngs: https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert
[15:23:46] then we can update the page with runbooks or further links
[15:39:27] Like so: https://gerrit.wikimedia.org/r/c/operations/alerts/+/1090875
[15:42:02] good for me!
[15:58:21] We have a check_netbox_puppetdb_virtual but it doesn't seem to be scheduled in Netbox
[15:58:51] https://netbox.wikimedia.org/extras/scripts/17/jobs/
[16:00:39] jayme, XioNoX - there was a setting left commented for the UEFI testing that caused the "Configured puppet-agent" dialog, the next reimages should be fine
[16:01:13] elukey: cool, thanks (cc claime)
[16:01:52] I didn't run into that issue surprisingly, but needed --force-dhcp-tftp for all of them
[16:02:53] I had a chat with Riccardo earlier on, he suggested to add code to reimage that automatically inspects the NICs and if the 10G ones are available, it should set tftp only by itself
[16:03:13] there was also a task from Brian proposing to make it enabled for all the reimages
[16:03:29] we may need to get a decision on this, so people don't get confused now that we have a fix
[16:05:11] I vote for autodetection if it's not too complex to get
[16:05:37] I'm +1 for that approach too if it's achievable
[16:05:58] also while you're in there see if the api exposes the number of functions and the ports on each function :P
[16:06:16] functions?
[16:07:29] I ask as it's in the ID_NET_NAME_PATH way of linux setting network device names
[16:07:39] it's a pcie concept whereby one card can have multiple "functions"
[16:07:46] originally like a graphics and sound card in one
[16:08:04] more recently with things like SR-IOV on networking it can be used for fancy network stuff
[16:08:26] the broadcom 2-port cards we use have two functions, and each function has a single port
[16:09:07] which gives us the "f0np0" and "f1np1" at the end of some net names
[16:09:18] ahhh those functions
[16:10:23] anyway don't mind me let's not complicate any other work
[16:24:18] netops, Infrastructure-Foundations, procurement, SRE: Decom prod infra side of the ulsfo-office link - https://phabricator.wikimedia.org/T379778 (ayounsi) NEW
[16:33:12] netops, Infrastructure-Foundations, procurement, SRE: Decom prod infra side of the ulsfo-office link - https://phabricator.wikimedia.org/T379778#10317697 (ayounsi)
[16:47:06] netops, Infrastructure-Foundations, procurement, SRE, Patch-For-Review: Decom prod infra side of the ulsfo-office link - https://phabricator.wikimedia.org/T379778#10317783 (RobH)
[16:50:07] netops, Infrastructure-Foundations, procurement, SRE, Patch-For-Review: Decom prod infra side of the ulsfo-office link - https://phabricator.wikimedia.org/T379778#10317801 (RobH) I'll have the xconnect disconnected by remote hands during the cross-connect disconnection, putting in the cross c...
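Note on the "f0np0" / "f1np1" suffixes from the 16:07 to 16:09 exchange: the rough sketch below, which is not taken from any tooling mentioned in this log, splits a path-based interface name into its PCI fields. It assumes the layout documented for systemd predictable names, `[P<domain>]p<bus>s<slot>[f<function>][n<phys_port_name>|d<dev_port>]` after a two-letter prefix such as `en`. The function name `parse_path_name` and the interface names in the usage note are made up for illustration.

```python
import re
from typing import Optional

# PCI part of a path-based interface name, assuming the systemd layout:
#   [P<domain>]p<bus>s<slot>[f<function>][n<phys_port_name>|d<dev_port>]
_PCI_PATH = re.compile(
    r"^(?:P(?P<domain>\d+))?"
    r"p(?P<bus>\d+)"
    r"s(?P<slot>\d+)"
    r"(?:f(?P<function>\d+))?"
    r"(?:n(?P<phys_port_name>[A-Za-z0-9]+)|d(?P<dev_port>\d+))?$"
)


def parse_path_name(ifname: str) -> Optional[dict]:
    """Split a name like 'enp59s0f0np0' into its PCI fields.

    Hypothetical helper: returns None when the name does not follow the
    PCI path layout (for example onboard-index names such as 'eno1').
    """
    if len(ifname) < 3:
        return None
    match = _PCI_PATH.match(ifname[2:])  # drop the two-letter prefix ('en', 'wl', ...)
    if match is None:
        return None
    fields = match.groupdict()  # fields absent from the name are None
    fields["prefix"] = ifname[:2]
    return fields
```

With made-up names, `parse_path_name("enp59s0f0np0")` yields function `"0"` with phys_port_name `"p0"`, and `parse_path_name("enp59s0f1np1")` yields function `"1"` with phys_port_name `"p1"`, which is the one-port-per-function layout described for the two-port cards.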
[17:05:32] elukey: still seeing "Configured puppet-agent" dialog
[17:24:56] jayme: maybe jhathaway is testing, let's see
[17:26:59] elukey: yes indeed
[17:27:32] I was about to say, puppet disabled on apt1002 :D
[17:29:16] :D
[17:39:51] netops, Infrastructure-Foundations, serviceops, Kubernetes: Reimage one of the wikikube-worker1240 to wikikube-worker1304 node in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790 (akosiaris) NEW
[18:26:47] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10318425 (cmooney) Open→Resolved @Jclark-ctr I've erased the config on all the old devices no...
[18:28:59] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: Q1:eqiad:frack network upgrade tracking task - https://phabricator.wikimedia.org/T371435#10318437 (cmooney) @robh the migration work is now done, all that remains is to remove the old devices and any cables connecting...
[18:36:38] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: Q1:eqiad:frack network upgrade tracking task - https://phabricator.wikimedia.org/T371435#10318507 (RobH) a:Jclark-ctr I'd hand this over to either John or Valerie as ops-eqiad for them to remove any devices and ca...
[18:41:23] netops, Infrastructure-Foundations, serviceops, Kubernetes: Reimage one of the wikikube-worker1240 to wikikube-worker1304 node in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790#10318533 (cmooney) Polling Netbox to find what switch each of those are connec...