[01:48:05] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11466974 (Papaul) a:Papaul→ayounsi @ayounsi assigned back to you since you are working on it. thanks
[04:51:29] netops, Infrastructure-Foundations, SRE: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11467201 (Papaul) I took a quick look at this before getting the support ticket going on. On lsw1-e2-codfw we have ` Frame length statistics for m...
[08:58:04] netops, Infrastructure-Foundations, SRE: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11467826 (ayounsi) My guess is that SR-Linux < 25 doesn't have stats for mgmt0 (either not implemented yet or a bug), with the upgrade we've started...
[10:25:40] netops, Infrastructure-Foundations, Observability-Alerting, SRE: Improve port-utilisation alerting to take QoS into account - https://phabricator.wikimedia.org/T384052#11468080 (ayounsi) We can set the rule now as non-paging to start collecting data and test it. So we can gain trust in it before...
[11:40:36] netops, Infrastructure-Foundations, Observability-Alerting, SRE: Improve port-utilisation alerting to take QoS into account - https://phabricator.wikimedia.org/T384052#11468298 (fgiunchedi) >>! In T384052#11462541, @cmooney wrote: > > https://grafana.wikimedia.org/goto/YOk1qBMDg > > In terms of...
[14:20:22] moritzm: thanks for the quick review on that patch!
[14:20:36] I'll re-test this afternoon, probably it won't make a difference and we can revert
[14:21:24] yeah, but it's worth a shot
[14:32:26] netops, Infrastructure-Foundations, Observability-Alerting, SRE: Improve port-utilisation alerting to take QoS into account - https://phabricator.wikimedia.org/T384052#11468772 (cmooney) >>! In T384052#11468080, @ayounsi wrote: > We can set the rule now as non-paging to start collecting data and...
[15:00:57] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, and 2 others: ULSFO: switch refresh - https://phabricator.wikimedia.org/T408510#11468976 (ayounsi)
[15:29:35] moritzm: that timeout increase didn't seem to work, I updated the task there, it still seems to be waiting only 3 seconds
[15:29:54] it does seem to fetch the var we tried to set, not sure if you've any ideas
[15:33:09] let me poke around in netcfg
[15:34:54] thanks, don't worry about it too much <3
[15:42:44] the variable seems fine, but I'm wondering if the syntax is at fault, the rest of the settings in trixie.cfg is tab-separated, while this one is whitespace-separated
[15:43:48] not sure if it really makes a difference, but we could try https://gerrit.wikimedia.org/r/1219174
[15:44:22] the value read from the variable gets internally multiplied by 4, so this would be two minutes in practice
[15:46:08] moritzm: yeah could be, worth a shot anyway
[15:46:22] +1 for the patch, I'll merge and retry in a little while
[15:47:31] sounds good
[15:47:50] if that also fails, the only thing I can imagine is some bug in detecting the link itself
[15:48:09] d-i doesn't use ethtool, but some internal, minimised implementation called ethtool_lite
[15:48:31] although I'd be surprised if something as mundane as detecting a link could fail
[15:50:58] yeah not sure, I can't seem to find that in the busybox shell but you see it in the logs
[15:51:14] one thing I do notice, if I enable the link manually when it comes up, is this in syslog:
[15:51:16] Dec 17 15:21:18 kernel: [ 220.831297] tg3 0000:01:00.0 eno3: Link is up at 1000 Mbps, full duplex
[15:51:49] I _don't_ see that in the log at the time the installer is trying to bring the link up itself, which maybe is evidence the link genuinely is not coming up
[15:57:35] but when you brought it up manually, that was with the standard ip tooling, right?
[15:59:37] it's annoying that d-i is still stuck on the original design assumption that d-i needs to be tiny, simply using the default tooling instead of busybox and udebs would avoid so many traps and corner cases
[16:09:13] Mail, Infrastructure-Foundations, MediaWiki-Email, MediaWiki-extensions-EmailAuth, and 3 others: Could not send confirmation email: Unknown error in PHP's mail() function. - https://phabricator.wikimedia.org/T383047#11469444 (pmiazga) This is still happening, ~3.3k per in last 4 weeks: {F71105510...
[16:16:39] moritzm have you ever tried https://fai-project.org/ ? That's what the Debian cloud team uses to build its images. I'm gonna try and set it up on my homelab when I have time
[16:17:26] university of Cologne represents
[16:18:45] The demo looks really good, but what demo doesn't ;P?
[16:19:01] I haven't used it myself
[16:19:28] but Thomas Lange, the main author knows what he's doing, I'm sure it's also solid in practice
[16:19:29] was in a talk about FAI what feels like 20 years ago. because it comes from this professor in Cologne and I used to hang out at their data center
[16:19:37] +1
[16:21:28] Netbox v4.5.0 beta 1 released - https://github.com/netbox-community/netbox/releases/tag/v4.5.0-beta1
[16:21:57] main useful feature for us, maybe, is the new "owner" feature : https://github.com/netbox-community/netbox/issues/20304
[17:06:07] netops, Infrastructure-Foundations, SRE: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11469757 (Papaul) Ticket 05304338 has been submitted with Nokia
[17:33:38] netops, Infrastructure-Foundations, SRE: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11469916 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=4ac5ae06-34f5-425c-b0df-bc77a3758cd3) set by cmooney@cumin1003 for 2:00:0...
[17:48:41] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[17:51:15] netops, Infrastructure-Foundations, SRE, Patch-For-Review: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11469994 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ec73e489-e95a-4824-ad67-a99943eae0e7) set by cmoone...
[17:51:43] netops, Infrastructure-Foundations, SRE, Patch-For-Review: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11470001 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=98bc0d0a-c3e1-4862-b66a-e386322de608) set by cmoone...
[18:15:18] netops, Infrastructure-Foundations, SRE, Patch-For-Review: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11470088 (cmooney) >>! In T412733#11467826, @ayounsi wrote: > My guess is that SR-Linux < 25 doesn't have stats for mgmt0 (eit...
[18:23:55] netops, Infrastructure-Foundations, SRE, Patch-For-Review: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11470106 (cmooney) @papaul lswtest-d8-eqiad is upgraded to v25.10.1 now for you. {F71107154 width=500}
[18:48:41] RESOLVED: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[18:54:47] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: mr1-codfw: add second uplink to lsw1-a2-codfw - https://phabricator.wikimedia.org/T410717#11470232 (Jhancock.wm) if we use 1G copper, we don't need to order anything. I can probably get it pre-ran tomorrow. Then papaul or I can conne...
[18:56:34] FIRING: DiskSpace: Disk space serpens:9100:/ 6.088% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=serpens - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[19:26:34] FIRING: DiskSpace: Disk space serpens:9100:/ 3.327% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=serpens - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[19:27:00] netops, Infrastructure-Foundations, SRE: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11470334 (Papaul) We are seeing the same error on lswtest-d8 in eqiad ` in-error-packets 2466 `
[20:18:53] Mail, Infrastructure-Foundations, MediaWiki-Email, MediaWiki-extensions-EmailAuth, and 3 others: Could not send confirmation email: Unknown error in PHP's mail() function. - https://phabricator.wikimedia.org/T383047#11470491 (Etonkovidova) Last 24 hours - 127 instances of `message: Could not s...
[20:32:25] FIRING: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:02:25] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:11:11] Mail, Infrastructure-Foundations, MediaWiki-Email, MediaWiki-extensions-EmailAuth, and 3 others: Could not send confirmation email: Unknown error in PHP's mail() function. - https://phabricator.wikimedia.org/T383047#11470891 (Tgr) >>! In T383047#11469444, @pmiazga wrote: > What would be the next...
[22:17:25] FIRING: [4x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:42:25] FIRING: [5x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:26:49] FIRING: DiskSpace: Disk space serpens:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=serpens - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace