[08:55:05] hey folks
[08:55:34] Hello other folk :-)
[08:55:44] after a disk-hot-swap retry on ms-be2088 it seems that a reboot is still needed to clear the controller's state :( https://phabricator.wikimedia.org/T384003#10575652
[08:56:46] sigh
[08:59:46] Not much point in the disk being hot-swappable then
[09:01:39] depends on its temperature
[09:01:45] it can still be hot and swappable :D
[09:02:14] Lukewarm swappable disks
[09:07:53] I asked for a quote for a different controller, we'll see
[09:11:28] Get a good one.
[09:24:48] netops, Infrastructure-Foundations, ops-magru, SRE: cr2-magru errors on xe-0/1/0 (EdgeUno Transit) - https://phabricator.wikimedia.org/T387006#10578164 (ayounsi) Open→Resolved a:ayounsi No more errors.
[09:45:55] FIRING: MaxConntrack: Max conntrack at 81.76% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[09:50:09] netops, Infrastructure-Foundations, observability, Prod-Kubernetes, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10578273 (JMeybohm) >>! In T384731#10566953, @fgiunchedi wrote: >>>! In T384731#10563685, @ayounsi wr...
[10:00:55] RESOLVED: MaxConntrack: Max conntrack at 81.68% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[11:02:37] netops, Infrastructure-Foundations, SRE: Gaps in gNMI network statistics in eqiad - https://phabricator.wikimedia.org/T386807#10578486 (cmooney) Open→Resolved Gonna close this one at this point. All has been ok in eqiad and codfw since the increase in thread count last week - gaps are no...
[12:20:13] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10578641 (cmooney) >>! In T385217#10572967, @cmooney wrote: > DC-Ops folks Nokia recommend trying to interrupt the grub bootlo...
[13:45:58] netops, Infrastructure-Foundations, SRE: gNMIc connection not working for cloudsw2-d5-eqiad - https://phabricator.wikimedia.org/T387018#10578920 (cmooney) >>! In T387018#10574426, @ayounsi wrote: > Enabling traceoptions shows a `no shared cipher` error on the switch : > ` > Feb 24 09:33:58 ssl_transp...
[14:24:25] netops, Infrastructure-Foundations, observability, Prod-Kubernetes, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10579051 (fgiunchedi) >>! In T384731#10578273, @JMeybohm wrote: >>>! In T384731#10566953, @fgiunchedi...
[14:44:40] netops, Infrastructure-Foundations: BGP peers with missing descriptions - https://phabricator.wikimedia.org/T387220 (ayounsi) NEW p:Triage→Low
[14:48:43] netops, Infrastructure-Foundations, observability, Prod-Kubernetes, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10579181 (ayounsi) >> And what happens if peer_descr is missing or empty ? > good question, in that c...
[14:54:06] elukey: we got another puppet5 case from vgutierrez : https://www.irccloud.com/pastebin/hrOmdOxR/
[14:54:21] did you manage to look at the one signaled by dcops yesterday?
[14:55:38] nope
[14:55:58] that's weird, clearly something has changed recently
[14:56:42] that's for lvs7003
[14:57:35] should we find something like "The cookbook is now forcing Puppet" in the logs?
I can't find anything on both cumin nodes for reimage
[14:58:01] I'm assuming that cookbook will follow the user input
[14:58:02] in the dcops case, there may be a race condition somewhere in executing the cookbook multiple times (maybe failures, etc..)
[14:58:27] vgutierrez: yes yes :)
[14:58:48] my goal is to figure out if the dcops use case is similar, namely if they got to the same point, or not
[14:58:59] my suspicion is that the answer is no
[15:00:32] elukey: I got a suspicion
[15:00:42] the output of the hiera lookup is:
[15:00:43] WARNING: unknown net_driver for interface ens3f0np0
[15:00:43] true
[15:00:54] hence spicerack is badly parsing it due to the warning
[15:03:54] but I'm not 100% sure, it's a guess
[15:04:50] yep I had the same suspicion as well the last time
[15:05:08] ah right we also added the extra log for what was returned right?
[15:05:32] totally forgot about it
[15:06:12] we should have "logger.info("Lookup result for force_puppet7: %s", has_puppet7)"
[15:07:57] yepppp
[15:07:58] 2025-02-25 14:49:49,695 vgutierrez 2873743 [INFO reimage.py:264 in _get_puppet_server] Lookup result for force_puppet7: Notice: Scope(Profile::Lvs::Interface_tweaks[ens3f0np0]): WARNING: unknown net_driver for interface ens3f0np0
[15:08:04] volans: --^
[15:08:12] confirmed
[15:08:32] yeah.. that returns true
[15:08:41] so forece_puppet7: true and cookbook decides to go with puppet5
[15:08:46] *force_puppet7
[15:08:56] vgutierrez: it's not true, it's notice... warning... true :D
[15:09:12] exactly
[15:09:14] but yeah we can fix it
[15:09:16] and we match for "true"
[15:09:29] very interestingly
[15:09:31] 2025-02-21 19:17:28,807 jhancock 1910765 [INFO reimage.py:264 in _get_puppet_server] Lookup result for force_puppet7: true
[15:09:31] we could make it output a one-line json and always pick the last line
[15:09:42] yeah I just grepped on both hosts, this is the only case
[15:09:46] so maybe in those cases it is not "true"? Maybe spaces?
[15:09:55] unlikely
[15:10:04] okok
[15:12:04] interestingly enough we do already:
[15:12:04] volans: then we have two corner cases, force_puppet7 not getting "true" and possibly another race condition that dcops hits
[15:12:04] command = f"puppet lookup --render-as {fmt} --compile --node {fqdn} {key} 2>/dev/null"
[15:12:25] so puppet prints that to stdout?!?!
[15:12:33] I wouldn't be surprised :D
[15:12:59] I need to find a host that has the same error
[15:13:17] json doesn't help, it just prints true
[16:35:37] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10579756 (cmooney) Myself and Jenn went on a call with Brooke, Saju and some of the other Nokia technical folks. They couldn'...
[16:38:41] netops, Infrastructure-Foundations, Traffic: eqiad/esams/drmrs LVS: use Netbox BGP flag - https://phabricator.wikimedia.org/T380469#10579766 (ayounsi) Open→Resolved All done ! There was no diff, as expected in the best case scenario.
[18:19:43] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10580081 (cmooney) Ok all devices are back online and reachable via SSH, all running SR Linux v24.7.2. Tomorrow I'll try to f...
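
Note on the force_puppet7 lookup discussed between 15:00 and 15:13: the idea floated there ("output a one-line json and always pick the last line") could be sketched roughly as below. This is only an illustration, not the actual spicerack or cookbook code. The function name `parse_lookup_output` is hypothetical, and the sketch assumes the lookup is run with JSON rendering and that the looked-up value arrives as the final non-empty line of stdout, after any Notice/WARNING lines.

```python
import json


def parse_lookup_output(raw: str):
    """Return the value found on the last non-empty line of the lookup output.

    Hypothetical helper: assumes the value is JSON on the final line and that
    any earlier lines (e.g. "Notice: ... WARNING: ...") are noise to ignore.
    """
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty lookup output")
    return json.loads(lines[-1])


# Output shaped like the lvs7003 case from the log: a notice, then the value.
noisy = (
    "Notice: Scope(Profile::Lvs::Interface_tweaks[ens3f0np0]): "
    "WARNING: unknown net_driver for interface ens3f0np0\n"
    "true\n"
)
has_puppet7 = parse_lookup_output(noisy) is True
```

Comparing the parsed value with `is True` rather than matching the whole output against the string "true" keeps a leading notice from turning a true result into a mismatch. A multi-line value would need different handling.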