[00:00:48] so the changes under `modules/profile/files` are all noop for that reason, and the changes under `modules/profile/manifests` are Puppet comments so they're noop too
[00:01:08] if there were templates you'd see a diff there, but there don't happen to be any
[00:03:55] (that is, you'd see a diff if the header were added as `# SPDX...` so that it passed through into the output, as with the changes under files/ -- but in practice, since they're being added as template comments `<%#- SPDX... -%>` there's no diff in the output there either)
[00:18:38] I hadn't noticed that behavior before with `source =>`. Thanks for the explanation, rzl!
[05:07:52] <_joe_> ori: no one ever needed more than one override file
[05:08:15] <_joe_> but yes, we can add a systemd::override type if we want to add multiple
[06:40:41] There seem to have been a lot of mgmt alerts last night
[06:41:05] 05:39:45 PROBLEM - DNS on elastic1085.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.1.222 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:41:19] Seems to be the main not have recovered
[07:09:45] hello folks
[07:09:56] rack B7 in eqiad seems to be having trouble with the mgmt interfaces
[07:10:35] do you know if there is any ongoing work/issue?
[07:12:47] there are also other mgmt alerts like "DNS CRITICAL - expected '0.0.0.0' but got '10.65.3.130'"
[07:12:54] (this one for kafka-main1002)
[07:15:53] mmmmmm
[07:22:46] the b7 interface on msw1 looks up
[07:33:28] lemme know if you have ideas about how to debug it, otherwise I can open a task for netops/dcops
[07:34:20] the interfaces are not reachable from cumin1001 afaics, but so far I haven't found a clear sign of a problem
[08:22:11] elukey: thanks for the heads up
[08:22:23] Yeah we are not learning any MAC addresses from the management switch in B7
[08:22:26] https://phabricator.wikimedia.org/P35416
[08:23:11] The port from msw1-eqiad to it is "up", so doesn't look like it's suffered a total power loss, but we see zero bits incoming
[08:23:25] I'll file a ticket for DC-ops to investigate, reboot will likely sort it
[08:23:56] topranks: thanks! I always forget to check the MAC addresses :D
[08:24:37] need to put a note on my desk, every time it is arp or similar
[08:24:40] np... it seems to have dropped at ~01:40 UTC from the graph
[08:24:42] https://librenms.wikimedia.org/graphs/to=1665562800/id=2961/type=port_bits/from=1665476400/
[08:24:47] haha yeah :)
[08:28:00] elukey: it's always DNS unless it's MTU
[08:28:19] XioNoX: +1
[08:45:13] topranks: do you have the task handy? I'll ack the icinga alerts
[08:46:04] ah yeah should have occurred to me to do that
[08:46:05] https://phabricator.wikimedia.org/T320598
[08:54:58] would this affect traffic to/from VMs in any way? (I'm seeing one VM unreachable through ssh, triggered a few min before)
[08:55:34] dcaro: no it shouldn't, that's for physical servers' (iDRAC) management interfaces
[08:55:54] I guessed so, but better be sure xd, thanks!
[08:55:59] np!
[12:15:55] mark: amazing story of preparing for World IPv6 day in less than a week!!! wow :)
[12:16:16] :)
[12:16:23] we call it IPv6 Launch Week internally ;)
[12:17:02] we were on squid at the time, which had no ipv6 support
[12:17:05] pybal had no ipv6 support
[12:17:10] servers had no ipv6 addresses yet, etc etc
[12:22:29] topranks: world ipv6 /launch/ day, because confusingly enough, "world ipv6 day" was the year before!
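(Aside on the AAAA-record checks that come up in the next part of the conversation: a minimal sketch, assuming nothing beyond the Python standard library, of how one might test whether a hostname publishes AAAA records. The hostnames are illustrative only, and the check goes through the local resolver, so it is a rough first pass rather than an authoritative or glue-record lookup.)

```python
#!/usr/bin/env python3
"""Minimal sketch: does a hostname resolve to any IPv6 (AAAA) address?

Hostnames below are illustrative only; swap in whatever you want to test.
"""
import socket


def has_ipv6(hostname: str) -> bool:
    """Return True if the local resolver finds at least one IPv6 address."""
    try:
        # Restricting getaddrinfo to AF_INET6 only yields results when the
        # name resolves to IPv6 addresses (note: DNS64 resolvers may
        # synthesize answers, so this is a hint, not a zone-file check).
        return len(socket.getaddrinfo(hostname, None, socket.AF_INET6)) > 0
    except socket.gaierror:
        return False


if __name__ == "__main__":
    for name in ("wikipedia.org", "ns0.wikimedia.org"):
        print(f"{name}: {'has' if has_ipv6(name) else 'no'} IPv6 result")
```

A proper glue-record check would query the parent zone's name servers directly rather than the local resolver; this only answers "does the name resolve over IPv6 from here".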
[12:22:37] good times
[12:23:03] that's incredible
[12:23:27] I think I remember that, World IPv6 Day lots of big players published AAAA records for 1 day right?
[12:23:29] To test it?
[12:23:40] Then a year later switched it on permanently?
[12:23:41] yes
[12:23:42] yes
[12:24:08] so you did it all, without even the test day, in less than a week :)
[12:24:52] speaking of
[12:24:59] we still haven't published AAAA glue records
[12:25:42] originally on purpose, given that the geoip information for ipv6 was going to be less accurate
[12:25:47] these days it may finally make sense though!
[12:26:08] hmm interesting, yeah only last week I noticed there were no AAAA records for nsX.wikimedia.org
[12:26:36] but good to know the reason, I don't have insight into geo-accuracy but for sure one to discuss with traffic
[12:35:52] the IPv6 test day, the year before, we couldn't do in time because it turned out our mediawiki installs weren't ready and needed schema changes to support ipv6 addresses
[12:35:59] and that would take weeks/months
[12:36:07] so that work did start then, but very little else
[12:36:18] and then the year after, I described
[12:37:52] looks like Mozilla went backward, there are no v6 records anymore
[12:38:50] clearly regressing since your departure Arzhel :)
[12:40:25] hahaha
[13:30:48] has anybody done anything from cumin2002 to cloudvirt nodes ~1h ago? There was an ssh session to some of them disconnecting with an error, and a service started failing almost right after (not sure if they are related, just trying to figure out what happened for the service to go down)
[13:31:26] the task I'm investigating is T320630 (for more details)
[13:31:27] T320630: multiple cloudvirts: systemd-machined systemd unit failed - https://phabricator.wikimedia.org/T320630
[13:31:58] it did not cause any outages, but it's a really weird behavior :/ (a bit troubling if it affects a different service at some point)
[13:32:30] it was over ip6 btw
[13:32:47] dcaro: I rolled out the dbus security updates around that time. dbus doesn't get restarted at run time (it can only really restart upon boot), but maybe there's some side effect for the cloudvirt/Openstack setup?
[13:33:21] dbus is at least sufficiently low level to cause a ripple effect towards systemd-machined
[13:33:39] maybe, it was systemd-machined that ended up timing out, so maybe it disconnected from dbus and was not able to connect
[13:33:46] ?
[13:34:01] maybe, let me have a look at journalctl on an affected node
[13:34:07] 👍
[13:40:38] Emperor, would you be willing to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/828664 ?
[13:41:21] zabe: do you and theresnotime lack +2 access?
[13:41:37] yes
[13:41:41] (yup)
[13:41:43] Ah, OK.
[13:41:58] * Emperor will go Do The Thing
[13:42:01] (fine by me, puppet is scary)
[13:42:58] zabe / TheresNoTime: done.
[13:44:05] Emperor, while you are already on it, could you also merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/836953 ?
[13:45:15] Thanks :)
[13:45:20] zabe: I go, and 'tis done
[13:53:37] dcaro: I added my findings to the Phab task, but I'm afraid I don't have a good suggestion on how to fix this
[13:53:47] 👍 thanks!
[13:57:35] in general dbus updates are rare and this did not seem to have a functional impact besides the alerts, so the next time we need to update dbus I'll give WMCS folks a headsup before the rollout
[13:58:22] this also seems to have been introduced with some openstack changes either between buster/bullseye or via newer openstack releases, since I'm very sure we haven't seen that before when updating dbus
[13:59:34] yep, I think that it's not a big issue yes, though we are looking a bit more closely at alerts too, so maybe it just went unnoticed. In any case, the ping beforehand will be appreciated :)
[14:00:32] will do :-)
[16:15:04] I've got some small pcaps ( < 1 MB) that don't contain sensitive information, what is the best place to upload those so others can see?
[16:15:56] inflatador: maybe people.wikimedia.org
[16:16:13] https://wikitech.wikimedia.org/wiki/People.wikimedia.org
[16:17:33] TheresNoTime: of course let me know if you need help with PeeringDB
[16:18:14] XioNoX: :D I was going to say, I'm already on at https://www.peeringdb.com/net/30909 (though it could probably do with an update..)
[16:19:15] XioNoX ACK, thanks
[16:19:32] TheresNoTime: ah yeah I carried over a typo, that's why I couldn't fine it
[16:19:35] find*
[16:20:08] So you would need to be present at one of the same "Public Peering Exchange Points" as us
[16:22:22] so I'm "at" France-IX via another provider, though I'm not sure if that's the same as ^ (please do excuse daft questions, I've had an ASN for all of a month and BGP is horrible)
[16:22:53] just please don't crash the internet
[16:23:11] I've been trying, no joy yet! :P
[16:24:15] TheresNoTime: not sure I understand what you mean by "I'm "at" France-IX via another provider"
[16:25:43] XioNoX: let me take another look a moment
[16:32:18] urgh, no, I have a VPS at France-IX (via https://www.virtua.cloud/our-infrastructure/our-network) and thought that meant I could peer with y'all at France-IX ... seems I was incorrect :(
[16:35:15] yeah I guess you have a VPS in France-IX's facility, which is a little different than what peering is, even though the name+location are lined up :)
[16:35:39] we'd have to peer at France-IX with France-IX's own VPS network, whatever that is, for that to work.
[16:36:03] damn
[16:37:06] you might be in "France-IX Services"? I donno: https://www.peeringdb.com/net/8267
[16:37:32] could probably reverse engineer the situation based on your VPS's public IP
[16:42:10] it's in 185.10.16.0/23, so https://www.peeringdb.com/net/19185
[16:52:00] ah yeah
[16:52:10] so they're present at France-IX Paris, but not Marseille
[16:52:41] we're at Marseille and not Paris
[16:52:52] https://www.peeringdb.com/net/1365
[16:55:28] Oh well, will have to park that geek cred idea for another day
[16:55:45] Appreciate y'all taking a look though :)
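(On the "it's in 185.10.16.0/23" step above: a minimal sketch, using only Python's `ipaddress` module, of the containment check behind reverse-engineering which announced prefix a public address belongs to. The sample address is hypothetical; only the /23 prefix comes from the discussion above.)

```python
#!/usr/bin/env python3
"""Minimal sketch: which known prefix does a public IP fall into?

The sample address is made up for illustration; the /23 is the prefix
mentioned in the log above.
"""
import ipaddress

# Candidate prefixes you already know about (e.g. from whois or PeeringDB).
KNOWN_PREFIXES = [
    ipaddress.ip_network("185.10.16.0/23"),
]


def matching_prefixes(addr: str):
    """Yield every known prefix that contains the given address."""
    ip = ipaddress.ip_address(addr)
    for prefix in KNOWN_PREFIXES:
        # Containment only makes sense within the same address family.
        if ip.version == prefix.version and ip in prefix:
            yield prefix


if __name__ == "__main__":
    sample = "185.10.17.42"  # hypothetical VPS address inside the /23
    for prefix in matching_prefixes(sample):
        print(f"{sample} is inside {prefix}")
```

In practice the candidate prefixes would come from whois or PeeringDB lookups rather than being hard-coded; the sketch only shows the containment test itself.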