[08:50:35] homer, Infrastructure-Foundations: Homer failure on port speed change - https://phabricator.wikimedia.org/T380147 (ayounsi) NEW
[08:51:20] homer, Infrastructure-Foundations: Homer failure on port speed change - https://phabricator.wikimedia.org/T380147#10330122 (ayounsi)
[10:17:48] Mail, collaboration-services, Infrastructure-Foundations, SRE, and 2 others: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10330345 (eoghan) a:eoghan
[11:06:28] netops, Infrastructure-Foundations, serviceops, Kubernetes: Reimage one of the wikikube-worker1240 to wikikube-worker1304 node in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790#10330555 (JMeybohm) Beware of {T380142}
[11:41:22] netops, Infrastructure-Foundations, serviceops, Kubernetes: Reimage one of the wikikube-worker1240 to wikikube-worker1304 node in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790#10330660 (cmooney) >>! In T379790#10322697, @akosiaris wrote: > Cool, thanks....
[12:12:28] jhathaway: o/
[12:12:38] when you are online, thanos-be2005 is ready for reimage
[12:13:15] 24 JBOD drives etc..
[12:13:21] fingers crossed :D
[12:19:55] (resending) stupid question, how do I run a cookbook from another cookbook? I thought it'd just be a `cookbook_instance.get_runner().run()` or something like that, but I don't see such a thing anywhere in the code
[12:22:19] kamila_: https://doc.wikimedia.org/spicerack/master/api/index.html#spicerack.Spicerack.run_cookbook
[12:22:53] it's the caller's responsibility to check the exit code and decide what to do with that
[12:22:54] oh, thank you volans, not sure why I missed that '^^
[12:23:06] appreciated <3
[12:23:16] anytime :)
[13:14:20] volans, XioNoX: I was going to restore a recent live netbox db backup to netbox-next unless there is any objection?
[13:14:31] topranks: go for it
[13:14:34] I want to get the current frack devices in there for some testing
[13:14:36] ok
[13:15:11] go for it
[13:15:38] cool, thx
[13:47:57] CAS-SSO, Infrastructure-Foundations, SRE: Upgrade IDPs to CAS 6.6/Bullseye and enable webauthn - https://phabricator.wikimedia.org/T305518#10331270 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff This old task can be closed, the update to CAS 6.6 was resolved with T311235 and th...
[13:51:08] volans: I'm tripping up on resetting the netbox password in the restored db
[13:51:23] wikitech says just to run puppet which should correct it but it doesn't seem to be doing so
[13:51:35] these are the kind of logs I'm seeing
[13:51:36] [2024-11-18T13:50:21] django.db.utils.OperationalError: connection failed: connection to server at "10.192.48.10", port 5432 failed: FATAL: password authentication failed for user "netbox"
[13:52:49] topranks: give me 2 and I'll have a look
[13:52:57] no rush at all
[13:54:05] CAS-SSO, Infrastructure-Foundations, SRE: Adapt WMF theming for webauthn - https://phabricator.wikimedia.org/T380172 (MoritzMuehlenhoff) NEW
[13:54:18] CAS-SSO, Infrastructure-Foundations, SRE: Adapt WMF theming for webauthn - https://phabricator.wikimedia.org/T380172#10331304 (MoritzMuehlenhoff) p:Triage→Medium
[13:58:16] CAS-SSO, Infrastructure-Foundations, SRE: Select data store for webauthn devices - https://phabricator.wikimedia.org/T380173 (MoritzMuehlenhoff) NEW
[13:58:22] CAS-SSO, Infrastructure-Foundations, SRE: Select data store for webauthn devices - https://phabricator.wikimedia.org/T380173#10331325 (MoritzMuehlenhoff) p:Triage→Medium
[14:01:38] in the past I had to re-set the permissions or password of the netbox user in postgres, probably the same here
[14:02:53] yeah
[14:03:18] looking at puppet it does use "postgresql::user" and so in theory the instructions to just run puppet are correct
[14:03:48] but perhaps if the
user exists it doesn't overwrite the existing password, it will only add if the user is missing
[14:15:47] for the record resolved it by manually changing the password on it in postgres
[14:24:18] topranks: sorry was stuck in some git rebase
[14:24:33] glad you solved it, dunno why puppet run doesn't do it anymore, maybe we changed something in puppet along the way
[14:24:42] volans: np, just glad to hear you made it through the rebase :)
[14:24:55] yeah I'm not sure, should I update the netbox doc on wikitech to say to do it manually?
[14:25:08] probably? even better if you get why :D
[14:25:09] puppetcode has the user def, but it doesn't seem to be resetting the password
[14:25:16] which file?
[14:25:27] in puppet
[14:25:35] modules/profile/manifests/netbox/db.pp
[14:26:34] I don't see any recent change
[14:26:35] related
[14:31:11] CAS-SSO, Infrastructure-Foundations, SRE: Select optin method for webauthn - https://phabricator.wikimedia.org/T380178 (MoritzMuehlenhoff) NEW
[14:31:53] topranks: +1 to mention it in the doc for now until we find the root cause
[14:32:46] I've updated the upgrade section of the doc a lot, including a clearer "testing" section
[14:32:51] CAS-SSO, Infrastructure-Foundations, SRE: Select opt-in method for webauthn - https://phabricator.wikimedia.org/T380178#10331471 (MoritzMuehlenhoff) p:Triage→Medium
[14:33:53] CAS-SSO, Infrastructure-Foundations, SRE: Evaluate supported for trusted devices - https://phabricator.wikimedia.org/T380179 (MoritzMuehlenhoff) NEW
[14:37:07] cool thanks I'll add a note now about it anyway
[14:40:57] CAS-SSO, Infrastructure-Foundations, SRE: Registry of multiple webauthn devices - https://phabricator.wikimedia.org/T380180 (MoritzMuehlenhoff) NEW
[14:45:47] topranks: o/ when you have a moment do you mind checking the ranges in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1091597 ?
Just to make sure I haven't made wrong assumptions
[14:45:51] not urgent, anytime
[14:46:43] (the registry is behind the CDN and it uses nginx for TLS termination etc..)
[14:53:07] elukey: the idea here is to capture all our internal address space?
[14:53:32] topranks: basically yes
[14:53:40] very coarse grain
[14:54:03] 10.0.0.0/8 is fine for all the internal IPv4 "private" networks
[14:54:15] we do have internal hosts on public networks as well though - not sure if they should be included?
[14:55:33] nono in theory those shouldn't be concerned, we could restrict to a few hosts but for the moment I don't want to limit too much
[14:55:48] Ok
[14:55:58] is there scope for the hosts involved?
[14:56:14] elukey: thanks, I'll give a re-image a try after our meeting
[14:57:07] topranks: what do you mean with scope in this case? (sorry just to avoid telling you something silly as first round :D)
[14:57:26] like say the location
[14:57:36] the ipv6 prefix you have there covers eqiad, codfw and ulsfo
[14:57:38] https://netbox.wikimedia.org/ipam/prefixes/?depth__lte=0&page=1&within_include=2620%3A0%3A860%3A%3A/46
[14:58:00] so as long as that covers the hosts you have in mind should be ok
[14:58:11] topranks: so all CDN hosts could in theory try to fetch /v2/_catalog proxying from outside, this is why I put it very broad
[14:58:20] perfect thanks :)
[14:58:28] ok right
[14:58:42] "cdn hosts" - so we need to include the POPs too
[15:02:17] elukey: maybe just duplicate it and keep it (manually) in sync with https://github.com/wikimedia/operations-puppet/blob/production/modules/network/data/data.yaml#L23 ?
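A minimal sketch of the coarse source-range check discussed above, assuming the only allowed prefixes are the two named in the conversation (10.0.0.0/8 and 2620:0:860::/46). The names `ALLOWED_SOURCES` and `is_allowed_source` are hypothetical and are not taken from the puppet patch, which may list other ranges (the POPs, for instance); this only illustrates which source addresses such a list would and would not match.

```python
import ipaddress

# Prefixes quoted in the discussion. This is an assumption: the real
# patch may contain more (or different) ranges.
ALLOWED_SOURCES = [
    ipaddress.ip_network("10.0.0.0/8"),        # internal IPv4 "private" space
    ipaddress.ip_network("2620:0:860::/46"),   # IPv6 prefix from the Netbox link
]


def is_allowed_source(addr: str) -> bool:
    """Return True if addr falls inside one of the allowed prefixes."""
    ip = ipaddress.ip_address(addr)
    # Compare only against networks of the same IP version.
    return any(
        ip.version == net.version and ip in net
        for net in ALLOWED_SOURCES
    )


if __name__ == "__main__":
    # 198.51.100.7 is a documentation address standing in for "a host
    # on a public IPv4 network", which this list does not cover.
    for candidate in ("10.192.48.10", "2620:0:861:1::1", "198.51.100.7"):
        print(candidate, is_allowed_source(candidate))
```

A /46 spans four consecutive /48s, so anything from 2620:0:860:: up to the end of 2620:0:863:: matches and 2620:0:864:: does not. Hosts with only a public IPv4 address fall outside both prefixes, which is the v4/v6 parity limitation raised later in the conversation.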
[15:05:32] CAS-SSO, Infrastructure-Foundations, SRE: Evaluate supported for trusted devices - https://phabricator.wikimedia.org/T380179#10331597 (MoritzMuehlenhoff) p:Triage→Medium
[15:05:44] CAS-SSO, Infrastructure-Foundations, SRE: Registry of multiple webauthn devices - https://phabricator.wikimedia.org/T380180#10331598 (MoritzMuehlenhoff) p:Triage→Medium
[15:43:44] netbox, netops, Infrastructure-Foundations, Patch-For-Review: Netbox: librenms report errors - https://phabricator.wikimedia.org/T379907#10331760 (joanna_borun) p:Triage→Medium
[15:48:51] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, and 2 others: Decom prod infra side of the ulsfo-office link - https://phabricator.wikimedia.org/T379778#10331780 (cmooney) p:Triage→Medium
[15:49:45] homer, Infrastructure-Foundations: Homer failure on port speed change - https://phabricator.wikimedia.org/T380147#10331786 (joanna_borun) p:Triage→Medium
[15:51:50] netops, Infrastructure-Foundations, serviceops, Kubernetes: Reimage one of the wikikube-worker1240 to wikikube-worker1304 node in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790#10331792 (cmooney) p:Triage→Medium
[16:03:04] XioNoX: to get back to the patch - I'd be inclined to put the fence in place as is, and then refactor it with you and Cathal to be more streamlined with puppet's codebase (it is not a way to dodge the bullet, I'll do it I promise :D)
[16:03:46] elukey: it depends on what needs to reach your service
[16:03:49] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, and 2 others: Decom prod infra side of the ulsfo-office link - https://phabricator.wikimedia.org/T379778#10331908 (RobH)
[16:05:00] XioNoX: for /v2/_catalog, only internal traffic, mostly build2001 and few other nodes
[16:05:16] there you don't capture all the WMF hosts, and some without v4/v6 parity (eg.
all of eqiad v6, but not eqiad hosts that are on public v4 IPs)
[16:05:50] yes yes Cathal mentioned, but no host with a public ipv4 needs to contact the registry
[16:05:58] better, needs to fetch its catalog
[16:06:04] ok!
[16:06:08] it is a very special use case, not a broad one
[16:06:18] but sadly one that triggers a scan of swift :(
[16:11:25] I think the patch as-is is ok from my point of view, as long as it covers all the source hosts you need
[16:12:37] as long as you're aware of the limitations, it's fine :)
[16:13:55] okok thanks :)
[16:14:32] I also had a quick look at the squid proxy logs and nothing seems to use it to reach the registry
[16:39:49] elukey: for thanos-be2005, do I need to configure the jbod disks before imaging?
[16:43:59] jhathaway: nono already done
[16:44:33] great, thanks
[16:50:54] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, and 2 others: Decom prod infra side of the ulsfo-office link - https://phabricator.wikimedia.org/T379778#10332241 (RobH) a:RobH→None
[16:53:20] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, and 2 others: Decom prod infra side of the ulsfo-office link - https://phabricator.wikimedia.org/T379778#10332236 (RobH) Open→Resolved a:RobH @wiki_willy: I just wanted to notify you of this task's resolution and you'll see th...
[16:54:56] elukey: ssh thanos-be2005.mgmt.codfw.wmnet, gives me a racadm prompt?
[17:01:38] * elukey segfaults
[17:02:33] :D
[17:03:36] ok something is horribly wrong
[17:04:04] https://netbox.wikimedia.org/dcim/devices/5635/ and https://phabricator.wikimedia.org/T370452 clearly state it is supermicro, just to avoid me getting totally mad
[17:04:19] does it connect to the wrong IP?
[17:04:32] I got a ssh key identification change
[17:05:41] the IP looks like the one in netbox https://netbox.wikimedia.org/ipam/ip-addresses/18534/
[17:06:01] I tried by ip as well, and I get a racadm prompt as well
[17:06:04] so the only thing that changed is that I managed to get provisioning working
[17:06:14] including the BMC network settings
[17:06:20] so they got fully applied
[17:06:39] maybe you upgraded the supermicro firmware with dell's ;P
[17:07:57] hmm, surprisingly you can't search in netbox by mgmt ip?
[17:09:06] ok I see
[17:09:07] Warning: the ECDSA host key for 'thanos-be2005.mgmt.codfw.wmnet' differs from the key for the IP address '10.193.2.10'
[17:11:57] maps-test2005 (WMF6748)
[17:12:02] is the host I land on
[17:12:50] though its mgmt ip is supposed to be 10.193.3.58
[17:13:39] https://phabricator.wikimedia.org/T380144
[17:13:48] so perhaps a provisioning error
[17:13:57] ok so this node is one of the testing hosts that are moving from ganeti decom to maps test, moritzm asked for those
[17:14:44] but this is something really horrible that is happening
[17:15:02] how could it be possible that we ended up with the same ip in two different places?
[17:15:42] I'm not familiar with how we provision the OOB ips, is it automated?
[17:17:39] in theory it should happen running https://netbox.wikimedia.org/extras/scripts/9/
[17:17:53] but it is netbox that picks up the IP, in theory
[17:18:30] nod, thanks
[17:20:30] the weird thing is that the maps-test2005 mgmt interface has a totally different IP https://netbox.wikimedia.org/dcim/devices/2143/interfaces/
[17:20:35] https://phabricator.wikimedia.org/rONEDeb5d0ff0361ad1ec8e4cdcfae2215d470dc2bd54
[17:20:40] how did you manage to find it was maps-test2005?
[17:20:46] ganeti2015, had the same mgmt ip
[17:21:02] i.e. the oob was not reprovisioned
[17:21:24] or the re-provisioning failed?
[17:21:47] ganeti2015 became maps-test2005
[17:22:17] https://netbox.wikimedia.org/dcim/devices/2143/changelog/
[17:22:26] perhaps we should factory reset the OOB controller as the last step in a decom
[17:23:06] maybe let's jump on #dcops to understand what they did
[17:23:13] sounds good
[17:29:21] elukey: do you think we should add factory resetting the OOB controller as part of decom? I can add a ticket to phab
[17:30:16] jhathaway: it never happened before, I'd like to understand what was done by dcops before.. resetting the OOB should be easy enough, but for example on Supermicro it doesn't clear the BMC network settings
[17:30:44] it is a "feature" that I still haven't fully got (maybe we could override the IPs, or set them to none, no idea)
[17:30:46] nod, yeah clearing the networking settings would be the key part, or we could set them to bogus values
[17:32:16] I'll cut a ticket, and add dcops
[17:40:31] ack!
[17:40:41] going afk, lemme know how it goes!
[17:41:19] will do, thanks
[18:09:27] Mail, collaboration-services, Infrastructure-Foundations, SRE, and 2 others: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10332939 (eoghan) We had a quick chat with ITS today where they disabled the change that caused the routing to change, an...
[20:18:56] Mail, collaboration-services, Infrastructure-Foundations, SRE, and 2 others: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10333580 (jhathaway) >>! In T380009#10332939, @eoghan wrote: > We had a quick chat with ITS today where they disabled the...
[20:23:23] Mail, collaboration-services, Infrastructure-Foundations, SRE, and 2 others: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10333590 (revi) >>! In T380009#10332939, @eoghan wrote: > We had a quick chat with ITS today where they disabled the chan...
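The thanos-be2005 mgmt confusion above comes down to a BMC still answering on an address that the inventory has since assigned to another host. A minimal sketch of how such a mismatch could be flagged, assuming two hypothetical mappings: `expected` (host to mgmt IP, as recorded in the inventory) and `observed` (host to the IP its BMC actually answers on). Neither the function name nor the data layout comes from Netbox or any existing tooling, and the sample values are only reconstructed from the conversation for illustration.

```python
from collections import defaultdict


def find_mgmt_conflicts(expected, observed):
    """Map each contested IP to (intended_host, hosts_wrongly_using_it).

    expected: host -> mgmt IP according to the inventory (hypothetical input)
    observed: host -> mgmt IP the BMC really answers on (hypothetical input)
    """
    # Who is supposed to own each IP according to the inventory.
    owner_by_ip = {ip: host for host, ip in expected.items()}

    # Group hosts by the IP they were actually seen on.
    hosts_by_ip = defaultdict(list)
    for host, ip in observed.items():
        hosts_by_ip[ip].append(host)

    conflicts = {}
    for ip, hosts in hosts_by_ip.items():
        intended = owner_by_ip.get(ip)
        # Any host on this IP other than the intended owner is stale.
        stale = sorted(h for h in hosts if h != intended)
        if stale:
            conflicts[ip] = (intended, stale)
    return conflicts


if __name__ == "__main__":
    # Illustrative values taken from the discussion, not from live data.
    expected = {"thanos-be2005": "10.193.2.10", "maps-test2005": "10.193.3.58"}
    observed = {"thanos-be2005": "10.193.2.10", "maps-test2005": "10.193.2.10"}
    print(find_mgmt_conflicts(expected, observed))
```

With the sample data this reports 10.193.2.10 as intended for thanos-be2005 but also answered by maps-test2005, which matches the stale-BMC theory (ganeti2015 renamed without its OOB settings being reprovisioned). How the `observed` mapping would be collected is left open here.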
[20:37:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:42:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:17:34] Mail, collaboration-services, Infrastructure-Foundations, SRE, and 2 others: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10333786 (eoghan) @jhathaway It was a rule set up to change the envelope-to of a mail from a given source. When we disabl...