[07:00:00] fabfur, sukhe: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/935479/1#message-27bb851f6b6fa829030f56b4510fbcf7fa9a0202
[07:12:52] Ok thanks!
[07:23:45] fabfur: your yubikey 5 handles EC keys perfectly fine
[07:24:20] Key attributes ...: ed25519 cv25519 ed25519
[07:30:59] <3
[07:38:47] (SystemdUnitFailed) firing: anycast-healthchecker.service Failed on durum2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:43:47] (SystemdUnitFailed) resolved: anycast-healthchecker.service Failed on durum2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:06:04] vgutierrez: yep, today I'll regenerate the key and open the CRs for puppet and Homer
[08:25:28] 10Traffic, 10SRE, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm)
[08:25:42] 10Traffic, 10SRE, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm)
[08:25:50] 10Traffic, 10SRE, 10envoy, 10serviceops, 10Patch-For-Review: Set a limit to the number of allowed active connections via runtime key overload.global_downstream_max_connections - https://phabricator.wikimedia.org/T340955 (10JMeybohm) 05Open→03Resolved
[08:25:58] 10Traffic, 10SRE, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm)
[08:26:22] 10Traffic, 10SRE, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm) 05Open→03Resolved This is done from our end.
[09:01:08] 10netops, 10Ganeti, 10Infrastructure-Foundations, 10SRE: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (10ayounsi) a:03ayounsi
[09:41:14] 10Traffic, 10Phabricator, 10SRE: Accessing Phabricator from Tor (some ranges blocked but not others) - https://phabricator.wikimedia.org/T254568 (10Aklapper) 05Open→03Resolved Optimistically resolving as T253632 is resolved. Please reopen if this is still an issue - thanks!
[11:35:37] 10Traffic, 10netops, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Management LAN in eqsin offline due to failure of mr1-eqsin - https://phabricator.wikimedia.org/T341447 (10cmooney) 05Open→03Resolved a:03cmooney Still stable so I will close this for now, if it re-occurs we can engage Juniper.
[13:16:55] XioNoX: apologies! that's on me as I reviewed and merged the patch; I did remember the deprecation of the RSA keys but I forgot the deadline in July (T336769)
[13:16:56] T336769: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769
[13:17:21] I see that fabfur has a new patch uploaded so happy to merge that
[13:17:52] thanks sukhe !
[13:28:10] sukhe: no pb at all! I have a ton of patches to merge
[13:28:16] so I can take care of it
[13:28:23] thansk!
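For context on the key discussion above (yubikey attributes and the ssh-ed25519 migration in T336769): a minimal sketch, assuming a GnuPG-backed YubiKey, of how one might confirm the card's algorithms and produce an ed25519 SSH public key for the puppet/Homer change requests. The key ID and comment below are placeholders, not taken from the log.

    gpg --card-status | grep -i 'key attributes'       # expect something like: ed25519 cv25519 ed25519
    gpg --export-ssh-key 0xDEADBEEF > id_ed25519.pub    # placeholder key ID; SSH public key from the authentication subkey
    # or, for a plain on-disk key instead of a card-backed one:
    ssh-keygen -t ed25519 -C "illustrative-comment"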
[13:28:27] thanks even
[13:28:30] sukhe: hey
[13:28:35] just had a discussion with Arzhel and Riccardo on the plan for Amsterdam
[13:28:50] we were thinking of changing the plan slightly if it works for you guys
[13:29:08] the TL;DR being to keep the esams name / "3" prefix as discussed yesterday
[13:29:24] but to use *new* IP ranges (private and public) for everything
[13:31:22] does that sound ok from your point of view?
[13:32:18] One significant difference is it means new VIPs announced from the LVS, so DNS changes to re-point things at the new IP(s)
[13:32:41] and also ns2 will need to get its glue record changed
[13:33:06] indeed yep, although I think that would have been required anyway
[13:33:21] but now it'll change to a new IP in a different /24, as opposed to staying within the same /24
[13:34:21] topranks: hello!
[13:34:49] sure, I don't see any issues from our end
[13:34:56] and yeah, we will update glue records for ns2
[13:36:31] on that we were thinking we can assign the new ns2 IP in advance
[13:36:43] yeah not a bad idea
[13:36:51] then configure it on the current ns2 lo interface, or even ns0/ns1 possibly
[13:36:57] and route it to that in advance
[13:37:37] once we are happy we are serving users from both the current and new ns2 IP we can update the records with the registrar
[13:37:37] but that has to be an either or right? I mean, we either do ns0/1 or we do ns2 (given that we can't announce from two places?)
[13:38:11] we can route the traffic internally. i.e. if the /24 is announced in Amsterdam we can backhaul on our own network to any other pop
[13:38:23] or other options, but basically we have some flexibility there
[13:38:25] right, that way, ok.
[13:39:00] topranks, sukhe: https://wikitech.wikimedia.org/wiki/Service_restarts#Authoritative_DNS
[13:39:39] XioNoX: yes, which part about this though :)
[13:40:02] for a reason to configure the new IP on ns0 as well
[13:40:31] XioNoX: thanks for the link
[13:40:42] yeah
[13:40:51] are the instructions not missing a bit?
[13:40:54] we had backup routes for ns0/1 in both eqiad/codfw up till very recently
[13:41:00] but those have gone away
[13:41:09] oh ok
[13:41:10] i.e. to re-route ns0 to dns2001 do we not delete the static in eqiad, and ADD the static in codfw ?
[13:41:35] sukhe: do all 3 still have all the IPs configured on the hosts themselves ?
[13:42:27] topranks: do you mean all three nameservers?
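A minimal sketch of the "configure it on the lo interface in advance" idea discussed above. 198.51.100.53 is a documentation-range placeholder for the not-yet-assigned new ns2 address, and in practice the address would come from puppet rather than be added by hand:

    sudo ip addr add 198.51.100.53/32 dev lo    # placeholder for the new ns2 service IP
    ip -br addr show dev lo                      # confirm it sits alongside the existing service IPs on loopback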
[13:42:45] right now, ns0 goes to dns100[4-6] (since yesterday)
[13:42:49] and ns1 goes to dns200[4-6]
[13:43:45] sukhe@re0.cr1-eqiad# show routing-options static route 208.80.154.238/32 next-hop
[13:43:48] next-hop [ 208.80.154.6 208.80.154.153 208.80.154.77 ];
[13:44:23] sukhe: I meant on the servers themselves
[13:44:29] cmooney@dns2004:~$ ip -br addr show dev lo
[13:44:29] lo UNKNOWN 127.0.0.1/8 208.80.154.238/32 208.80.153.231/32 91.198.174.239/32 10.3.0.1/32 198.35.27.27/32 ::1/128
[13:44:59] Which look to have all the public IPs assigned on all of them
[13:45:04] yep
[13:45:09] probably stemming from this:
[13:45:13] in other words it's the routing on the CRs that controls where things go
[13:45:21] once upon a time, in cr*-eqiad, we had backup routes for ns1
[13:45:24] - /* ns1 */
[13:45:24] - route 208.80.153.231/32 {
[13:45:24] - next-hop 208.80.154.10;
[13:45:24] - readvertise;
[13:45:24] - no-resolve;
[13:45:26] - preference 200;
[13:45:29] - }
[13:46:34] sure
[13:46:43] So I think what we'd be talking about here is
[13:47:07] - Assign new IP which will eventually replace current ns2 public
[13:47:18] - Add that to lo interface on all dns servers
[13:48:07] - Route the new IP to ns2 in Amsterdam
[13:48:17] - Validate we are serving users on both old and new ns2 IP
[13:48:26] - Change the registrar IP for ns2 to the new IP
[13:48:57] Some time after that we can remove the old ns2 IP (91.198.174.239) from servers/routes
[13:49:17] so
[13:49:27] route the new IP to ns2 in Amsterdam, announcing it from ns0/1?
[13:49:33] During the knams migration we can re-route that new ns2 IP to eqiad/codfw if we wish
[13:49:34] as in, eqiad/codfw?
[13:51:44] no, sorry I phrased that wrong
[13:51:58] - Route the new IP to dns3001 and dns3002
[13:52:21] ah
[13:52:44] during the week we are in Amsterdam doing migration we may wish to redirect it to codfw/eqiad though yes
[13:53:03] but what I was outlining was basically move to a new IP for ns2 prior to the other work
[13:53:32] if it makes sense to do it that way
[13:53:53] idea being less moving parts during migration, records with registrar etc. have already been changed and "propagated"
[13:55:03] yep, agreed, ns2 is the most critical part of the operation anyway in that sense
[13:55:13] ok, I don't see any immediate concern with it, thanks
[13:56:31] topranks: just out of curiosity, what necessitates moving to new IP ranges?
[13:56:45] versus reusing the same ones from esams?
[13:57:24] "necessitates" is a strong word :)
[13:57:28] :)
[13:57:30] haha
[13:57:37] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (10ayounsi)
[13:58:18] cleaner as it allows us to use the same cookie-cutter allocation as drmrs and other sites
[13:58:26] it basically means we can assign the IPs so they 1:1 match what we have in drmrs
[13:58:50] it allows us to start configuring devices ahead of time
[13:58:53] ok
[13:59:04] remove some of the misconfig that built up over time
[14:00:45] ok :) helps with the OCD too!
[14:00:53] (which is why I wanted to vote for moving away from 3x :P)
[14:01:06] I tried this too ^
[14:01:07] :)
[14:01:55] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (10ayounsi) a:03BBlack Assigning the task to @BBlack for when he comes back.
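A rough sketch of how the "validate we are serving users on both old and new ns2 IP" step outlined above might be spot-checked from outside; 91.198.174.239 is the current ns2 address quoted in the log, while 198.51.100.53 remains a documentation-range placeholder for the new one:

    dig +short @91.198.174.239 wikipedia.org SOA    # old ns2 IP still answering authoritatively
    dig +short @198.51.100.53 wikipedia.org SOA      # placeholder new IP answering once routed to dns3001/dns3002
    dig +short ns2.wikimedia.org A                   # after the registrar/glue update, should return the new IP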
[14:02:13] hey, if you all want that we can totally do it, just not mix things up, XXams/7xxx, but be prepared to do all the required patches in all the repos
[14:02:16] ;)
[14:02:41] volans: that ship has sailed!
[14:02:54] when?
[14:03:21] a ship can't sail without volans on board
[14:03:23] well we decided that in two meetings ago or something
[14:03:41] and I don't want to be the one that opens that discussion again to further confuse things
[14:08:35] tbh if there is enough consensus and bandwidth to do it (but don't count on me) I don't think that ship has sailed
[14:11:24] 10netops, 10Infrastructure-Foundations, 10SRE: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (10RobH)
[14:19:02] topranks: any preference from your side?
[14:19:21] XioNoX seems to be on not reusing 3x
[14:19:30] and I have the same opinion fwiw
[14:19:39] we should probably check with dc-ops too
[14:19:44] that is, if everyone thinks this is worth pursuing
[14:20:21] I've no objection
[14:21:10] did I hear correctly yesterday that the plan is to reuse the name esams for the new knams? :)
[14:22:12] question_mark: welcome to the party :)
[14:22:25] is this the bikeshedding party? ;)
[14:22:48] that was the provisional plan yes, but being discussed above and perhaps re-evaluated
[14:23:03] question_mark: bikeshedding but the bike shed needs to be built yes :)
[14:23:45] eh, we discussed it more earlier today, even though I was in favor of using knams and a different number, the tradeoff is that it would require many more resources than keeping esams/3x
[14:24:39] that old naming scheme has a flaw :( the vendor names change all the time
[14:24:40] using different IP ranges brings most benefits of a greenfield deployment
[14:25:07] esams is no longer correct, knams hasn't been correct since about 2008
[14:25:12] yeah
[14:25:13] yeah both names are currently invalid - 'knams' would be 'drams' now
[14:25:19] but I'm sure that'll change again next week
[14:25:22] indeed
[14:25:36] so we could opt to just abandon that too, if you're changing names anyway... but only if you are
[14:25:43] drams is too close to drama :)
[14:25:47] ams1 ams2 etc probably more conventional ;)
[14:25:51] so our take on that was we accept that the name doesn't reflect the site owner, it's just our internal name
[14:26:03] yeah that would be better
[14:26:38] yeah that would be the way to go if changing it. but not sure if there is pressing need.
[14:27:29] yeah too many things are built on the 5 letters assumption
[14:27:36] fair enough
[14:27:36] yeah, it's almost natural now
[14:28:03] aaams, abams, acams ;p
[14:28:04] let's keep the popcorn for when we open our 10th POP
[14:28:24] Y2K moment
[14:28:35] If you thought IPv4 exhaustion was tricky wait till we get to that
[14:31:59] so... what's the decision on this then :)
[14:32:10] we should probably decide sooner than later
[14:32:36] I guess dc-ops is missing from this discussion and I *think* (don't want to ascribe) rob and papaul might have some thoughts too
[16:40:37] 10Traffic, 10Phabricator, 10SRE, 10SecTeam-Processed: Accessing Phabricator from Tor (some ranges blocked but not others) - https://phabricator.wikimedia.org/T254568 (10sbassett)
[18:38:01] 10Traffic: Investigate why HAProxy SLO Grafana dashboard has negative values on combined SLI - https://phabricator.wikimedia.org/T341606 (10BCornwall)
[18:42:47] 10Traffic: Investigate why HAProxy SLO Grafana dashboard has negative values on combined SLI - https://phabricator.wikimedia.org/T341606 (10BCornwall) 05Open→03In progress p:05Triage→03Low
[19:01:35] 10Traffic: Upgrade to pdns-recursor 4.8.4 - https://phabricator.wikimedia.org/T341611 (10ssingh)
[19:01:48] 10Traffic: Upgrade to pdns-recursor 4.8.4 - https://phabricator.wikimedia.org/T341611 (10ssingh) p:05Triage→03Medium a:03ssingh
[19:02:37] 10Traffic: Upgrade to pdns-recursor 4.8.4 - https://phabricator.wikimedia.org/T341611 (10ssingh) Affected hosts: ` sukhe@cumin2002:~$ sudo cumin 'C:dnsrecursor' 30 hosts will be targeted: cloudservices[2004-2005]-dev.codfw.wmnet,cloudservices[1004-1005].wikimedia.org,dns[1004-1006,2004-2006,3001-3002,4003-4004,...
[19:03:35] 10Traffic: Upgrade to pdns-recursor 4.8.4 - https://phabricator.wikimedia.org/T341611 (10ssingh)
[21:15:44] I'm at a step on this doc page that says "Ask on #wikimedia-traffic which are the backup LVS server for the LVS class of your service on both datacentres and restart pybal on those"
[21:15:52] regarding https://gerrit.wikimedia.org/r/c/operations/puppet/+/831173/2/hieradata/role/eqiad/wmcs/openstack/eqiad1/labweb.yaml
[21:15:56] so... which are the backup servers?
[21:16:06] (or better yet, would someone braver than me like to do the restarts?)
[21:26:11] 10Traffic, 10cloud-services-team: Rename references of labweb to cloudweb - https://phabricator.wikimedia.org/T317463 (10Andrew) I merged the above patches and briefly triggered a couple of alerts like "PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2...
[21:26:21] ok, well -- when some of you come back online I'd appreciate a look at the fallout from T317463. Everything looks OK to me but there may be bits I've missed. TY!
[21:26:22] T317463: Rename references of labweb to cloudweb - https://phabricator.wikimedia.org/T317463
[21:47:29] 10netops, 10Infrastructure-Foundations: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10BTullis) Removing #data-engineering as I think that #infrastructure-foundations is on top of it.
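On the pybal restart request near the end: a hedged sketch of what restarting pybal on a backup LVS host and re-checking the IPVS state might look like after the labweb-to-cloudweb rename; lvs1020 is a placeholder hostname, since the log never names the actual backup servers for that LVS class.

    ssh lvs1020.eqiad.wmnet                      # placeholder backup LVS host
    sudo systemctl restart pybal.service
    sudo journalctl -u pybal.service -n 50       # check the service pools were re-read cleanly
    sudo ipvsadm -L -n | head                    # confirm IPVS now lists the renamed service VIPs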