[07:00:00] fabfur, sukhe: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/935479/1#message-27bb851f6b6fa829030f56b4510fbcf7fa9a0202
[07:12:52] Ok thanks!
[07:23:45] fabfur: your yubikey 5 handles EC keys perfectly fine
[07:24:20] Key attributes ...: ed25519 cv25519 ed25519
[07:30:59] <3
[07:38:47] (SystemdUnitFailed) firing: anycast-healthchecker.service Failed on durum2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:43:47] (SystemdUnitFailed) resolved: anycast-healthchecker.service Failed on durum2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:06:04] vgutierrez: yep, today I'll regenerate the key and open the CRs for puppet and Homer
[08:25:28] 10Traffic, 10SRE, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm)
[08:25:42] 10Traffic, 10SRE, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm)
[08:25:50] 10Traffic, 10SRE, 10envoy, 10serviceops, 10Patch-For-Review: Set a limit to the number of allowed active connections via runtime key overload.global_downstream_max_connections - https://phabricator.wikimedia.org/T340955 (10JMeybohm) 05Open→03Resolved
[08:25:58] 10Traffic, 10SRE, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm)
[08:26:22] 10Traffic, 10SRE, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm) 05Open→03Resolved This is done from our end.
[09:01:08] 10netops, 10Ganeti, 10Infrastructure-Foundations, 10SRE: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (10ayounsi) a:03ayounsi
[09:41:14] 10Traffic, 10Phabricator, 10SRE: Accessing Phabricator from Tor (some ranges blocked but not others) - https://phabricator.wikimedia.org/T254568 (10Aklapper) 05Open→03Resolved Optimistically resolving as T253632 is resolved. Please reopen if this is still an issue - thanks!
[11:35:37] 10Traffic, 10netops, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Management LAN in eqsin offline due to failure of mr1-eqsin - https://phabricator.wikimedia.org/T341447 (10cmooney) 05Open→03Resolved a:03cmooney Still stable so I will close this for now, if it re-occurs we can engage Juniper.
[13:16:55] XioNoX: apologies! that's on me as I reviewed and merged the patch; I did remember the deprecation of the RSA keys but I forgot the deadline in July (T336769)
[13:16:56] T336769: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769
[13:17:21] I see that fabfur has a new patch uploaded so happy to merge that
[13:17:52] thanks sukhe !
[13:28:10] sukhe: no pb at all! I have a ton of patches to merge
[13:28:16] so I can take care of it
[13:28:23] thansk!
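For context on the key discussion above (yubikey attributes and the ssh-ed25519 migration in T336769): a minimal sketch, assuming a GnuPG-backed YubiKey, of how one might confirm the card's algorithms and produce an ed25519 SSH public key for the puppet/Homer change requests. The key ID and comment below are placeholders, not taken from the log.

    gpg --card-status | grep -i 'key attributes'       # expect something like: ed25519 cv25519 ed25519
    gpg --export-ssh-key 0xDEADBEEF > id_ed25519.pub    # placeholder key ID; SSH public key from the authentication subkey
    # or, for a plain on-disk key instead of a card-backed one:
    ssh-keygen -t ed25519 -C "illustrative-comment"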
[13:28:27] thanks even
[13:28:30] sukhe: hey
[13:28:35] just had a discussion with Arzhel and Riccardo on the plan for Amsterdam
[13:28:50] we were thinking of changing the plan slightly if it works for you guys
[13:29:08] the TL;DR being to keep the esams name / "3" prefix as discussed yesterday
[13:29:24] but to use *new* IP ranges (private and public) for everything
[13:31:22] does that sound ok from your point of view?
[13:32:18] One significant difference is it means new VIPs announced from the LVS, so DNS changes to re-point things at the new IP(s)
[13:32:41] and also ns2 will need to get its glue record changed
[13:33:06] indeed yep, although I think that would have been required anyway
[13:33:21] but now it'll change to a new IP in a different /24, as opposed to staying within the same /24
[13:34:21] topranks: hello!
[13:34:49] sure, I don't see any issues from our end
[13:34:56] and yeah, we will update glue records for ns2
[13:36:31] on that we were thinking we can assign the new ns2 IP in advance
[13:36:43] yeah not a bad idea
[13:36:51] then configure it on the current ns2 lo interface, or even ns0/ns1 possibly
[13:36:57] and route it to that in advance
[13:37:37] once we are happy we are serving users from both the current and new ns2 IP we can update the records with the registrar
[13:37:37] but that has to be an either or right? I mean, we either do ns0/1 or we do ns2 (given that we can't announce from two places?)
[13:38:11] we can route the traffic internally. i.e. if the /24 is announced in Amsterdam we can backhaul on our own network to any other pop
[13:38:23] or other options, but basically we have some flexibility there
[13:38:25] right, that way, ok.
[13:39:00] topranks, sukhe: https://wikitech.wikimedia.org/wiki/Service_restarts#Authoritative_DNS
[13:39:39] XioNoX: yes, which part about this though :)
[13:40:02] for a reason to configure the new IP on ns0 as well
[13:40:31] XioNoX: thanks for the link
[13:40:42] yeah
[13:40:51] are the instructions not missing a bit?
[13:40:54] we had backup routes for ns0/1 in both eqiad/codfw up till very recently
[13:41:00] but those have gone away
[13:41:09] oh ok
[13:41:10] i.e. to re-route ns0 to dns2001 do we not delete the static in eqiad, and ADD the static in codfw ?
[13:41:35] sukhe: do all 3 still have all the IPs configured on the hosts themselves ?
[13:42:27] topranks: do you mean all three nameservers?
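A minimal sketch of the "configure it on the lo interface in advance" idea discussed above. 198.51.100.53 is a documentation-range placeholder for the not-yet-assigned new ns2 address, and in practice the address would come from puppet rather than be added by hand:

    sudo ip addr add 198.51.100.53/32 dev lo    # placeholder for the new ns2 service IP
    ip -br addr show dev lo                      # confirm it sits alongside the existing service IPs on loopback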
[13:42:45] right now, ns0 goes to dns100[4-6] (since yesterday)
[13:42:49] and ns1 goes to dns200[4-6]
[13:43:45] sukhe@re0.cr1-eqiad# show routing-options static route 208.80.154.238/32 next-hop
[13:43:48] next-hop [ 208.80.154.6 208.80.154.153 208.80.154.77 ];
[13:44:23] sukhe: I meant on the servers themselves
[13:44:29] cmooney@dns2004:~$ ip -br addr show dev lo
[13:44:29] lo UNKNOWN 127.0.0.1/8 208.80.154.238/32 208.80.153.231/32 91.198.174.239/32 10.3.0.1/32 198.35.27.27/32 ::1/128
[13:44:59] Which look to have all the public IPs assigned on all of them
[13:45:04] yep
[13:45:09] probably stemming from this:
[13:45:13] in other words it's the routing on the CRs that controls where things go
[13:45:21] once upon a time, in cr*-eqiad, we had backup routes for ns1
[13:45:24] - /* ns1 */
[13:45:24] - route 208.80.153.231/32 {
[13:45:24] - next-hop 208.80.154.10;
[13:45:24] - readvertise;
[13:45:24] - no-resolve;
[13:45:26] - preference 200;
[13:45:29] - }
[13:46:34] sure
[13:46:43] So I think what we'd be talking about here is
[13:47:07] - Assign new IP which will eventually replace current ns2 public
[13:47:18] - Add that to lo interface on all dns servers
[13:48:07] - Route the new IP to ns2 in Amsterdam
[13:48:17] - Validate we are serving users on both old and new ns2 IP
[13:48:26] - Change the registrar IP for ns2 to the new IP
[13:48:57] Some time after that we can remove the old ns2 IP (91.198.174.239) from servers/routes
[13:49:17] so
[13:49:27] route the new IP to ns2 in Amsterdam, announcing it from ns0/1?
[13:49:33] During the knams migration we can re-route that new ns2 IP to eqiad/codfw if we wish
[13:49:34] as in, eqiad/codfw?
[13:51:44] no, sorry I phrased that wrong
[13:51:58] - Route the new IP to dns3001 and dns3002
[13:52:21] ah
[13:52:44] during the week we are in Amsterdam doing migration we may wish to redirect it to codfw/eqiad though yes
[13:53:03] but what I was outlining was basically move to a new IP for ns2 prior to the other work
[13:53:32] if it makes sense to do it that way
[13:53:53] idea being less moving parts during migration, records with registrar etc. have already been changed and "propagated"
[13:55:03] yep, agreed, ns2 is the most critical part of the operation anyway in that sense
[13:55:13] ok, I don't see any immediate concern with it, thanks
[13:56:31] topranks: just out of curiosity, what necessitates moving to new IP ranges?
[13:56:45] versus reusing the same ones from esams?
[13:57:24] "necessitates" is a strong word :)
[13:57:28] :)
[13:57:30] haha
[13:57:37] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (10ayounsi)
[13:58:18] cleaner as it allows us to use the same cookie-cutter allocation as drmrs and other sites
[13:58:26] it basically means we can assign the IPs so they 1:1 match what we have in drmrs
[13:58:50] it allows us to start configuring devices ahead of time
[13:58:53] ok
[13:59:04] remove some of the misconfig that built up over time
[14:00:45] ok :) helps with the OCD too!
[14:00:53] (which is why I wanted to vote for moving away from 3x :P)
[14:01:06] I tried this too ^
[14:01:07] :)
[14:01:55] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (10ayounsi) a:03BBlack Assigning the task to @BBlack for when he comes back.
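A rough sketch of how the "validate we are serving users on both old and new ns2 IP" step outlined above might be spot-checked from outside; 91.198.174.239 is the current ns2 address quoted in the log, while 198.51.100.53 remains a documentation-range placeholder for the new one:

    dig +short @91.198.174.239 wikipedia.org SOA    # old ns2 IP still answering authoritatively
    dig +short @198.51.100.53 wikipedia.org SOA      # placeholder new IP answering once routed to dns3001/dns3002
    dig +short ns2.wikimedia.org A                   # after the registrar/glue update, should return the new IP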
[14:02:13] hey, if you all want that we can totally do it, just not mix things up, XXams/7xxx, but be prepared to do all the required patches in all the repos
[14:02:16] ;)
[14:02:41] volans: that ship has sailed!
[14:02:54] when?
[14:03:21] a ship can't sail without volans on board
[14:03:23] well we decided that in two meetings ago or something
[14:03:41] and I don't want to be the one that opens that discussion again to further confuse things
[14:08:35] tbh if there is enough consensus and bandwidth to do it (but don't count on me) I don't think that ship has sailed
[14:11:24] 10netops, 10Infrastructure-Foundations, 10SRE: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (10RobH)
[14:19:02] topranks: any preference from your side?
[14:19:21] XioNoX seems to be on not reusing 3x
[14:19:30] and I have the same opinion fwiw
[14:19:39] we should probably check with dc-ops too
[14:19:44] that is, if everyone thinks this is worth pursuing
[14:20:21] I've no objection
[14:21:10] did I hear correctly yesterday that the plan is to reuse the name esams for the new knams? :)
[14:22:12] question_mark: welcome to the party :)
[14:22:25] is this the bikeshedding party? ;)
[14:22:48] that was the provisional plan yes, but being discussed above and perhaps re-evaluated
[14:23:03] question_mark: bikeshedding but the bike shed needs to be built yes :)
[14:23:45] eh, we discussed it more earlier today, even though I was in favor of using knams and a different number, the tradeoff is that it would require many more resources than keeping esams/3x
[14:24:39] that old naming scheme has a flaw :( the vendor names change all the time
[14:24:40] using different IP ranges brings most benefits of a greenfield deployment
[14:25:07] esams is no longer correct, knams hasn't been correct since about 2008
[14:25:12] yeah
[14:25:13] yeah both names are currently invalid - 'knams' would be 'drams' now
[14:25:19] but I'm sure that'll change again next week
[14:25:22] indeed
[14:25:36] so we could opt to just abandon that too, if you're changing names anyway... but only if you are
[14:25:43] drams is too close to drama :)
[14:25:47] ams1 ams2 etc probably more conventional ;)
[14:25:51] so our take on that was we accept that the name doesn't reflect the site owner, it's just our internal name
[14:26:03] yeah that would be better
[14:26:38] yeah that would be the way to go if changing it. but not sure if there is pressing need.
[14:27:29] yeah too many things are built on the 5 letters assumption
[14:27:36] fair enough
[14:27:36] yeah, it's almost natural now
[14:28:03] aaams, abams, acams ;p
[14:28:04] let's keep the popcorn for when we open our 10th POP
[14:28:24] Y2K moment
[14:28:35] If you thought IPv4 exhaustion was tricky wait till we get to that
[14:31:59] so... what's the decision on this then :)
[14:32:10] we should probably decide sooner than later
[14:32:36] I guess dc-ops is missing from this discussion and I *think* (don't want to ascribe) rob and papaul might have some thoughts too
[16:40:37] 10Traffic, 10Phabricator, 10SRE, 10SecTeam-Processed: Accessing Phabricator from Tor (some ranges blocked but not others) - https://phabricator.wikimedia.org/T254568 (10sbassett)
[18:38:01] 10Traffic: Investigate why HAProxy SLO Grafana dashboard has negative values on combined SLI - https://phabricator.wikimedia.org/T341606 (10BCornwall)
[18:42:47] 10Traffic: Investigate why HAProxy SLO Grafana dashboard has negative values on combined SLI - https://phabricator.wikimedia.org/T341606 (10BCornwall) 05Open→03In progress p:05Triage→03Low
[19:01:35] 10Traffic: Upgrade to pdns-recursor 4.8.4 - https://phabricator.wikimedia.org/T341611 (10ssingh)
[19:01:48] 10Traffic: Upgrade to pdns-recursor 4.8.4 - https://phabricator.wikimedia.org/T341611 (10ssingh) p:05Triage→03Medium a:03ssingh
[19:02:37] 10Traffic: Upgrade to pdns-recursor 4.8.4 - https://phabricator.wikimedia.org/T341611 (10ssingh) Affected hosts: ` sukhe@cumin2002:~$ sudo cumin 'C:dnsrecursor' 30 hosts will be targeted: cloudservices[2004-2005]-dev.codfw.wmnet,cloudservices[1004-1005].wikimedia.org,dns[1004-1006,2004-2006,3001-3002,4003-4004,...
[19:03:35] 10Traffic: Upgrade to pdns-recursor 4.8.4 - https://phabricator.wikimedia.org/T341611 (10ssingh)
[21:15:44] I'm at a step on this doc page that says "Ask on #wikimedia-traffic which are the backup LVS server for the LVS class of your service on both datacentres and restart pybal on those"
[21:15:52] regarding https://gerrit.wikimedia.org/r/c/operations/puppet/+/831173/2/hieradata/role/eqiad/wmcs/openstack/eqiad1/labweb.yaml
[21:15:56] so... which are the backup servers?
[21:16:06] (or better yet, would someone braver than me like to do the restarts?)
[21:26:11] 10Traffic, 10cloud-services-team: Rename references of labweb to cloudweb - https://phabricator.wikimedia.org/T317463 (10Andrew) I merged the above patches and briefly triggered a couple of alerts like "PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2...
[21:26:21] ok, well -- when some of you come back online I'd appreciate a look at the fallout from T317463. Everything looks OK to me but there may be bits I've missed. TY!
[21:26:22] T317463: Rename references of labweb to cloudweb - https://phabricator.wikimedia.org/T317463
[21:47:29] 10netops, 10Infrastructure-Foundations: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10BTullis) Removing #data-engineering as I think that #infrastructure-foundations is on top of it.
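On the pybal restart request near the end: a hedged sketch of what restarting pybal on a backup LVS host and re-checking the IPVS state might look like after the labweb-to-cloudweb rename; lvs1020 is a placeholder hostname, since the log never names the actual backup servers for that LVS class.

    ssh lvs1020.eqiad.wmnet                      # placeholder backup LVS host
    sudo systemctl restart pybal.service
    sudo journalctl -u pybal.service -n 50       # check the service pools were re-read cleanly
    sudo ipvsadm -L -n | head                    # confirm IPVS now lists the renamed service VIPs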