[00:51:03] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[00:52:18] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[01:11:28] 06Traffic, 06DC-Ops, 10ops-eqsin, 06SRE, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9919775 (10BCornwall)
[01:12:18] RESOLVED: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[01:36:05] 06Traffic, 06DC-Ops, 10ops-eqsin, 06SRE, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9919804 (10BCornwall)
[01:54:47] 06Traffic, 06DC-Ops, 10ops-eqsin, 06SRE, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9919821 (10BCornwall)
[01:55:01] 06Traffic, 06DC-Ops, 10ops-eqsin, 06SRE, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9919822 (10BCornwall)
[03:49:53] 07HTTPS, 10MediaWiki-Action-API, 10MediaWiki-REST-API, 10RESTBase-API, 06Wikimedia Enterprise: Proposal: fail explicitly and revoke relevant API keys over plain-text HTTP connection for all Wikimedia APIs - https://phabricator.wikimedia.org/T368344 (10Diskdance) 03NEW
[03:52:42] 07HTTPS, 06Traffic, 10MediaWiki-Action-API, 10MediaWiki-REST-API, and 2 others: Proposal: fail explicitly and revoke relevant API keys over plain-text HTTP connection for all Wikimedia APIs - https://phabricator.wikimedia.org/T368344#9919994 (10Pppery)
[09:02:28] hey folks, I am going to roll out the glibc changes to all cp nodes
[09:09:50] elukey: no daemon restarts?
[09:18:02] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9920631 (10ABran-WMF)
[09:18:23] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9920630 (10Marostegui) >>! In T365995#9883497, @jcrespo wrote: > backup1009 is the main backup node for bacula on eq...
[09:19:07] vgutierrez: yep it can be picked up anytime, a lot of cp text nodes already run it and cp4052 was restarted yesterday, seems fine to avoid a complete roll restart (also there are some reboots planned afaics)
[09:19:31] the upgrade is mostly an exercise, no real security flaws to fix
[09:19:38] elukey: ack
[09:19:42] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9920643 (10Marostegui)
[09:20:26] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9920645 (10ABran-WMF)
[09:21:19] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997#9920659 (10ABran-WMF)
[09:23:11] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9920670 (10ABran-WMF)
[09:50:30] 06Traffic, 10MW-on-K8s, 06serviceops, 06SRE, and 2 others: Turn down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#9920844 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7ca43ab0-579a-4f82-97aa-11720f300bd7) set by cgoubert@cumin1002 for 21 days, 0:00...
[09:54:13] 06Traffic, 10MW-on-K8s, 06serviceops, 06SRE, and 2 others: Turn down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#9920870 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=046a1781-9fad-454c-b26b-ad2c96d2d8b2) set by cgoubert@cumin1002 for 21 days, 0:00...
[09:55:25] 10netops, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9920871 (10cmooney) >>! In T326322#9650260, @cmooney wrote: >>>! In T326322#9130092, @ayounsi wrote: >> @cmooney I came across https://w...
[10:50:39] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9921174 (10jcrespo) > Is there a procedure for that so we know how to do so? Sadly, there is not. The code changes...
[10:56:13] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9921190 (10Marostegui) I will try - but just in case @ABran-WMF please take some notes!
[12:32:41] claime: ack, ping me when needed!
[12:33:10] fabfur: great, thank you
[12:56:44] vgutierrez: hi, I'd a small question about Liberica?
[12:57:01] topranks: hi, go ahead please
[12:57:05] I see in the Katran source there is a parameter COPY_INNER_PACKET_TOS
[12:57:24] it defaults to on (or at least my assumption is that's what the 1 means)
[12:57:25] https://github.com/facebookincubator/katran/blob/13a651916ce5a182a047e64737f8415188c8e97b/katran/lib/bpf/balancer_consts.h#L304
[12:57:40] I assume we don't modify this setting?
[12:58:09] not during my initial tests, I haven't implemented Katran as a forwarding plane yet
[12:58:28] ok
[12:58:41] well that default I think makes sense to leave alone
[12:58:42] what's the desired value for you? :)
[12:58:54] in terms of our packet-prioritisation / qos on the network
[12:59:18] it's best if Katran copies the DSCP/TOS value from the original packet to the new IP header in the IPIP tunnelled one
[12:59:22] so the default is best for us
[13:00:08] we'll have already set that to "default" priority for external traffic from internet, but internal services SREs may have marked it to be considered "high" or "low" priority
[13:00:24] keeping the default setting preserves all that for the traffic forwarded by the LB
[13:00:56] which leads me to another question ;)
[13:01:04] I know for PyBal the nodes don't run iptables/nftables etc., for performance reasons I think?
[13:01:39] is that the same for Liberica? I'd have thought with eBPF pulling the traffic to be load-balanced out of the kernel pipeline we could allow nftables to filter traffic to/from the host itself?
[13:04:50] that would be feasible yes
[13:06:07] ok, I guess we can discuss again when closer to the time
[13:06:20] at least I'm happy to test nftables on katran based nodes
[13:08:11] but overall I think it would be good if we could do it, better security to protect the kernel / system IP itself
[13:08:14] it also would enable us to mark the DSCP/TOS bits in packets the system generates (and with the above setting Katran will do that for the traffic it forwards, to match the source)
[13:09:01] with the LVS currently that's a small gap in our end-to-end QoS. It's a minor thing that won't cause many problems, but it'd be nice if Liberica could use nft
[13:15:53] +1
[14:55:41] 10netops, 06Data-Persistence, 06Data-Platform-SRE, 06DBA, and 3 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986#9922024 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7a21c2a6-e267-4150-8111-b348788c4a9b) set by cmoo...
[14:58:37] 10netops, 06Data-Persistence, 06Data-Platform-SRE, 06DBA, and 3 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986#9922051 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=01b84d43-d6d0-4f45-bc2e-375ff79e21f8) set by cmoo...
[14:59:05] 10netops, 06Data-Persistence, 06Data-Platform-SRE, 06DBA, and 3 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986#9922053 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=65c438b1-9725-4de3-9a45-8318edea15f1) set by cmoo...
[16:26:11] 06Traffic, 06DC-Ops, 10ops-eqsin, 06SRE, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9922688 (10RobH)
[16:27:10] 06Traffic, 06DC-Ops, 10ops-eqsin, 06SRE, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9922684 (10RobH) a:05RobH→03None
[16:27:49] 06Traffic, 10Observability-Tracing: traceparent response headers are being emitted externally - https://phabricator.wikimedia.org/T368428 (10CDanis) 03NEW
[16:32:08] 06Traffic, 06DC-Ops, 10ops-codfw, 06serviceops, 06SRE: lvs2011 Memory failure on slot B1 - https://phabricator.wikimedia.org/T368165#9922720 (10Jhancock.wm) swapped DIMM_B1 for DIMM_B2 to test. error has cleared.
[16:32:37] vgutierrez: when you have a moment can I get a +1 on https://gerrit.wikimedia.org/r/1049603 ?
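Two asides on the 12:57-13:15 exchange above. First, the compile-time default under discussion: COPY_INNER_PACKET_TOS is a plain C define in the linked balancer_consts.h, and a quick way to confirm its value on a checkout is sketched below (the output shown is an approximation, not a verbatim quote of the file):

    git clone https://github.com/facebookincubator/katran.git && cd katran
    grep -n -B1 -A1 'define COPY_INNER_PACKET_TOS' katran/lib/bpf/balancer_consts.h
    # expected shape of the match, approximately:
    #   #ifndef COPY_INNER_PACKET_TOS
    #   #define COPY_INNER_PACKET_TOS 1
    #   #endif
    # i.e. unless the balancer is built with the define overridden to 0, the inner
    # packet's TOS/DSCP byte is copied into the outer IPIP header, which is the
    # behaviour topranks wants preserved.

Second, on running nftables on Liberica/Katran nodes: since Katran attaches at XDP, the traffic it forwards is redirected before it ever reaches netfilter, so a host firewall would only see (and only needs to cover) traffic to and from the host itself, and it could also set DSCP on locally generated packets. A minimal illustrative ruleset, assuming nothing about the eventual Liberica design; the table/chain names, the SSH-only input policy and the cs0 class are placeholders rather than a proposed production policy:

    nft add table inet host_fw
    nft add chain inet host_fw input '{ type filter hook input priority 0; policy drop; }'
    nft add rule inet host_fw input ct state established,related accept
    nft add rule inet host_fw input iif lo accept
    nft add rule inet host_fw input meta l4proto '{ icmp, ipv6-icmp }' accept
    nft add rule inet host_fw input tcp dport 22 accept
    # mark host-generated IPv4 traffic (an equivalent ip6 dscp rule would cover v6);
    # with COPY_INNER_PACKET_TOS left at 1, Katran keeps the client's marking on the
    # traffic it forwards, so the two stay consistent
    nft add chain inet host_fw output '{ type filter hook output priority -150; }'
    nft add rule inet host_fw output ip dscp set cs0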
[16:35:37] 06Traffic, 10Observability-Tracing, 13Patch-For-Review: traceparent response headers are being emitted externally - https://phabricator.wikimedia.org/T368428#9922746 (10CDanis)
[16:35:45] thanks <3
[16:35:57] as for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1049604 do ATS Lua changes still require restarts?
[16:36:09] I'm not in a rush about that one ofc
[16:36:38] 10netops, 06Data-Persistence, 06Data-Platform-SRE, 06DBA, and 3 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986#9922750 (10cmooney) 05Open→03Resolved
[16:50:56] topranks: we are seeing some issues with lvs2011 IPv6 traffic
[16:50:58] topranks: Heya! Mind joining us in this corner?
[16:51:19] we noticed because we need to use bast2003 to log via ssh into lvs2011, other bastions won't work
[16:51:33] https://www.irccloud.com/pastebin/WrZrqESz/
[16:51:47] MTR shows that bast6003 can't reach lvs2011 on port 22 TCP
[16:55:11] the host is sending RST back
[16:55:16] 16:54:34.804413 IP6 2620:0:861:4:208:80:155:110.43496 > 2620:0:860:113:10:192:23:9.22: Flags [S], seq 4090563902, win 43200, options [mss 1440,sackOK,TS val 1070070870 ecr 0,nop,wscale 9], length 0
[16:55:16] 16:54:34.804465 IP6 2620:0:860:113:10:192:23:9.22 > 2620:0:861:4:208:80:155:110.43496: Flags [R.], seq 0, ack 4090563903, win 0, length 0
[16:57:03] although it's funny cos the SSH connection (in this case from bast1003) doesn't immediately fail, which you'd expect if that RST went back
[16:57:32] I'm going to let JennH know that these issues don't require her presence in the DC any more
[16:59:30] Thanks for your work :)
[16:59:35] vgutierrez: something fishy
[16:59:54] 06Traffic, 06DC-Ops, 10ops-codfw, 06serviceops, 06SRE: lvs2011 Memory failure on slot B1 - https://phabricator.wikimedia.org/T368165#9922861 (10BCornwall) 05Open→03Resolved Linux is happy, too. Thank you, @Jhancock.wm!
[17:00:04] right now the default IPv6 route is using vlan2018
[17:00:32] which is fairly normal (should be fixed - see T358260)
[17:00:33] T358260: Disable acceptance of IPv6 router-advertisement on non-default LVS interface - https://phabricator.wikimedia.org/T358260
[17:00:51] now the odd thing is the IPv6 default only gets added when it gets an RA on the interface
[17:01:10] 06Traffic, 06DC-Ops, 10ops-eqsin, 06SRE, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9922864 (10BCornwall) a:03BCornwall
[17:04:10] topranks: somehow that RST doesn't arrive at the client
[17:04:21] topranks: or mtr wouldn't show lost packets?
[17:04:29] yep you're correct
[17:04:45] the RST is odd, but it's the IPv6 default route not working for some reason
[17:04:59] we only noticed this after rebooting the host BTW
[17:06:01] to add to that, topranks, also as a refresher: we worked on this in https://phabricator.wikimedia.org/T352920 most recently but had not rebooted it since then
[17:06:03] I think I see what's wrong... but I'm scratching my head as to what could have changed
[17:06:06] (vlan missing on switch)
[17:06:20] ah I think I know
[17:07:28] so this host is in row A.... we only put the IP gateways for vlan2018, private1-b-codfw, on the switches in row B
[17:07:34] as that is the only place they are needed
[17:07:59] *but* the LVS can use any random vlan for IPv6 traffic because..... well T358260
[17:08:00] T358260: Disable acceptance of IPv6 router-advertisement on non-default LVS interface - https://phabricator.wikimedia.org/T358260
[17:08:43] being specific, in theory it should be able to hit one of the anycast gateway interfaces on one of the row B switches
[17:08:55] they are what are sending the RAs it is using for its default route
[17:12:54] yeah this is some quirk, the same thing happens in v4
[17:13:08] lvs2011 can't ping the anycast GW IP on the row B leaf switches on vlan 2018
[17:13:36] it can reach hosts on the vlan just fine though - just not the switch GW IP
[17:13:39] https://www.irccloud.com/pastebin/81RXiH7d/
[17:14:34] same with v6
[17:14:39] https://www.irccloud.com/pastebin/AiyFS5wv/
[17:16:00] So for now I fixed it by deleting the default route, letting lvs choose another one of the many RAs it's accepted instead
[17:16:14] root@lvs2011:~# ip route del default via fe80::2018:0:1
[17:16:14] root@lvs2011:~# ip -6 route get fibmatch ::
[17:16:14] default via fe80::1 dev vlan2019 proto ra metric 1024 expires 577sec hoplimit 64 pref medium
[17:16:24] yeah it works now
[17:16:40] so I guess we should revisit https://phabricator.wikimedia.org/T358260
[17:16:45] We could do some things we'd rather not, to fix this on the network side
[17:16:45] I do see your patch there, I know I know
[17:17:06] specifically extending the vlan to all switches in row A, C and D and adding a GW interface
[17:17:35] but tbh I don't think this is really an issue network-side, and it's a bad idea to add work-arounds on the network rather than fix the root cause
[17:19:52] thx topranks :)
[17:20:14] it makes sense it happened after a reboot I think, the behaviour is it uses one of the RAs at random, but then sticks with that as its default
[17:20:37] so after a reboot here it picked vlan2018, which is in row B, and is now an anycast gw in a vxlan
[17:21:02] there *is* some quirk there but I think the easier way to solve is to set the sysctls on the lvs
[17:21:55] we won't have any other hosts trying to use a switch in a remote row as its gateway
[17:22:25] so - despite the fact that in a regular ethernet that should work - the deficiency on the network is not gonna have an impact otherwise
[17:24:26] topranks: thanks <3
[17:24:39] brett: I think once you have verified everything else is fine, repool it IMO
[17:25:39] yeah - from experience it should be stable and keep using the current IPv6 default
[17:27:11] sukhe: Other than SSH I didn't really see anything wrong until traffic flowed. So I guess if everything seems right might as well open the faucet
[17:27:22] brett: yeah
[17:27:23] topranks: do you have any idea how the linux routing engine makes that decision? is it the same kind of thing as on junipers, where the longest-lived BGP session breaks a tie?
[17:28:38] cdanis: I actually don't, RAs are sent periodically so I expect it just picks the first that it gets after coming online
[17:28:45] yeah that makes sense
[17:29:53] it's sort of the same as what you mention on the Juniper
[17:30:15] as in, if it has a route in the table, and then learns another - with exactly equal attributes - it keeps what it has
[17:30:23] yep sure
[17:30:39] Seeing the elevated tcp/socket errors again in the host overview
[17:31:20] ssh is working from eqiad still, the v6 route on lvs2011 hasn't changed
[17:32:19] topranks: Is that to say that this should not be pooled?
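For context on the sysctl route mentioned at 17:21 (tracked in T358260): the box accepts RAs on every vlan interface and keeps whichever default it installed first, which is why a reboot can land it on a remote-row gateway. A hedged sketch of the kind of per-interface settings that task points at; the interface names are illustrative, and the primary interface uses accept_ra=2 so RAs are still honoured even if forwarding is enabled on the host (1 is enough otherwise):

    # keep RA-learnt defaults only on the primary interface (name is an assumption)
    sysctl -w net.ipv6.conf.eno1.accept_ra=2
    # ignore RAs on the secondary vlan interfaces (names are examples)
    for i in vlan2018 vlan2019 vlan2020; do
        sysctl -w "net.ipv6.conf.${i}.accept_ra=0"
    done
    # routes already learnt on a secondary interface age out with their RA lifetime,
    # or can be removed by hand as was done above with "ip route del default via fe80::2018:0:1"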
[17:32:58] no it still looks ok to me is what I was saying
[17:33:08] or at least the elevated socket errors are not due to the same thing
[17:33:26] but probably we should work out what they are before pooling
[17:33:37] sukhe: I'm gonna depool again
[17:34:10] brett: wait
[17:34:20] zoom out a bit and see if the errors stand out in any way
[17:34:44] They do stand out. Since the reboot it goes above a pretty flat plateua
[17:34:47] plateau
[17:35:15] tcp/inerrs
[17:35:20] compare it against the background rate of the other LVS while pooled
[17:35:40] wait tcp/inerrs? that's like, bad checksums, or the packet is being rejected for other reasons
[17:35:58] https://grafana-rw.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=lvs2011&var-datasource=thanos&var-cluster=lvs&from=now-24h&to=now&viewPanel=20
[17:36:15] brett: ok, go ahead
[17:36:22] tcp/attemptfails as well
[17:37:02] weird
[17:38:37] almost a similar rate (and worse on lvs2014 fwiw)
[17:39:06] oh yeah, so this might be an issue beyond lvs?
[17:40:26] I think you will see some tcp/inerrs on basically any host you pick. I don't think -- and I may be wrong -- that this is an alarming rate
[17:40:38] unless you find some other symptom that is
[17:41:15] compare the rates of lvs2011 for example with any other lvs, including 2014 for example since that is now the primary ht-1
[17:41:49] Yeah. Makes sense, I was alarmed by the difference before and after. But it seems to have happened independently
[17:42:00] soooooo you cool with re-re-pooling?
[17:42:14] unless you see anything else that is wrong, I am.
[17:43:12] pybal is going to generate some tcp/attemptfails as it healthchecks things that are not online right now
[17:43:31] (those are connection timeouts and i think perhaps also conn refused)
[17:44:37] yeah, as a baseline
[17:45:00] brett: keep an eye out on the LVS graphs/ipvsadm output to see it is picking up traffic fine and 2014 is draining
[17:45:27] don't worry if it pages :]
[17:45:46] FWIW on the original issue the problem is for some reason the VXLAN-based switches are struggling to properly deal with packets that are sent to them over an L2VNI with a destination MAC of its own local interface
[17:46:24] Juniper do have a design they call "centrally routed bridged overlay" which involves exactly that happening, but I suspect there are some config knobs we don't have that are required to make it work (ARP/ND snooping perhaps)
[17:47:34] But as I said I think we can hopefully resolve by adjusting the LVS to use its primary interface instead
[18:03:54] topranks: will discuss with Traffic more formally.
[18:03:56] thank you :)
[18:04:02] for the resolution for today as well
[18:04:05] brett: looks good I would say
[18:04:09] no probs
[18:04:22] yeah
[18:29:51] bblack: any chance you're around?
[18:31:19] 06Traffic, 06DC-Ops, 10ops-eqsin, 06SRE, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9923318 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5017.eqsin.wmnet with OS b...
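A small aside on the tcp/inerrs and tcp/attemptfails panels discussed between 17:30 and 17:45: the same counters can be read on the host itself, which makes it easy to compare lvs2011 against another LVS from a shell. The nstat counter names below are the standard kernel ones; the Prometheus names in the last comment are an assumption about what the host-overview panels graph:

    # deltas since the last nstat invocation (TcpInErrs: segments dropped for bad
    # checksums or similar; TcpAttemptFails: connection attempts that failed)
    nstat TcpInErrs TcpAttemptFails
    # absolute counters straight from the kernel
    grep '^Tcp:' /proc/net/snmp
    # roughly node_netstat_Tcp_InErrs / node_netstat_Tcp_AttemptFails in node_exporter terms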
[18:37:42] I'm having a "how did this ever work" moment about very old VCL
[18:42:17] cdanis: yeah
[18:42:42] (some of it probably didn't)
[18:50:18] 06Traffic, 06DC-Ops, 10ops-eqsin, 06SRE, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9923367 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5017.eqsin.wmnet with OS bulls...
[18:50:26] 06Traffic, 06DC-Ops, 10ops-eqsin, 06SRE, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9923368 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5017.eqsin.wmnet with OS b...
[18:56:00] 06Traffic, 10Observability-Tracing, 13Patch-For-Review: traceparent response headers are being emitted externally - https://phabricator.wikimedia.org/T368428#9923414 (10CDanis) 05Open→03Resolved
[19:52:38] FIRING: [8x] LVSRealserverMSS: Unexpected MSS value on 103.102.166.224:443 @ cp5017 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=eqsin&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[19:57:38] RESOLVED: [8x] LVSRealserverMSS: Unexpected MSS value on 103.102.166.224:443 @ cp5017 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=eqsin&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[19:59:38] FIRING: [8x] LVSRealserverMSS: Unexpected MSS value on 103.102.166.224:443 @ cp5017 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=eqsin&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[20:03:24] Hi Traffic, DPE / the Search Team are continuing work on the WDQS graph split transition. Requests will be federated between the two different subgraphs. As a result of that our previous method of performing throttling (basically a token-bucket algorithm) won't work since a large number of requests will be coming internally rather than externally, making it challenging to track/attribute the external origin of federation requests
[20:03:29] 06Traffic, 06DC-Ops, 10ops-eqsin, 06SRE, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9923652 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5017.eqsin.wmnet with OS bulls...
[20:04:38] RESOLVED: [8x] LVSRealserverMSS: Unexpected MSS value on 103.102.166.224:443 @ cp5017 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=eqsin&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[20:05:34] Is someone available to discuss tomorrow with dcausse and me? We have a weekly search meeting beginning at 15:00 UTC tomorrow (weds) that would be a good venue to have the discussion, if someone from traffic is free to attend
[20:06:16] Meeting is 15:00-15:30 on the calendar but the true length of the meeting is more like 15:00-17:00 FYI
[20:12:28] ryankemper: I am not a good fit for that so will defer to others.
[20:12:55] also a bit late for most people so please email sre-traffic@
[20:14:10] ack. Will send out that e-mail in a bit, and also include a bit more context since I forgot to link a phab ticket
[20:14:29] thanks