[00:38:12] <wikibugs>	 Traffic, DC-Ops, ops-magru: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9721040 (ssingh) Thanks for the task @RobH! As in the previous runs, please feel free to leave these for Traffic:  `  Update the operations/puppet repo - this should include updates to preseed.ya...
[02:22:10] <Guest5>	 lung disease
[05:34:15] <wikibugs>	 Traffic, DC-Ops, ops-codfw, ops-eqiad, SRE-swift-storage: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9721277 (Papaul) @ssingh After 2 days working on this issue, I finally got to the bottom of the problem. After many reboots on cp11...
[06:30:17] <brett>	 lawl
[09:21:40] <jinxer-wm>	 (VarnishHighThreadCount) firing: (3) Varnish's thread count on cp3070:0 is high - https://wikitech.wikimedia.org/wiki/Varnish  - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[09:26:40] <jinxer-wm>	 (VarnishHighThreadCount) firing: (8) Varnish's thread count on cp3066:0 is high - https://wikitech.wikimedia.org/wiki/Varnish  - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[09:32:34] <wikibugs>	 netops, Infrastructure-Foundations: mr1-eqsin performance issue - https://phabricator.wikimedia.org/T362522#9721802 (cmooney) >>! In T362522#9717511, @cmooney wrote: > FWIW I changed the key-exchange algo configured on mr1-eqsin to see if it would make any difference  CPU is roughly the same pattern sinc...
[09:36:40] <jinxer-wm>	 (VarnishHighThreadCount) firing: (12) Varnish's thread count on cp3066:0 is high - https://wikitech.wikimedia.org/wiki/Varnish  - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[09:41:40] <jinxer-wm>	 (VarnishHighThreadCount) firing: (15) Varnish's thread count on cp3066:0 is high - https://wikitech.wikimedia.org/wiki/Varnish  - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[09:51:40] <jinxer-wm>	 (VarnishHighThreadCount) firing: (15) Varnish's thread count on cp3066:0 is high - https://wikitech.wikimedia.org/wiki/Varnish  - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[09:56:40] <jinxer-wm>	 (VarnishHighThreadCount) firing: (11) Varnish's thread count on cp3066:0 is high - https://wikitech.wikimedia.org/wiki/Varnish  - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[10:01:40] <jinxer-wm>	 (VarnishHighThreadCount) resolved: (8) Varnish's thread count on cp3066:0 is high - https://wikitech.wikimedia.org/wiki/Varnish  - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[11:56:01] <wikibugs>	 netops, Traffic, Infrastructure-Foundations, SRE: ASW single-point of failure for LVS VIPs at POPs - https://phabricator.wikimedia.org/T362772 (cmooney) NEW p:Triage→Medium
[11:57:38] <jinxer-wm>	 (LVSRealserverMSS) firing: (4) Unexpected MSS value on 208.80.153.232:443 @ ncredir2001 - TODO - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=codfw&var-cluster=ncredir - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[12:02:38] <jinxer-wm>	 (LVSRealserverMSS) resolved: (4) Unexpected MSS value on 208.80.153.232:443 @ ncredir2001 - TODO - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=codfw&var-cluster=ncredir - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[12:06:31] <vgutierrez>	 godog: we are experiencing some issues with mtail in bookworm: https://phabricator.wikimedia.org/T357976#9722233
[12:10:41] <godog>	 vgutierrez: ack, interesting
[12:22:11] <godog>	 from reading the task it seems to me mtail is slower to read from the pipe and that essentially creates backpressure into nginx?
[12:26:17] <vgutierrez>	 godog: that's my current understanding 
[12:29:49] <godog>	 vgutierrez: I'm wondering if that's related to https://github.com/google/mtail/issues/685
[12:34:52] <vgutierrez>	 godog: seems the same issue
[13:03:27] <wikibugs>	 Traffic, DC-Ops, ops-codfw, ops-eqiad, SRE-swift-storage: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9722316 (ssingh) @Papaul: Thanks for the update! Looks promising indeed and to actually close this, we should downgrade another host i...
[13:27:04] <wikibugs>	 Traffic: replace mtail with benthos on ncredir instances - https://phabricator.wikimedia.org/T362776 (Vgutierrez) NEW
[14:27:38] <jinxer-wm>	 (LVSRealserverMSS) firing: Unexpected MSS value on 208.80.153.232:443 @ ncredir2001 - TODO - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=codfw&var-cluster=ncredir - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[14:32:38] <jinxer-wm>	 (LVSRealserverMSS) resolved: Unexpected MSS value on 208.80.153.232:443 @ ncredir2001 - TODO - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=codfw&var-cluster=ncredir - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[15:08:06] <wikibugs>	 Traffic, DC-Ops, ops-codfw, ops-eqiad, and 2 others: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9722796 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp1114.eqiad.wmnet with OS bullseye
[15:21:02] <wikibugs>	 Traffic, DC-Ops, ops-codfw, ops-eqiad, and 2 others: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9722934 (ssingh) Open→Resolved @Papaul deserves a lot of love for fixing this persistent issue. The 21.x firmware (specifically, `N...
[15:25:49] <wikibugs>	 Traffic, DC-Ops, ops-codfw, ops-eqiad, and 2 others: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9722986 (MoritzMuehlenhoff) >>! In T350179#9722934, @ssingh wrote: > @Papaul deserves a lot of love for fixing this persistent issue. The 2...
[15:26:50] <wikibugs>	 Traffic, DC-Ops, ops-codfw, ops-eqiad, and 2 others: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9722990 (MatthewVernon) +1 to thanks to Papaul for getting to the bottom of this!
[15:43:38] <swfrench-wmf>	 hello traffic friends: in an upcoming etcd maintenance, we may need to temporarily direct all etcd clients to eqiad, which includes pybal in certain DCs (those with affinity to codfw). like adding a new service, this means a puppet change and pybal restart.
[15:43:55] <swfrench-wmf>	 I wanted to confirm that the fairly straightforward (but still a bit scary) procedure in https://wikitech.wikimedia.org/wiki/LVS#Configure_the_load_balancers is still generally up-to-date (i.e., there aren't surprises that make this more complex). e.g., does Liberica need anything special?
[15:44:27] <sukhe>	 swfrench-wmf: we are not using Liberica in production yet
[15:44:38] <sukhe>	 but to answer your question, this is simply changing the etcd endpoint right?
[15:44:52] <sukhe>	 to attempt to answer your question with a question :)
[15:45:20] <swfrench-wmf>	 ah, great - also yes exactly: this points them at a different etcd node temporarily
[15:45:40] <sukhe>	 yeah, in that case, a simple puppet change and restarting of Pybal is enough to pick that up
[15:46:56] <swfrench-wmf>	 great, thank you very much!
[15:47:30] <sukhe>	 swfrench-wmf: Traffic is happy to take care of that; do you have a task and timing?
[15:48:34] <swfrench-wmf>	 still TBD whether we'll actually need to do this (depends on a couple of factors around scheduling the change), but if we do, I'll follow up with you folks at least a week ahead of time. the task is T358636.
[15:48:35] <stashbot>	 T358636: etcdmirror does not recover from a cleared waitIndex - https://phabricator.wikimedia.org/T358636
[15:49:23] <swfrench-wmf>	 also thank you :)
[15:50:21] <sukhe>	 thanks! basically, we will need to update hieradata/role/eqiad/lvs/balancer.yaml and point profile::pybal::config_host to wherever we want and restart pybal
[15:50:28] <sukhe>	 anyway, just ping us here please and we can take care of it
[15:53:09] <wikibugs>	 Traffic, DC-Ops, ops-codfw, ops-eqiad, and 2 others: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9723209 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp1114.eqiad.wmnet with OS bullseye c...
[17:01:38] <wikibugs>	 Traffic, DC-Ops, ops-magru: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9723606 (RobH) >>! In T362729#9721040, @ssingh wrote: > Thanks for the task @RobH! As in the previous runs, please feel free to leave these for Traffic: >  > ` >  Update the operations/puppet rep...
[17:10:15] <cdanis>	 sukhe: XioNoX: topranks: let me know if at some point you'd like to revive Probenet (T334417) to test end-user-latency of magru
[17:10:16] <stashbot>	 T334417: Receive network latency reports from the browsers - https://phabricator.wikimedia.org/T334417
[17:10:32] <sukhe>	 cdanis: we most certainly do and will :) 
[17:10:50] <topranks>	 yep, thanks Chris!
[17:10:52] <cdanis>	 <3 great, all we really need to do it is Varnish running there
[17:11:02] <sukhe>	 it was on our to-do list; for now, we just added magru as the last default but we will need your help there!
[17:11:27] <cdanis>	 yep :) there are a few other small changes needed, but it should be easy
[17:11:39] <cdanis>	 and I have to resurrect the data pipeline at some point, but again, shouldn't be too hard
[17:11:39] <sukhe>	 thanks, will try to do the legwork and send for review 
[17:11:47] <sukhe>	 and what we don't know, we will ask you
[17:13:14] <sukhe>	 cdanis: while we have you here, re: https://phabricator.wikimedia.org/T359054
[17:13:20] <cdanis>	 ask anything you want, I'd love to use this as an excuse to get it more well-understood or to do some other infra work on it ;)
[17:13:38] <sukhe>	 so far the idea is to pick a region in Brazil and a small-ish Spanish-speaking country to ramp up traffic
[17:13:41] <cdanis>	 ah I'm reading
[17:13:49] <sukhe>	 if you have any ideas/suggestions, please add them there
[17:16:16] <cdanis>	 sukhe: so, what's the primary concern?  warming up the caches?  getting a reasonable test of the site on a small number of users before widening the possible impact of any issue?  making reasonable latency mappings of users?
[17:16:55] <sukhe>	 the first two primarily, cache and testing
[17:17:21] <cdanis>	 I'll go digging in the old Probenet data to get the beginnings of an answer about the northern part (Ecuador, Colombia, Venezuela, Guyana)
[17:17:44] <sukhe>	 thanks!
[17:34:29] <jinxer-wm>	 (HAProxyRestarted) firing: HAProxy server restarted on cp1114:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=eqiad%20prometheus/ops&var-instance=cp1114&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[17:39:14] <sukhe>	 er 
[18:58:39] <wikibugs>	 netops, Traffic, Infrastructure-Foundations, SRE, Patch-For-Review: ASW single-point of failure for LVS VIPs at POPs - https://phabricator.wikimedia.org/T362772#9724034 (cmooney) I believe the two patches above, once merged, will add the required redundancy.  Following option 1 above, creatin...
[19:02:46] <wikibugs>	 netops, Traffic, Infrastructure-Foundations, SRE, Patch-For-Review: ASW single-point of failure for LVS VIPs at POPs - https://phabricator.wikimedia.org/T362772#9724049 (cmooney) Perhaps one option would be to ignore the puppet patch to change drmrs and esams for now - but merge the Homer one...
[19:12:03] <wikibugs>	 Traffic, Infrastructure-Foundations, SRE: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps - https://phabricator.wikimedia.org/T359054#9724065 (CDanis) I largely agree with Arzhel's assessment.  At a cursory glance, Uruguay or Paraguay look ideal as first candidates....
[19:54:40] <wikibugs>	 Wikimedia-Apache-configuration: Unit tests for apache config/rewrites - https://phabricator.wikimedia.org/T57857#9724219 (RLazarus) Open→Resolved
[21:34:29] <jinxer-wm>	 (HAProxyRestarted) firing: HAProxy server restarted on cp1114:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=eqiad%20prometheus/ops&var-instance=cp1114&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[21:34:39] <brett>	 again :(
[21:41:44] <brett>	 Is this related to the NIC firmware downgrade, I wonder?
[21:42:38] <wikibugs>	 Traffic, Data-Persistence, SRE, SRE-swift-storage, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#9724526 (Jdlrobson)
[21:46:54] <brett>	 Oh, I see. The restart check is still complaining about the >=1 number of failed restarts, it didn't restart again
[22:28:55] <wikibugs>	 Traffic: Improve HAProxy unexpected restart alert - https://phabricator.wikimedia.org/T362833 (BCornwall) NEW
[22:29:12] <wikibugs>	 Traffic: Improve HAProxy unexpected restart alert - https://phabricator.wikimedia.org/T362833#9724661 (BCornwall) Open→In progress p:Triage→High
[23:43:36] <wikibugs>	 Traffic: Improve HAProxy unexpected restart alert - https://phabricator.wikimedia.org/T362833#9724793 (ssingh) Thanks for the task! At least for now, I restarted `haproxy` so that we don't get this alert and we also don't leave it silenced in case the initial restart (below) was nothing more than a transient...