[00:38:12] Traffic, DC-Ops, ops-magru: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9721040 (ssingh) Thanks for the task @RobH! As in the previous runs, please feel free to leave these for Traffic: ` Update the operations/puppet repo - this should include updates to preseed.ya...
[02:22:10] lung disease
[05:34:15] Traffic, DC-Ops, ops-codfw, ops-eqiad, SRE-swift-storage: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9721277 (Papaul) @ssingh After 2 days working on this issue, I finally got to the bottom of the problem. After many reboots on cp11...
[06:30:17] lawl
[09:21:40] (VarnishHighThreadCount) firing: (3) Varnish's thread count on cp3070:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[09:26:40] (VarnishHighThreadCount) firing: (8) Varnish's thread count on cp3066:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[09:32:34] netops, Infrastructure-Foundations: mr1-eqsin performance issue - https://phabricator.wikimedia.org/T362522#9721802 (cmooney) >>! In T362522#9717511, @cmooney wrote: > FWIW I changed the key-exchange algo configured on mr1-eqsin to see if it would make any difference CPU is roughly the same pattern sinc...
[09:36:40] (VarnishHighThreadCount) firing: (12) Varnish's thread count on cp3066:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[09:41:40] (VarnishHighThreadCount) firing: (15) Varnish's thread count on cp3066:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[09:51:40] (VarnishHighThreadCount) firing: (15) Varnish's thread count on cp3066:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[09:56:40] (VarnishHighThreadCount) firing: (11) Varnish's thread count on cp3066:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[10:01:40] (VarnishHighThreadCount) resolved: (8) Varnish's thread count on cp3066:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[11:56:01] netops, Traffic, Infrastructure-Foundations, SRE: ASW single-point of failure for LVS VIPs at POPs - https://phabricator.wikimedia.org/T362772 (cmooney) NEW p:Triage→Medium
[11:57:38] (LVSRealserverMSS) firing: (4) Unexpected MSS value on 208.80.153.232:443 @ ncredir2001 - TODO - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=codfw&var-cluster=ncredir - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[12:02:38] (LVSRealserverMSS) resolved: (4) Unexpected MSS value on 208.80.153.232:443 @ ncredir2001 - TODO - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=codfw&var-cluster=ncredir - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[12:06:31] godog: we are experiencing some issues with mtail in bookworm: https://phabricator.wikimedia.org/T357976#9722233
[12:10:41] vgutierrez: ack, interesting
[12:22:11] from reading the task it seems to me mtail is slower to read from the pipe and that creates essentially backpressure into nginx ?
[12:26:17] godog: that's my current understanding
[12:29:49] vgutierrez: I'm wondering if that's related to https://github.com/google/mtail/issues/685
[12:34:52] godog: seems the same issue
[13:03:27] Traffic, DC-Ops, ops-codfw, ops-eqiad, SRE-swift-storage: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9722316 (ssingh) @Papaul: Thanks for the update! Looks promising indeed and to actually close this, we should downgrade another host i...
[13:27:04] Traffic: replace mtail with benthos on ncredir instances - https://phabricator.wikimedia.org/T362776 (Vgutierrez) NEW
[14:27:38] (LVSRealserverMSS) firing: Unexpected MSS value on 208.80.153.232:443 @ ncredir2001 - TODO - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=codfw&var-cluster=ncredir - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[14:32:38] (LVSRealserverMSS) resolved: Unexpected MSS value on 208.80.153.232:443 @ ncredir2001 - TODO - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=codfw&var-cluster=ncredir - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[15:08:06] Traffic, DC-Ops, ops-codfw, ops-eqiad, and 2 others: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9722796 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp1114.eqiad.wmnet with OS bullseye
[15:21:02] Traffic, DC-Ops, ops-codfw, ops-eqiad, and 2 others: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9722934 (ssingh) Open→Resolved @Papaul deserves a lot of love for fixing this persistent issue.
The 21.x firmware (specifically, `N...
[15:25:49] Traffic, DC-Ops, ops-codfw, ops-eqiad, and 2 others: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9722986 (MoritzMuehlenhoff) >>! In T350179#9722934, @ssingh wrote: > @Papaul deserves a lot of love for fixing this persistent issue. The 2...
[15:26:50] Traffic, DC-Ops, ops-codfw, ops-eqiad, and 2 others: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9722990 (MatthewVernon) +1 to thanks to Papaul for getting to the bottom of this!
[15:43:38] hello traffic friends: in an upcoming etcd maintenance, we may need to temporarily direct all etcd clients to eqiad, which includes pybal in certain DCs (those with affinity to codfw). like adding a new service, this means a puppet change and pybal restart.
[15:43:55] I wanted to confirm that the fairly straightforward (but still a bit scary) procedure in https://wikitech.wikimedia.org/wiki/LVS#Configure_the_load_balancers is still generally up-to-date (i.e., there aren't surprises that make this more complex). e.g., does Liberica need anything special?
[15:44:27] swfrench-wmf: we are not using Liberica in production yet
[15:44:38] but to answer your question, this is simply changing the etcd endpoint right?
[15:44:52] to attempt to answer your question with a question :)
[15:45:20] ah, great - also yes exactly: this points them at a different etcd node temporarily
[15:45:40] yeah, in that case, a simple puppet change and restarting of Pybal is enough to pick that up
[15:46:56] great, thank you very much!
[15:47:30] swfrench-wmf: Traffic is happy to take care of that; do you have a task and timing?
[15:48:34] still TBD whether we'll actually need to do this (depends on a couple of factors around scheduling the change), but if we do, I'll follow up with you folks at least a week ahead of time. the task is T358636.
[15:48:35] T358636: etcdmirror does not recover from a cleared waitIndex - https://phabricator.wikimedia.org/T358636
[15:49:23] also thank you :)
[15:50:21] thanks! basically, we will need to update hieradata/role/eqiad/lvs/balancer.yaml and point profile::pybal::config_host to wherever we want and restart pybal
[15:50:28] anyway, just ping us here please and we can take care of it
[15:53:09] Traffic, DC-Ops, ops-codfw, ops-eqiad, and 2 others: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9723209 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp1114.eqiad.wmnet with OS bullseye c...
[17:01:38] Traffic, DC-Ops, ops-magru: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9723606 (RobH) >>! In T362729#9721040, @ssingh wrote: > Thanks for the task @RobH! As in the previous runs, please feel free to leave these for Traffic: > > ` > Update the operations/puppet rep...
[17:10:15] sukhe: XioNoX: topranks: let me know if at some point you'd like to revive Probenet (T334417) to test end-user-latency of magru
[17:10:16] T334417: Receive network latency reports from the browsers - https://phabricator.wikimedia.org/T334417
[17:10:32] cdanis: we most certainly do and will :)
[17:10:50] yep, thanks Chris!
[17:10:52] <3 great, all we really need to do it is Varnish running there
[17:11:02] it was on our list of to-do; for now, we just added magru as the last default but we will need your help there!
[17:11:27] yep :) there are a few other small changes needed, but it should be easy
[17:11:39] and I have to resurrect the data pipeline at some point, but again, shouldn't be too hard
[17:11:39] thanks, will try to do the legwork and send for review
[17:11:47] and what we don't know, we will ask you
[17:13:14] cdanis: while we have you here, re: https://phabricator.wikimedia.org/T359054
[17:13:20] ask anything you want, I'd love to use this as an excuse to get it more well-understood or to do some other infra work on it ;)
[17:13:38] so far the idea is to pick a region in Brazil and a small-ish Spanish-speaking country to ramp up traffic
[17:13:41] ah I'm reading
[17:13:49] if you have any ideas/suggestions, please add them there
[17:16:16] sukhe: so, what's the primary concern? warming up the caches? getting a reasonable test of the site on a small number of users before widening the possible impact of any issue? making reasonable latency mappings of users?
[17:16:55] the first two primarily, cache and testing
[17:17:21] I'll go digging in the old Probenet data to get the beginnings of an answer about the northern part (Ecuador, Colombia, Venezuela, Guyana)
[17:17:44] thanks!
[17:34:29] (HAProxyRestarted) firing: HAProxy server restarted on cp1114:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=eqiad%20prometheus/ops&var-instance=cp1114&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[17:39:14] er
[18:58:39] netops, Traffic, Infrastructure-Foundations, SRE, Patch-For-Review: ASW single-point of failure for LVS VIPs at POPs - https://phabricator.wikimedia.org/T362772#9724034 (cmooney) I believe the two patches above, once merged, will add the required redundancy. Following option 1 above, creatin...
[19:02:46] netops, Traffic, Infrastructure-Foundations, SRE, Patch-For-Review: ASW single-point of failure for LVS VIPs at POPs - https://phabricator.wikimedia.org/T362772#9724049 (cmooney) Perhaps one option would be to ignore the puppet patch to change drmrs and esams for now - but merge the Homer one...
[19:12:03] Traffic, Infrastructure-Foundations, SRE: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps - https://phabricator.wikimedia.org/T359054#9724065 (CDanis) I largely agree with Arzhel's assessment. At a cursory glance, Uruguay or Paraguay look ideal as first candidates....
[19:54:40] Wikimedia-Apache-configuration: Unit tests for apache config/rewrites - https://phabricator.wikimedia.org/T57857#9724219 (RLazarus) Open→Resolved
[21:34:29] (HAProxyRestarted) firing: HAProxy server restarted on cp1114:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=eqiad%20prometheus/ops&var-instance=cp1114&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[21:34:39] again :(
[21:41:44] Is this related to the NIC firmware downgrade, I wonder?
[21:42:38] Traffic, Data-Persistence, SRE, SRE-swift-storage, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#9724526 (Jdlrobson)
[21:46:54] Oh, I see. The restart check is still complaining about the >=1 number of failed restarts, it didn't restart again
[22:28:55] Traffic: Improve HAProxy unexpected restart alert - https://phabricator.wikimedia.org/T362833 (BCornwall) NEW
[22:29:12] Traffic: Improve HAProxy unexpected restart alert - https://phabricator.wikimedia.org/T362833#9724661 (BCornwall) Open→In progress p:Triage→High
[23:43:36] Traffic: Improve HAProxy unexpected restart alert - https://phabricator.wikimedia.org/T362833#9724793 (ssingh) Thanks for the task!
At least for now, I restarted `haproxy` so that we don't get this alert and we also don't leave it silenced in case the initial restart (below) was nothing more than a transient...