[00:01:44] (VarnishHighThreadCount) firing: (5) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[00:11:44] (VarnishHighThreadCount) firing: (5) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[00:16:44] (VarnishHighThreadCount) resolved: (3) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[09:34:07] Traffic, DNS, SRE, Stewards-and-global-tools, Wikimedia-Hackathon-2023: Wikimedia + DNS issues/ideas mapping *(Rotterdam+Athens+online) - https://phabricator.wikimedia.org/T332971 (Vituzzu)
[09:37:06] Traffic, DNS, SRE, Stewards-and-global-tools, Wikimedia-Hackathon-2023: Wikimedia + DNS issues/ideas mapping *(Rotterdam+Athens+online) - https://phabricator.wikimedia.org/T332971 (Vituzzu)
[10:09:58] hey y'all, the doc for finding the primary LVS servers is out of date, how can I find out?
[10:10:12] I need to check the pybal log to see why a bunch of appservers are seen as down
[10:12:10] claime: https://github.com/wikimedia/operations-puppet/blob/production/modules/profile/manifests/lvs/configuration.pp
[10:12:22] XioNoX: ty
[10:21:03] claime: what doc are you referring to?
[10:21:41] also.. the secondary LVS should also be useful for that
[10:21:42] vgutierrez: It's not actually stale, I'm an idiot
[10:21:48] claime: <3
[10:22:01] vgutierrez: Although there's something you can probably help me understand
[10:22:06] sure, shoot
[10:22:17] I have a bunch of mw appservers being reported as down in alertmanager
[10:22:31] But I can't find a mention of some of them in pybal's log
[10:22:33] alert link?
[10:22:47] https://alerts.wikimedia.org/?q=%40state%3Dactive&q=team%3Dsre&q=alertname%3DPybalBackendDown
[10:24:15] Hmm, the alert time is set to the time of the last down for all of them, which may be why I'm not finding most of them in recent logs
[10:26:21] https://www.irccloud.com/pastebin/d79WCFfQ/
[10:26:39] that's the current status from lvs2009, so the alerts are stale for some reason
[10:27:04] vgutierrez@lvs2009:~$ curl -s http://localhost:9090/alerts
[10:27:04] OK - All pools are healthy
[10:27:09] and the reason is not pybal :)
[10:28:26] lvs2010 is consistent with lvs2009
[10:29:22] ok, that's less worrying
[10:29:30] So now to figure out why the alert is stale
[10:29:50] <_joe_> what is the metric in the alert?
[10:30:09] pybal_monitor_status == 0
[10:30:18] lol.. we got team-sre/pybal.yaml and team-traffic/pybal.yaml
[10:30:42] errr
[10:30:45] for the last 12h
[10:31:21] so if a host is flagged as down for 5 seconds it's gonna alert for the next 12h
[10:31:27] <_joe_> I guess that's the problem
[10:31:34] unless I'm reading the alert wrongly
[10:31:36] godog: ^^
[10:32:13] <_joe_> I meant the expression is the problem
[10:32:14] https://www.irccloud.com/pastebin/ukCliZOq/
[10:32:50] <_joe_> yeah I think it's just a wrong expression?
[10:32:56] why?
[10:33:07] <_joe_> as you said
[10:33:19] <_joe_> it will alert for 12 hours if it was down for 5 seconds
[10:33:26] yep.. that's more the for: 12h part
[10:33:29] than the query itself
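For context: the expression quoted at 10:30 plus the `for: 12h` hold means the alert is driven entirely by the exported gauge, so an alert that refuses to resolve points at series that are sitting at 0 in Prometheus regardless of what PyBal itself reports. The deployed rule lives in the operations/alerts repo (the team-sre/pybal.yaml / team-traffic/pybal.yaml files mentioned above), so treat the following as an illustrative sketch, not the actual rule; the ad-hoc query lists which series have been at 0 for the whole window and are therefore keeping the alert active:

    # The rule's shape, per the discussion above, is roughly
    #   expr: pybal_monitor_status == 0
    #   for:  12h
    # This ad-hoc query shows which (host, monitor, service) series have sat at
    # 0 for the entire 12h window:
    min_over_time(pybal_monitor_status[12h]) == 0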
[10:33:41] (checking)
[10:34:11] so we need something like pybal_monitor_status is 0 longer than X minutes in a row
[10:34:23] <_joe_> sorry gotta go afk for a bit
[10:38:50] the expression is correct in the sense that it already does check if pybal_monitor_status is 0 longer than X minutes in a row
[10:39:10] and from lvs2009 I see this
[10:39:11] pybal_monitor_status{host="mw2359.codfw.wmnet",monitor="IdleConnection",service="appservers-https_443"} 1.0
[10:39:15] pybal_monitor_status{host="mw2359.codfw.wmnet",monitor="ProxyFetch",service="appservers-https_443"} 0.0
[10:40:19] not sure how to check whether proxyfetch is actually passing and the metric / pybal internal status is just incorrect?
[10:40:46] So it's pybal's internal metric that's wrong?
[10:41:21] I'm trying to verify whether that's the case, yeah
[10:42:21] what does pybal's log file say :)
[10:42:55] it should do a proxyfetch attempt every 30s iirc and log that
[10:44:04] yeah I see some failures for proxyfetch but not for mw hosts
[10:44:34] but e.g. lvs2010 is fine
[10:44:41] pybal_monitor_status{host="mw2359.codfw.wmnet",monitor="IdleConnection",service="appservers-https_443"} 1.0
[10:44:43] hmm log is ok and pool status is ok as well
[10:44:44] pybal_monitor_status{host="mw2359.codfw.wmnet",monitor="ProxyFetch",service="appservers-https_443"} 1.0
[10:45:55] I'm checking mw2359's access logs to see if the pybal fetches show up there
[10:48:09] i.e. tail -f /var/log/apache2/*.log | grep -i pagegetter
[10:48:25] which does show probes, now to figure out which lvs they come from
[10:49:40] thoughts on how to do this? the IPs in the access logs are mw2359's
[10:50:06] pybal logs the status of the monitor (and the status of the host) for every log of proxyfetch, so every 30s. does that not match the prom metric?
[10:50:47] and pybal health checks are done from the lvs server's main ip
[10:50:57] I'm not seeing successes (only failures) for proxyfetch in pybal.log
[10:51:03] <_joe_> pybal sees all servers as enabled/up/pooled
[10:51:18] <_joe_> godog: yes that's normal, successes are only shown when running at debug log level
[10:51:27] <_joe_> lvs2009:~$ curl localhost:9090/pools/appservers-https_443
[10:51:34] <_joe_> this is how pybal sees its status
[10:51:51] <_joe_> if the prom metrics say proxyfetch is failing, that is incorrect
[10:51:59] indeed
[10:52:40] i wrote that prom metrics code in pybal at wikimania 2017, when i was already a director/manager for years, so clearly that code can't be trusted
[10:52:53] (in my defense, while working side by side with ema)
[10:53:16] heheh I do remember! I was at the table too
[10:53:56] ok so in mw2359's access log I see pagegetter fetching http://en.wikipedia.org/wiki/Special:BlankPage?force_php74=1 and http://en.wikipedia.org/wiki/Special:BlankPage
[10:54:19] though only two such requests at a time, not four, which suggests one pybal is not actually checking?
[10:54:52] or maybe I'm wrong, still trying to figure out where the requests are coming from
[10:58:45] That second url check should probably be commented out until we actually have different php versions to check
[10:58:51] (won't fix the current issue tho)
[11:02:58] Are both the secondary and primary LVS checking? Because on another completely unrelated mw server (in eqiad, and not one of the ones failing), I only see two at a time too
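On the metrics side, the lvs2009 vs lvs2010 comparison above was done by hand on each host; since each PyBal instance exports its own series, the same disagreement can be seen in a single query against the production Prometheus. A minimal sketch, assuming the scrape keeps the usual `instance` label to tell the two LVS hosts apart:

    # One series per scraped PyBal instance; a healthy backend should show 1.0
    # on both LVS hosts, so a lone 0.0 points at a stale exporter rather than a
    # genuinely failing backend.
    pybal_monitor_status{host="mw2359.codfw.wmnet", service="appservers-https_443", monitor="ProxyFetch"}

If one instance reports 0.0 while its /pools endpoint on the LVS host still shows the backend as enabled/up/pooled, it is the exported gauge that has gone stale.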
[11:03:40] claime: sorry I think I might have muddied the waters there, still verifying
[11:04:16] it makes sense that there are "two at a time" from a single proxyfetch checking both urls
[11:04:22] yeah
[11:09:41] but yeah I think it is pybal_monitor_status not updating as expected, pybal_monitor_up_results_total and pybal_monitor_down_results_total check out
[11:09:55] i.e. up_results increments and down_results doesn't
[11:12:55] ok I'll add that check to the expression
[11:13:10] easier than patching and deploying pybal for sure
[11:21:28] https://gerrit.wikimedia.org/r/c/operations/alerts/+/902690 claime vgutierrez
[11:22:35] cheers claime
[11:24:09] will be live in the next 30 min
[11:24:25] 👍
[11:24:35] Thanks for the investigation and fix <3
[11:25:51] you are welcome! I'm glad we could bandaid it
[11:45:00] I wonder why it's manifesting now and not before though
[11:52:49] godog: Alert disappeared, thanks
[11:58:04] claime: cause it's Friday and you're on call ;P
[11:58:15] vgutierrez: Probably
[11:58:20] :)
[12:46:23] lol
[15:38:40] Domains, Traffic, SRE: Acquire enwp.org - https://phabricator.wikimedia.org/T332220 (BCornwall) @Aklapper, I would agree to decline this but for the line mentioning that enwp.org is in widespread use. If it is (it'd be good to see some stats, @violetwtf!) then it might be worth accepting the donation...
[22:15:19] Domains, SRE, Traffic-Icebox: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080 (CRoslof) @BCornwall No updates at the moment. We have a lot of items in our enforcement queue, so it can take a while. If there is a particular need to have these domain names registered so that they...
[22:19:03] Domains, SRE, Traffic-Icebox: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080 (BCornwall) Thanks, @CRoslof! Was this not in the queue before? It's a pretty old ticket!
[22:43:03] Traffic, SRE, Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (BCornwall) Open→Stalled
[22:45:59] HTTPS, Traffic, SRE, Tracking-Neverending: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681 (BCornwall)
[22:46:07] HTTPS, Traffic, SRE: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (BCornwall) In progress→Stalled Is there any outcome to envision other than moving the domain? If not, I can close this and open another ticket to move store.wikimedia.org to a differ...
[22:50:02] Traffic, SRE, Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (BCornwall) p: Medium→Low
[23:15:38] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (Papaul)
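Going back to the PybalBackendDown fix from earlier in the day (11:09-11:21): the idea was to stop trusting the gauge on its own and cross-check it against the result counters, which were observed to keep incrementing correctly. The actual expression is whatever landed in the Gerrit change linked at 11:21; a hedged sketch of the idea, assuming the counters carry the same host/monitor/service labels as the gauge, could look like:

    # Only treat a backend as down when the down-results counter is still
    # incrementing; a pybal_monitor_status series stuck at 0 with no recent
    # down results is ignored. The 10m window is an arbitrary illustration.
    pybal_monitor_status == 0
      and on (host, monitor, service)
    increase(pybal_monitor_down_results_total[10m]) > 0

As noted at 11:25, this is a bandaid on the alerting side; the stale-gauge behaviour in PyBal's Prometheus metrics code would still need a fix of its own.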