[00:01:44] (VarnishHighThreadCount) firing: (5) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[00:11:44] (VarnishHighThreadCount) firing: (5) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[00:16:44] (VarnishHighThreadCount) resolved: (3) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[09:34:07] Traffic, DNS, SRE, Stewards-and-global-tools, Wikimedia-Hackathon-2023: Wikimedia + DNS issues/ideas mapping *(Rotterdam+Athens+online) - https://phabricator.wikimedia.org/T332971 (Vituzzu)
[09:37:06] Traffic, DNS, SRE, Stewards-and-global-tools, Wikimedia-Hackathon-2023: Wikimedia + DNS issues/ideas mapping *(Rotterdam+Athens+online) - https://phabricator.wikimedia.org/T332971 (Vituzzu)
[10:09:58] hey y'all, the doc for finding the primary LVS servers is out of date, how can I find out?
[10:10:12] I need to check the pybal log to see why a bunch of appservers are seen as down
[10:12:10] claime: https://github.com/wikimedia/operations-puppet/blob/production/modules/profile/manifests/lvs/configuration.pp
[10:12:22] XioNoX: ty
[10:21:03] claime: what doc are you referring to?
[10:21:41] also.. the secondary LVS should also be useful for that
[10:21:42] vgutierrez: It's not actually stale, I'm an idiot
[10:21:48] claime: <3
[10:22:01] vgutierrez: Although there's something you can probably help me understand
[10:22:06] sure, shoot
[10:22:17] I have a bunch of mw appservers being reported as down in alertmanager
[10:22:31] But I can't find a mention of some of them in pybal's log
[10:22:33] alert link?
[10:22:47] https://alerts.wikimedia.org/?q=%40state%3Dactive&q=team%3Dsre&q=alertname%3DPybalBackendDown
[10:24:15] Hmm, the alert time is set to the time of the last down for all of them, which may be why I'm not finding most of them in recent logs
[10:26:21] https://www.irccloud.com/pastebin/d79WCFfQ/
[10:26:39] that's the current status from lvs2009, so the alerts are stale for some reason
[10:27:04] vgutierrez@lvs2009:~$ curl -s http://localhost:9090/alerts
[10:27:04] OK - All pools are healthy
[10:27:09] and the reason is not pybal :)
[10:28:26] lvs2010 is consistent with lvs2009
[10:29:22] ok, that's less worrying
[10:29:30] So now to figure out why the alert is stale
[10:29:50] <_joe_> what is the metric in the alert?
[10:30:09] pybal_monitor_status == 0
[10:30:18] lol.. we got team-sre/pybal.yaml and team-traffic/pybal.yaml
[10:30:42] errr
[10:30:45] for the last 12h
[10:31:21] so if a host is flagged as down for 5 seconds it's gonna alert for the next 12h
[10:31:27] <_joe_> I guess that's the problem
[10:31:34] unless I'm reading the alert wrongly
[10:31:36] godog: ^^
[10:32:13] <_joe_> I meant the expression is the problem
[10:32:14] https://www.irccloud.com/pastebin/ukCliZOq/
[10:32:50] <_joe_> yeah I think it's just a wrong expression?
[10:32:56] why?
[10:33:07] <_joe_> as you said
[10:33:19] <_joe_> it will alert for 12 hours if it was down for 5 seconds
[10:33:26] yep.. that's more the for: 12h part
[10:33:29] than the query itself
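For context: the expression quoted at 10:30 plus the `for: 12h` hold means the alert is driven entirely by the exported gauge, so an alert that refuses to resolve points at series that are sitting at 0 in Prometheus regardless of what PyBal itself reports. The deployed rule lives in the operations/alerts repo (the team-sre/pybal.yaml / team-traffic/pybal.yaml files mentioned above), so treat the following as an illustrative sketch, not the actual rule; the ad-hoc query lists which series have been at 0 for the whole window and are therefore keeping the alert active:

    # The rule's shape, per the discussion above, is roughly
    #   expr: pybal_monitor_status == 0
    #   for:  12h
    # This ad-hoc query shows which (host, monitor, service) series have sat at
    # 0 for the entire 12h window:
    min_over_time(pybal_monitor_status[12h]) == 0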
[10:33:41] (checking)
[10:34:11] so we need something like pybal_monitor_status is 0 longer than X minutes in a row
[10:34:23] <_joe_> sorry gotta go afk for a bit
[10:38:50] the expression is correct in the sense that it already does check if pybal_monitor_status is 0 longer than X minutes in a row
[10:39:10] and from lvs2009 I see this
[10:39:11] pybal_monitor_status{host="mw2359.codfw.wmnet",monitor="IdleConnection",service="appservers-https_443"} 1.0
[10:39:15] pybal_monitor_status{host="mw2359.codfw.wmnet",monitor="ProxyFetch",service="appservers-https_443"} 0.0
[10:40:19] not sure how to check whether proxyfetch is actually passing and the metric / pybal internal status is just incorrect?
[10:40:46] So it's pybal's internal metric that's wrong?
[10:41:21] I'm trying to verify whether that's the case, yeah
[10:42:21] what does pybal's log file say :)
[10:42:55] it should do a proxyfetch attempt every 30s iirc and log that
[10:44:04] yeah I see some failures for proxyfetch but not for mw hosts
[10:44:34] but e.g. lvs2010 is fine
[10:44:41] pybal_monitor_status{host="mw2359.codfw.wmnet",monitor="IdleConnection",service="appservers-https_443"} 1.0
[10:44:43] hmm log is ok and pool status is ok as well
[10:44:44] pybal_monitor_status{host="mw2359.codfw.wmnet",monitor="ProxyFetch",service="appservers-https_443"} 1.0
[10:45:55] I'm checking mw2359's access logs to see if the pybal fetches show up there
[10:48:09] i.e. tail -f /var/log/apache2/*.log | grep -i pagegetter
[10:48:25] which does show probes, now to figure out which lvs they come from
[10:49:40] thoughts on how to do this? the IPs in the access logs are mw2359's
[10:50:06] pybal logs the status of the monitor (and the status of the host) for every log of proxyfetch, so every 30s. does that not match the prom metric?
[10:50:47] and pybal health checks are done from the lvs server's main ip
[10:50:57] I'm not seeing successes (only failures) for proxyfetch in pybal.log
[10:51:03] <_joe_> pybal sees all servers as enabled/up/pooled
[10:51:18] <_joe_> godog: yes that's normal, successes are only shown when running at debug log level
[10:51:27] <_joe_> lvs2009:~$ curl localhost:9090/pools/appservers-https_443
[10:51:34] <_joe_> this is how pybal sees its status
[10:51:51] <_joe_> if the prom metrics say proxyfetch is failing, that is incorrect
[10:51:59] indeed
[10:52:40] i wrote that prom metrics code in pybal at wikimania 2017, when i was already a director/manager for years, so clearly that code can't be trusted
[10:52:53] (in my defense, while working side by side with ema)
[10:53:16] heheh I do remember! I was at the table too
[10:53:56] ok so in mw2359's access log I see pagegetter fetching http://en.wikipedia.org/wiki/Special:BlankPage?force_php74=1 and http://en.wikipedia.org/wiki/Special:BlankPage
[10:54:19] though only two such requests at a time, not four, which suggests one pybal is not actually checking?
[10:54:52] or maybe I'm wrong, still trying to figure out where the requests are coming from
[10:58:45] That second url check should probably be commented out until we actually have different php versions to check
[10:58:51] (won't fix the current issue tho)
[11:02:58] Are both the secondary and primary LVS checking? Because on another completely unrelated mw server (in eqiad, and not one of the ones failing), I only see two at a time too
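On the metrics side, the lvs2009 vs lvs2010 comparison above was done by hand on each host; since each PyBal instance exports its own series, the same disagreement can be seen in a single query against the production Prometheus. A minimal sketch, assuming the scrape keeps the usual `instance` label to tell the two LVS hosts apart:

    # One series per scraped PyBal instance; a healthy backend should show 1.0
    # on both LVS hosts, so a lone 0.0 points at a stale exporter rather than a
    # genuinely failing backend.
    pybal_monitor_status{host="mw2359.codfw.wmnet", service="appservers-https_443", monitor="ProxyFetch"}

If one instance reports 0.0 while its /pools endpoint on the LVS host still shows the backend as enabled/up/pooled, it is the exported gauge that has gone stale.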
[11:03:40] claime: sorry I think I might have muddied the waters there, still verifying
[11:04:16] it makes sense that there are "two at a time" from a single proxyfetch checking both urls
[11:04:22] yeah
[11:09:41] but yeah I think it is pybal_monitor_status not updating as expected, pybal_monitor_up_results_total and pybal_monitor_down_results_total check out
[11:09:55] i.e. up_results increments and down_results doesn't
[11:12:55] ok I'll add that check to the expression
[11:13:10] easier than patching and deploying pybal for sure
[11:21:28] https://gerrit.wikimedia.org/r/c/operations/alerts/+/902690 claime vgutierrez
[11:22:35] cheers claime
[11:24:09] will be live in the next 30 min
[11:24:25] 👍
[11:24:35] Thanks for the investigation and fix <3
[11:25:51] you are welcome! I'm glad we could bandaid it
[11:45:00] I wonder why it's manifesting now and not before though
[11:52:49] godog: Alert disappeared, thanks
[11:58:04] claime: cause it's Friday and you're on call ;P
[11:58:15] vgutierrez: Probably
[11:58:20] :)
[12:46:23] lol
[15:38:40] Domains, Traffic, SRE: Acquire enwp.org - https://phabricator.wikimedia.org/T332220 (BCornwall) @Aklapper, I would agree to decline this but for the line mentioning that enwp.org is in widespread use. If it is (it'd be good to see some stats, @violetwtf!) then it might be worth accepting the donation...
[22:15:19] Domains, SRE, Traffic-Icebox: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080 (CRoslof) @BCornwall No updates at the moment. We have a lot of items in our enforcement queue, so it can take a while. If there is a particular need to have these domain names registered so that they...
[22:19:03] Domains, SRE, Traffic-Icebox: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080 (BCornwall) Thanks, @CRoslof! Was this not in the queue before? It's a pretty old ticket!
[22:43:03] Traffic, SRE, Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (BCornwall) Open→Stalled
[22:45:59] HTTPS, Traffic, SRE, Tracking-Neverending: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681 (BCornwall)
[22:46:07] HTTPS, Traffic, SRE: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (BCornwall) In progress→Stalled Is there any outcome to envision other than moving the domain? If not, I can close this and open another ticket to move store.wikimedia.org to a differ...
[22:50:02] Traffic, SRE, Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (BCornwall) p: Medium→Low
[23:15:38] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (Papaul)
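Going back to the PybalBackendDown fix from earlier in the day (11:09-11:21): the idea was to stop trusting the gauge on its own and cross-check it against the result counters, which were observed to keep incrementing correctly. The actual expression is whatever landed in the Gerrit change linked at 11:21; a hedged sketch of the idea, assuming the counters carry the same host/monitor/service labels as the gauge, could look like:

    # Only treat a backend as down when the down-results counter is still
    # incrementing; a pybal_monitor_status series stuck at 0 with no recent
    # down results is ignored. The 10m window is an arbitrary illustration.
    pybal_monitor_status == 0
      and on (host, monitor, service)
    increase(pybal_monitor_down_results_total[10m]) > 0

As noted at 11:25, this is a bandaid on the alerting side; the stale-gauge behaviour in PyBal's Prometheus metrics code would still need a fix of its own.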