[01:57:22] serviceops: Migrate WMF Production from PHP 7.2 to a newer version - https://phabricator.wikimedia.org/T271736 (tstarling) Reading https://github.com/ruflin/Elastica/issues/1913 , it looks like the way out of that infinite regression is to just use --ignore-platform-req=php, or patch composer.json, since the...
[06:38:18] serviceops, SRE, Wikifeeds: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (elukey) From 2021-09-04 restbase has been reporting a lot of connection errors (to what seems to be Wikifeeds, judging from the URI): https://logstash.wikimedia.org/goto...
[06:39:06] good morning elukey :)
[06:47:28] jayme: o/
[06:48:14] I am checking some UAs in turnilo for the wikifeeds issue, all timeouts in restbase seem to be related to the WikipediaApp UA (but not sure if there are other things that call the same URI)
[06:48:54] IIRC there is this weird restbase-calls-wikifeeds-calls-restbase situation which we might be seeing here
[06:48:56] if I run `curl https://en.wikipedia.org/api/rest_v1/page/random/summary -i` multiple times in a row I consistently get a 504 after some tries
[06:49:22] maybe there was a new app version or something starting on the 4th
[06:50:26] ack
[06:50:29] would it be worth raising the cpu throttling limits a bit to see if anything improves?
[06:51:06] I still don't see much of a difference there compared to pre-problem times
[06:51:48] yeah I know, but it may play a role in this; ideally, in my mind, pods shouldn't have cpu throttling in these use cases
[06:52:48] true, but in reality they almost always have :)
[06:53:52] but we should try, you are right
[06:54:27] AFAICT the main throttling is on tls-proxy... or do you have a different interpretation, elukey?
[06:56:03] I didn't check yesterday, I'm trusting you; and the tls-proxy returns a ton of 503s
[06:56:14] so not really timeouts or something weird like that
[06:56:45] elukey: I'm not sure about that. See my comment from last night. I think those are all upstream connection failures *to* restbase
[06:57:58] yes yes I agree
[06:58:08] for various URIs
[06:58:28] what I am seeing from the restbase point of view is a lot of connection timeouts / disconnects / etc. for the same URI
[06:58:52] that is /api/rest_v1/page/random/summary, mostly called by WikipediaApp apparently and handled by wikifeeds IIUC
[06:58:57] that in turn calls restbase
[06:59:02] getting a 503
[06:59:09] but only sometimes
[07:00:36] https://w.wiki/42Pw doesn't show anything really helpful
[07:01:02] there are some WikipediaApp UAs that start around the first of September
[07:01:07] but nothing matching what we are seeing
[07:04:09] jayme: it is interesting though that on the 4th we had the mw api outage - https://phabricator.wikimedia.org/T290374
[07:04:36] yeah, there is a big spike in almost every dashboard
[07:05:04] I was looking at this zoom https://grafana.wikimedia.org/goto/Nu6pDmS7k
[07:05:11] and it seems as if it was the trigger
[07:05:35] agreed
[07:07:49] is it possible that pods are somehow in a weird state (maybe the nodejs app is)?
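A quick way to turn the curl check quoted above (06:48:56) into numbers is to loop it and tally the status codes. This is only a sketch of that repro: the sample size of 50 and the tallying are choices made here, not anything from the original session.

    # Hit the endpoint from the curl check above repeatedly and count the
    # returned status codes; 50 requests is an arbitrary sample size.
    for i in $(seq 1 50); do
      curl -s -o /dev/null -w '%{http_code}\n' \
        'https://en.wikipedia.org/api/rest_v1/page/random/summary'
    done | sort | uniq -c

A healthy run should show only 200s; the intermittent 503s/504s described above would show up as their own counts.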
[07:08:46] there are pods in the wikifeeds namespace 4d old, and others more recent, but they are all showing the tls-proxy 503s
[07:08:49] serviceops, SRE, Wikifeeds: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (MSantos) The source of the failure could be this one in Wikifeeds https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-2021.09.07?id=jmADv3sB9aenX452C...
[07:09:04] I was wondering if it was only a subset of old pods
[07:09:27] elukey: most of them were evicted yesterday
[07:09:43] we can do the following: roll restart, check if things change
[07:10:04] if not: increase CPU resources on tls-proxy
[07:10:08] turn it off and on again, yes
[07:10:28] I see from the task that a patch is incoming, but I'd be curious to see if a roll restart improves things
[07:10:55] jayme: +1 from my side to test a roll-restart
[07:11:02] doing
[07:14:40] hmm, the document msantos posted should just lead to a 500, not a 50{3,4}
[07:14:54] elukey: restart completed
[07:15:36] nice
[07:16:08] I don't see 503s from the tls-proxy
[07:16:34] ahaahhaah
[07:16:38] * elukey cries in a corner
[07:16:59] ah no wait, I saw one just now in a pod
[07:17:16] but maybe once in a while is ok
[07:18:21] the restbase errors went down A LOT though (from logstash)
[07:18:38] same from the envoy dashboards
[07:20:34] still see a lot of upstream errors though (from the link that you posted in the task)
[07:22:24] elukey: logstash?
[07:23:23] jayme: no no, grafana
[07:23:53] ah no wait, I see that the recent datapoints are ok
[07:23:56] nevermind
[07:24:03] ok
[07:24:10] I got tricked by the colors, selecting one-by-one shows the real thing
[07:24:15] 2rps is fine :)
[07:25:21] serviceops, SRE, Wikifeeds: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (elukey) Thanks @MSantos Update from IRC: @JMeybohm and I noticed that in the k8s wikifeeds graphs, the rise of the errors (Sept 4th ~02:30 UTC) corresponded to a b...
[07:25:25] updated the task --^
[07:25:57] elukey: I had envoy debug logging activated for a couple of minutes in the now-gone pod wikifeeds-production-6db5957576-z7mnk (should be in logstash)
[07:26:49] anything interesting?
[07:26:52] I still do think the callbacks from wikifeeds to restbase were at fault
[07:27:07] not done checking. It's a mess to read them in logstash
[07:27:29] yeah, I added to the task that it was an inconsistent state of the app or the tls-proxy that was causing the intermittent 503s
[07:27:47] or the tls-proxy of restbase, or restbase :D
[07:27:47] the "why" is a mystery though
[07:28:23] jayme: we have roll-restarted wikifeeds so I am inclined to point the finger at it; the rest self-recovered
[07:28:58] maybe lingering tcp connections in the proxy?
[07:29:08] elukey: sure. But that potentially terminated some old tcp connections etc...
[07:29:10] yeah
[07:30:09] maybe envoy tried to write to old sockets broken by the mw api outage, never really cleaning them up for some reason
[07:31:10] jayme: are there any other services with the tls-proxy container?
[07:31:21] I'd be interested to check their status
[07:31:23] elukey: each and every one
[07:32:38] this smells like a weird envoy corner case / bug that we'll never find
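One way to poke at the lingering-connection theory above would be the Envoy admin interface of a tls-proxy sidecar. Everything concrete below is an assumption or placeholder: the pod name is made up, 9901 is Envoy's conventional admin port (the chart may bind it elsewhere or not expose it at all), and the grep pattern matches standard Envoy cluster counters rather than anything confirmed during this incident.

    # Placeholder pod name: pick a real one from the `get pods` output.
    POD=wikifeeds-production-xxxxxxxxxx-xxxxx
    kubectl -n wikifeeds get pods                          # ages and restart counts
    kubectl -n wikifeeds port-forward "$POD" 9901:9901 &   # 9901 = assumed Envoy admin port
    sleep 2
    # Active/destroyed upstream connections and 5xx counters per cluster.
    curl -s localhost:9901/stats | grep -E 'upstream_cx_active|upstream_cx_destroy|upstream_rq_5'
    kill %1                                                # stop the background port-forward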
[07:33:55] jayme: side note - how do I roll restart something in k8s? (if there is a doc I can RTM)
[07:34:36] helmfile -e codfw --state-values-set roll_restart=1 sync
[07:34:53] lovely, thanks
[07:35:02] there potentially is some documentation but I don't know
[07:35:18] it's actually a hack in helmfile.yaml
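For completeness, the restart-and-verify sequence roughly sketched out: the helmfile line is the invocation quoted above, while the kubectl checks and the deployment name are guesses added here for illustration (the deployment name isn't in the log; it is inferred from the pod name mentioned earlier).

    # Roll-restart wikifeeds in codfw (invocation quoted above); run it from
    # wherever the wikifeeds helmfile lives.
    helmfile -e codfw --state-values-set roll_restart=1 sync
    # Then confirm the rollout actually cycled the pods.
    kubectl -n wikifeeds get deployments
    kubectl -n wikifeeds rollout status deploy/wikifeeds-production   # deployment name assumed
    kubectl -n wikifeeds get pods                                     # ages should all be fresh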
[07:38:14] ok, wikifeeds looks ok now; we can leave the task open for the moment, but it will be tough to get anything more out of it
[07:48:38] agreed
[07:49:47] from the debug logs I've captured, it seems as if POST requests to mwapi had failed/timed out. But that does not really match the UF failures I saw yesterday (which clearly targeted restbase)
[07:50:22] thanks for the help and the work!
[07:50:45] same to you
[07:52:44] now let's get back to the work of throwing even more envoy into the infra :p
[08:46:41] serviceops, Patch-For-Review, SRE Observability (FY2021/2022-Q1), User-fgiunchedi, User-jijiki: Handle unknown stats in rsyslog_exporter - https://phabricator.wikimedia.org/T210137 (fgiunchedi)
[10:25:00] serviceops, Prod-Kubernetes, Kubernetes: Add label kubernetes.io/metadata.name to all namespaces - https://phabricator.wikimedia.org/T290476 (JMeybohm) p: Triage→Medium
[11:14:00] serviceops, MW-on-K8s, SRE: Evaluate istio as an ingress for production usage - https://phabricator.wikimedia.org/T287007 (JMeybohm)
[12:49:26] jelto: you're not playing around with staging-codfw currently, are you?
[12:49:50] I'm planning to fire a round of istio at it
[12:51:27] jayme: no, feel free to do your istio tests
[13:45:58] serviceops, SRE, ops-codfw: mw2264 went down - https://phabricator.wikimedia.org/T290242 (Papaul) Open→Resolved @Dzahn I checked the server today, I have no errors showing on A1, closing this task. If we have the error again please reopen the task. Thanks
[13:48:17] serviceops, SRE, ops-codfw: mw2264 went down - https://phabricator.wikimedia.org/T290242 (Dzahn) Thank you @Papaul, I will repool the server.
[15:07:21] serviceops, MW-on-K8s, Performance-Team, SRE, WikimediaDebug: Ensure WikimediaDebug "log" and "profile" features work with k8s-mwdebug - https://phabricator.wikimedia.org/T288164 (Krinkle)
[15:07:27] serviceops, MW-on-K8s, SRE, SRE Observability: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (Krinkle)
[15:10:26] serviceops, MW-on-K8s, Performance-Team, SRE, WikimediaDebug: Ensure WikimediaDebug "log" and "profile" features work with k8s-mwdebug - https://phabricator.wikimedia.org/T288164 (Krinkle) @dpifke Effie did some benchmarking today for which XHGui was needed. tideways is installed and enabled...
[16:09:57] serviceops, MW-on-K8s, Performance-Team, SRE, WikimediaDebug: Ensure WikimediaDebug "log" and "profile" features work with k8s-mwdebug - https://phabricator.wikimedia.org/T288164 (jijiki)
[16:14:06] serviceops, MW-on-K8s: mediawiki-debug image does not produce profiling info - https://phabricator.wikimedia.org/T290485 (jijiki)
[16:14:30] serviceops, MW-on-K8s: mediawiki-debug image does not produce profiling info - https://phabricator.wikimedia.org/T290485 (jijiki)
[16:14:32] serviceops, MW-on-K8s, SRE: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (jijiki)
[16:51:35] cross post from releng: I would like to delete the WMCS instance 'gitlab' in the project gitlab-test due to quota limits. The instance has been stopped for a week. Is anyone still using the instance? Please let me know. Maybe jbond?
[16:52:25] serviceops: Migrate WMF Production from PHP 7.2 to a newer version - https://phabricator.wikimedia.org/T271736 (Reedy) >>! In T271736#7335188, @tstarling wrote: > Reading https://github.com/ruflin/Elastica/issues/1913 , it looks like the way out of that infinite regression is to just use --ignore-platform-re...
[16:53:35] jelto: that is the machine that S&F and releng used for development, I would check with brennen
[17:34:57] serviceops, MW-on-K8s, Kubernetes: Kubernetes timeing out before pulling the mediawiki-multiversion image - https://phabricator.wikimedia.org/T284628 (dancy) @JMeybohm By the way, I think I managed to get the 'jenkins' k8s account auto-banned in the staging cluster while experimenting on Friday. At...
[18:02:11] serviceops, MW-on-K8s, SRE: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (jijiki) Our initial benchmarks that @akosiaris ran showed that k8s was slower than baremetal, while at higher concurrencies the difference between the two was smaller. We have observed our b...
[18:57:50] serviceops, MW-on-K8s: mediawiki-debug image does not produce profiling info - https://phabricator.wikimedia.org/T290485 (jijiki) Open→Resolved a: jijiki
[18:57:54] serviceops, MW-on-K8s, Performance-Team, SRE, WikimediaDebug: Ensure WikimediaDebug "log" and "profile" features work with k8s-mwdebug - https://phabricator.wikimedia.org/T288164 (jijiki)
[18:58:00] serviceops, MW-on-K8s, SRE: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (jijiki)
[19:23:40] serviceops, MediaWiki-Cache, MediaWiki-General, Performance-Team, User-jijiki: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (Krinkle) For the MW side, we have these two fairly popular polyfills for third-parties: * [Obsidi...
[22:41:39] serviceops, GitLab, Release-Engineering-Team (Next), User-brennen: GitLab major version upgrade: 14.x - https://phabricator.wikimedia.org/T289802 (brennen) > I've uploaded 14.0.10, we can bump the import hook after the initial update is complete. Thanks - today got away from me, planning to run...
[23:11:36] serviceops, MediaWiki-Cache, MediaWiki-General, Performance-Team, User-jijiki: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (Legoktm) >>! In T245464#7337536, @Krinkle wrote: > For the MW side, we have these two fairly popul...
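Circling back to the T271736 thread quoted at the top of the log and in Reedy's reply above: the two escape hatches tstarling mentions could be exercised roughly as below. This is a generic Composer sketch assuming Composer 2.x; it is not the actual procedure used for mediawiki/vendor, and the PHP version shown is purely illustrative.

    # See which platform requirements the installed packages currently fail on.
    composer check-platform-reqs

    # Option 1: ignore the PHP platform requirement for this resolution.
    composer update --ignore-platform-req=php

    # Option 2: pretend the target PHP is already in place by pinning the
    # platform in composer.json (version here is illustrative only).
    composer config platform.php 7.4.0
    composer update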