[00:18:08] serviceops, CirrusSearch, MediaWiki-Configuration, MediaWiki-Engineering, Discovery-Search (Current work): Provide a method for internal services to run api requests for private wikis - https://phabricator.wikimedia.org/T345185 (aaron) If T246371 was done, then the stream updater could just m...
[08:23:08] serviceops, MW-on-K8s, SRE, Traffic, Release-Engineering-Team (Seen): Migrate mobileapps to k8s - https://phabricator.wikimedia.org/T350846 (Joe)
[08:23:26] serviceops, MW-on-K8s, SRE, Traffic, Release-Engineering-Team (Seen): Migrate mobileapps to k8s - https://phabricator.wikimedia.org/T350846 (Joe)
[08:24:02] serviceops, MW-on-K8s, SRE, Traffic, Release-Engineering-Team (Seen): Migrate mobileapps to k8s - https://phabricator.wikimedia.org/T350846 (Joe) a:Clement_Goubert→Joe
[08:41:59] hi folks
[08:42:47] I'm debugging an issue with varnish on cp1107 and I need to understand why/how en.wp.o is getting responses from mw-on-k8s
[08:43:15] serviceops, MW-on-K8s, SRE, Traffic, Release-Engineering-Team (Seen): Migrate mobileapps to k8s - https://phabricator.wikimedia.org/T350846 (JMeybohm) Couldn't we just add another mobileapps deployment (like a canary) that connects to mw-api-int and scale that up slowly while scaling the exis...
[08:44:10] oh.. I'm guessing that's ['default'] = 0.15, on the mw-on-k8s.lua.conf
[08:46:30] <_joe_> vgutierrez: yeah...
[08:46:36] * _joe_ hands over some coffee
[08:47:08] so that's expected.. what I'm not expecting is a 404 for a valid en.wp.o URL
[08:47:14] https://www.irccloud.com/pastebin/cXPcFGi1/
[08:47:40] <_joe_> do you think the problem is mw on k8s?
[08:47:48] not yet
[08:47:55] * vgutierrez trying to reproduce
[08:48:12] <_joe_> let me give you the curl to call mw on k8s
[08:49:34] <_joe_> curl -H 'Host: en.wikipedia.org' -H 'X-Forwarded-Proto: https' https://mw-web.discovery.wmnet:4450/wiki/Coffea_Liberica
[08:49:42] <_joe_> and it returns 404
[08:50:25] testing with for i in {1..100}; do curl --connect-to en.wikipedia.org:443:$(dig +short mw-web-ro.discovery.wmnet):4450 https://en.wikipedia.org/wiki/Coffea_liberica -v -o /dev/null -s 2>&1 |egrep HTTP/1.1; done
[08:50:33] <_joe_> same with an appserver
[08:50:37] that's a solid 200
[08:50:48] <_joe_> it's just an inexistent page on the backend right now
[08:50:55] :?
[08:51:01] it's a valid wiki page AFAIK
[08:51:01] <_joe_> curl -H 'Host: en.wikipedia.org' -H 'X-Forwarded-Proto: https' https://appservers-rw.discovery.wmnet/wiki/Coffea_Liberica -I gives the same result
[08:51:03] <_joe_> 404
[08:51:14] <_joe_> I'm just saying what I see right now
[08:51:27] _joe_: err.. I'm reading that webpage on my browser.. so we have some kind of issue here
[08:51:34] <_joe_> so something must be wrong yes
[08:51:40] <_joe_> in my curl
[08:51:49] <_joe_> not sure what
[08:54:25] <_joe_> can you summarize please?
[08:55:09] <_joe_> vgutierrez: capitalizations...
[08:55:25] duh
[08:55:31] I'm chasing a ghost
[08:55:32] sorry about that
[08:55:33] <_joe_> curl -H 'Host: en.wikipedia.org' -H 'X-Forwarded-Proto: https' https://appservers-rw.discovery.wmnet/wiki/Coffea_liberica 200 OK
[08:55:42] * vgutierrez going back to his varnish issue
[08:57:59] <_joe_> please double your coffee intake
[08:58:41] so my *real* issue is that varnish flagging stuff as HFP that it shouldn't
[09:01:07] <_joe_> uh that is NOT good
[09:01:13] <_joe_> it's all varnishes or just that one?
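
Editor's note on the 404 above: the two curl URLs differ only in the case of one letter (Coffea_Liberica vs Coffea_liberica). A minimal sketch of why those are different pages, assuming the usual en.wikipedia.org behaviour where only the first character of a title is case-insensitive. The function names are hypothetical and this is not MediaWiki's actual title code.

```python
def normalize_title(title: str) -> str:
    """Rough approximation of title normalization on a wiki where only
    the first character is case-insensitive (assumed for en.wikipedia.org)."""
    title = title.replace(" ", "_")
    # Uppercase the first character only; the rest stays case-sensitive.
    return title[:1].upper() + title[1:]


def same_page(a: str, b: str) -> bool:
    """True if both spellings would resolve to the same title under the
    simplified rule above."""
    return normalize_title(a) == normalize_title(b)


if __name__ == "__main__":
    # First-letter case does not matter under this rule.
    print(same_page("coffea_liberica", "Coffea_liberica"))
    # A capital letter later in the title makes it a different page.
    print(same_page("Coffea_Liberica", "Coffea_liberica"))
```

Under this rule the first comparison matches and the second does not, which is consistent with the 404 for Coffea_Liberica and the 200 for Coffea_liberica seen in the log.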
[09:04:16] new hosts in eqiad
[09:05:14] not currently pooled
[09:47:08] _joe_: found the issue
[09:47:14] https://www.irccloud.com/pastebin/okeOnuAl/
[09:47:47] _joe_: dunno what I'm missing here regarding XFF or the particular IP of cp1107.. but that "Expires:" header shouldn't be there at all
[09:48:24] in both cases mw returns a 200 with the expected content, but the headers aren't what we are expecting
[09:49:17] same for Cache-Control
[09:49:26] Cache-Control: private, must-revalidate, max-age=0 VS Cache-Control: s-maxage=86400, must-revalidate, max-age=0
[09:52:44] vgutierrez: wmf-config/reverse-proxy.php needs updating again I think.. last time we added the subnets for e/f[1-4] but since then 5-8 were also taken into use and allocated subnets not in the config file yet
[09:53:07] taavi: is wmf/reverse-proxy.php new?
[09:53:37] no
[09:53:38] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/wmf-config/reverse-proxy.php
[09:53:44] cause 10.64.22.0/ is private-1b-eqiad.. not new at all
[09:54:05] line 31... '10.64.16.0/22', # private1-b-eqiad
[09:54:28] sorry, I meant 10.64.16.0, not 10.64.22.0
[09:54:43] oh, sorry, I looked at netbox and saw B7 and somehow only read the 7 and thought it was in the new cage
[09:54:49] ignore me
[09:56:38] np :)
[10:02:23] serviceops: Upgrade the MediaWiki servers to ICU 67 - https://phabricator.wikimedia.org/T345561 (JMeybohm)
[10:03:30] serviceops: Upgrade the MediaWiki servers to ICU 67 - https://phabricator.wikimedia.org/T345561 (JMeybohm) Open→Resolved `updateCollation.php` has finished for all relevant wikis, resolving this
[10:03:37] serviceops, Dumps-Generation, MediaWiki-Platform-Team: Migrate WMF production from PHP 7.4 to PHP 8.1 - https://phabricator.wikimedia.org/T319432 (JMeybohm)
[10:09:11] <_joe_> vgutierrez: I fail to see how I could resolve that issue, though.
It seems a mw regression tbh
[10:09:22] <_joe_> unless it's specific to that cp host
[10:15:44] _joe_: just picking your brain, not saying that it should be addressed by you specifically :)
[11:07:56] serviceops, MW-on-K8s, SRE, Traffic, Release-Engineering-Team (Seen): Migrate mobileapps to k8s - https://phabricator.wikimedia.org/T350846 (Joe) >>! In T350846#9318457, @JMeybohm wrote: > Couldn't we just add another mobileapps release (like a canary) that connects to mw-api-int and scale th...
[11:44:45] serviceops, Content-Transform-Team-WIP, Parsoid, RESTBase, and 4 others: Requests originating from zhwiki wikifeeds caused parsoid outage - https://phabricator.wikimedia.org/T346657 (Joe) Given the patch is now live with the latest train, I've disabled the rule for now.
[11:46:55] serviceops, Content-Transform-Team-WIP, Parsoid, RESTBase, and 4 others: Requests originating from zhwiki wikifeeds caused parsoid outage - https://phabricator.wikimedia.org/T346657 (Joe) And good news, most requests to the endpoint now take 50-100ms to get a response, instead of 5-10 seconds....
[13:16:57] serviceops, Prod-Kubernetes, Kubernetes, User-jijiki: Deploy kube-state-metrics - https://phabricator.wikimedia.org/T264625 (kamila) A prototype deployment in eqiad with a pretty vanilla config: - generates <80k timeseries; this may be a bit more in codfw as it has more objects, so let's call it...
[13:40:58] serviceops, Prod-Kubernetes, Kubernetes, User-jijiki: Deploy kube-state-metrics - https://phabricator.wikimedia.org/T264625 (kamila) @fgiunchedi is adding up to 100k timeseries per k8s cluster OK?
[13:54:37] o/ jayme is it possible something in between envoy and schema.svc is terminating connections unexpectedly sometimes? https://phabricator.wikimedia.org/T350713#9316944
[14:06:08] there is nothing in between really. Could it be that the first request fails due to the idle timeout and the following ones succeed?
I think it's easy to try and align the timeouts (e.g. making the envoy one shorter) and see if it helps. If not, I would try to tcpdump on both ends to see what's going on
[14:07:19] serviceops, Prod-Kubernetes, Kubernetes, User-jijiki: Deploy kube-state-metrics - https://phabricator.wikimedia.org/T264625 (fgiunchedi) >>! In T264625#9319210, @kamila wrote: > @fgiunchedi is adding up to 100k timeseries per k8s cluster OK? Not sure off the bat, do you have a dump or a sample o...
[14:07:39] serviceops, MW-on-K8s, MediaWiki-Platform-Team (Radar), Patch-For-Review: mcrouter daemonset on mw-on-k8s - https://phabricator.wikimedia.org/T346690 (Krinkle)
[14:11:02] serviceops, MW-on-K8s, MediaWiki-Platform-Team, Patch-For-Review: mcrouter daemonset on mw-on-k8s - https://phabricator.wikimedia.org/T346690 (Krinkle) Moving to our inbox. This will require a change in wmf-config for the "mcrouter" BagOStuff instance.
[14:24:21] serviceops, Traffic: MW returns uncacheable responses for en.wikipedia.org when specific XFF values are sent - https://phabricator.wikimedia.org/T350861 (Fabfur)
[14:25:20] serviceops, Prod-Kubernetes, Kubernetes, User-jijiki: Deploy kube-state-metrics - https://phabricator.wikimedia.org/T264625 (kamila) >>! In T264625#9319298, @fgiunchedi wrote: >>>! In T264625#9319210, @kamila wrote: >> @fgiunchedi is adding up to 100k timeseries per k8s cluster OK? > > Not sure...
[14:38:21] jayme: okay thanks, i was tcpdumping on the envoy end but it was difficult to figure out what belonged to the 503s. didn't try on the schema.svc side.
[14:39:14] might as well align the timeouts, i suppose i'll do it on the nginx side and set the keepalive_timeout to 1h?
[14:48:30] serviceops, Prod-Kubernetes, Kubernetes, User-jijiki: Deploy kube-state-metrics - https://phabricator.wikimedia.org/T264625 (fgiunchedi) >>! In T264625#9319374, @kamila wrote: >>>! In T264625#9319298, @fgiunchedi wrote: >>>>!
In T264625#9319210, @kamila wrote: >>> @fgiunchedi is adding up to 100k...
[15:23:34] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Migrate mobileapps to k8s - https://phabricator.wikimedia.org/T350846 (Joe) As it's clear from the patches, I chose to take the sage advice of @JMeybohm and go down the path of least resistance :)
[15:24:32] <_joe_> ottomata: my suggestion with envoy is to set its idle timeout well below the limit of the keepalive on its upstream
[15:24:45] <_joe_> so set it to 10 seconds, say
[15:37:17] oh, okay, so the other way around.
[15:37:44] okay, i'll do that. to apply that, i need to merge the change in puppet, run puppet on deployment host, and redeploy app?
[15:45:18] doing for eventstreams-internal...
[15:54:34] wow, huh, looks like the 503s are going away... will wait a bit longer
[15:57:04] serviceops, Traffic: MW returns uncacheable responses for en.wikipedia.org when specific XFF values are sent - https://phabricator.wikimedia.org/T350861 (Joe)
[16:11:02] serviceops, Data-Engineering, Data Engineering and Event Platform Team (Sprint 4), Event-Platform: [Event Platform] eventgate-wikimedia occasionally fails to produce events due schema fetch errors - https://phabricator.wikimedia.org/T350713 (Ottomata) @joe's suggestion: > my suggestion with envo...
[16:42:23] serviceops, Data-Engineering, Data Engineering and Event Platform Team (Sprint 4), Event-Platform: [Event Platform] eventgate-wikimedia occasionally fails to produce events due schema fetch errors - https://phabricator.wikimedia.org/T350713 (Ottomata) Ah, nope. I spoke too soon: eventstreams: h...
[16:42:48] _joe_: still have 503s: https://phabricator.wikimedia.org/T350713#9319976
[16:46:10] <_joe_> same number or less?
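
Editor's note on the 15:24 suggestion: the idea is that the client-side proxy (envoy) should give up an idle connection before its upstream's keepalive expires, so it never reuses a connection the upstream is about to close. A minimal sketch of that comparison follows. The 10 second value comes from the log; the 75 second upstream keepalive, the one second margin, and the function names are assumptions for illustration, since the real upstream setting is not stated in the channel. Note that, per the log, the 503s were still there after the change, so this timeout ordering was apparently not the whole story.

```python
def first_to_close(proxy_idle_s: float, upstream_keepalive_s: float) -> str:
    """Return which side closes an idle connection first."""
    if proxy_idle_s < upstream_keepalive_s:
        return "proxy"
    if proxy_idle_s > upstream_keepalive_s:
        return "upstream"
    return "race"


def idle_timeout_is_safe(proxy_idle_s: float,
                         upstream_keepalive_s: float,
                         margin_s: float = 1.0) -> bool:
    """True if the proxy closes idle connections at least margin_s seconds
    before the upstream would, so it should not reuse a dying connection."""
    return proxy_idle_s + margin_s <= upstream_keepalive_s


if __name__ == "__main__":
    # 10s is the value suggested in the log; 75s is a hypothetical
    # upstream keepalive used only for this illustration.
    print(first_to_close(10, 75), idle_timeout_is_safe(10, 75))
    # A proxy idle timeout of 1h against the same upstream would let the
    # upstream close first, which is the case the suggestion tries to avoid.
    print(first_to_close(3600, 75), idle_timeout_is_safe(3600, 75))
```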
[16:46:25] <_joe_> sorry in meetings neverending for another hour
[16:47:15] looks about the same
[16:48:07] https://grafana.wikimedia.org/goto/3SjKIj4Ik?orgId=1
[16:51:25] serviceops, Growth-Team, Growth-Team-Filtering, StructuredDiscussions, and 2 others: [{exception_id}] {exception_url} Flow\Exception\FlowException from line 397 of /srv/mediawiki/php-1.33.0-wmf.8/extensions/Flow/includes/Block/TopicListBlock.php: The `newes... - https://phabricator.wikimedia.org/T211798
[16:55:33] serviceops, Data-Engineering, Data Engineering and Event Platform Team (Sprint 4), Event-Platform: [Event Platform] eventgate-wikimedia occasionally fails to produce events due schema fetch errors - https://phabricator.wikimedia.org/T350713 (Ottomata) Very similar sounding issue: https://github....
[16:59:27] serviceops, MW-on-K8s: Handle sidecar containers in one-off Kubernetes jobs - https://phabricator.wikimedia.org/T348284 (bd808)
[20:00:27] serviceops, Data-Engineering, Data Engineering and Event Platform Team (Sprint 4), Event-Platform: [Event Platform] eventgate-wikimedia occasionally fails to produce events due schema fetch errors - https://phabricator.wikimedia.org/T350713 (Ottomata) Added a retry_policy for 5xx, but still getti...
[20:53:52] serviceops, Data-Engineering, Data Engineering and Event Platform Team (Sprint 4), Event-Platform, Patch-For-Review: [Event Platform] eventgate-wikimedia occasionally fails to produce events due schema fetch errors - https://phabricator.wikimedia.org/T350713 (Ottomata) Related, I think: {T263...
[20:59:14] serviceops, CirrusSearch, MediaWiki-Configuration, MediaWiki-Engineering, Discovery-Search (Current work): Provide a method for internal services to run api requests for private wikis - https://phabricator.wikimedia.org/T345185 (EBernhardson) I'm still not sure that would be better off in a j...
[22:26:53] serviceops, CirrusSearch, MediaWiki-Configuration, MediaWiki-Engineering, Discovery-Search (Current work): Provide a method for internal services to run api requests for private wikis - https://phabricator.wikimedia.org/T345185 (aaron) I mean that the job would never be enqueued into kafka, i...
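
Editor's note on the two Cache-Control values compared at 09:49:26: the first carries "private" and the second carries "s-maxage=86400", which is the difference between a response a shared cache must not store and one it may store for a day. A minimal sketch of that distinction follows. It is a simplified reading of the HTTP caching directives, not Varnish's actual VCL logic or the production configuration, and the function names are hypothetical.

```python
from typing import Dict, Optional


def parse_cache_control(value: str) -> Dict[str, Optional[str]]:
    """Split a Cache-Control header value into lowercase directive names
    mapped to their argument (or None when the directive has no argument)."""
    directives: Dict[str, Optional[str]] = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip() if sep else None
    return directives


def shared_cache_may_store(value: str) -> bool:
    """Simplified check: refuse on private/no-store, otherwise require a
    positive s-maxage. Real caches apply more rules than this."""
    directives = parse_cache_control(value)
    if "private" in directives or "no-store" in directives:
        return False
    s_maxage = directives.get("s-maxage")
    return s_maxage is not None and s_maxage.isdigit() and int(s_maxage) > 0


if __name__ == "__main__":
    # The two header values quoted in the log at 09:49:26.
    print(shared_cache_may_store("private, must-revalidate, max-age=0"))
    print(shared_cache_may_store("s-maxage=86400, must-revalidate, max-age=0"))
```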