[00:24:37] serviceops, SRE, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (tstarling) >>! In T297517#7566856, @brennen wrote: > We're currently on 1.38.0-wmf.9, and this remains a block...
[00:27:24] serviceops, SRE, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (tstarling) The only thing unique to this report as compared to T296098 and T296063 is the failure mode, i.e. m...
[00:55:19] serviceops, SRE, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (brennen) > Is tuning the kernel the thing that you want unbroken now? Again, it has probably been broken for y...
[00:55:30] serviceops, SRE, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (brennen)
[00:58:39] serviceops, SRE, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (ssastry) Since the train was rolled forward from wmf.9 -> wmf.12 today, [[ https://grafana.wikimedia.org/d/000...
[02:27:02] serviceops, SRE, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (tstarling) I filed T297667 for the PHP bug which I'm working on.
[04:22:59] serviceops, SRE, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (Ladsgroup) It indeed looks like wmf.12 has increased db traffic: https://grafana.wikimedia.org/d/000000278/mys...
[04:30:21] serviceops, SRE, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (Ladsgroup) Created {T297669} for the database issue.
[07:14:06] serviceops, SRE, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (Joe) >>! In T297517#7568203, @tstarling wrote: >>>! In T297517#7566856, @brennen wrote: >> We're currently on...
[07:16:54] serviceops, SRE, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (Joe) >>! In T297517#7568208, @tstarling wrote: > The only thing unique to this report as compared to T296098 a...
[07:44:35] serviceops, MW-on-K8s, Release-Engineering-Team: Build MediaWiki images for kubernetes on the deployment servers - https://phabricator.wikimedia.org/T297673 (Joe)
[08:12:15] serviceops, Performance-Team (Radar): Migrate WMF Production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 (Joe)
[08:12:24] serviceops: Allow coexisting php version in our puppet code - https://phabricator.wikimedia.org/T293450 (Joe) Open→Resolved p:Triage→High
[08:13:14] serviceops, Infrastructure-Foundations, Mail, SRE, Znuny: OTRS/mail: investigate why "T=remote_smtp_signed: all hosts for 'ticket.wikimedia.org' have been failing for a long time" - https://phabricator.wikimedia.org/T297160 (akosiaris) Had a quick look at that. It is true that we never have r...
[13:54:55] serviceops, Infrastructure-Foundations, Mail, SRE, Znuny: OTRS/mail: investigate why "T=remote_smtp_signed: all hosts for 'ticket.wikimedia.org' have been failing for a long time" - https://phabricator.wikimedia.org/T297160 (akosiaris) p:Triage→Low Code found. https://github.com/znuny...
[14:20:14] serviceops, SRE, ops-codfw: Installation issues on PowerEdge R440 Kafka main codfw servers with buster / firmware update needed - https://phabricator.wikimedia.org/T297422 (Marostegui) p:Triage→Medium
[14:29:00] NAME READY SECRET AGE
[14:29:02] certificate.cert-manager.io/jayme-testcert True jayme-testcert-tls-certificate 21s
[14:29:06] \o/
[14:31:33] serviceops, SRE, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (hashar) The issue appeared with wmf.12 which is fully deployed now and it does not seem we will roll it back....
[14:40:02] serviceops, SRE, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (ssastry) >>! In T297517#7569567, @hashar wrote: > The issue appeared with wmf.12 which is fully deployed now a...
[14:52:56] serviceops, SRE, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (Ladsgroup) So I have been working on this on several fronts (with Daniel and Tim). The [[https://gerrit.wikime...
[15:02:47] <_joe_> ottomata: citing martinfowler, cough cough JAVA ALERT :D
[15:03:17] _joe_: ?
[15:03:33] <_joe_> https://phabricator.wikimedia.org/T291120#7569574
[15:03:36] <_joe_> :P
[15:04:34] <_joe_> let's say I'm not a fan of his work in general
[15:04:40] ha, seemed the most reputable link for a definition? i see!
[15:05:27] i have no opinion on his work except for he's got good pages describing and defining things that I can link to :)
[15:05:52] <_joe_> yeah I was just teasing you
[15:06:35] :)
[15:11:59] serviceops, MediaWiki-extensions-TranslationNotifications, MW-1.38-notes (1.38.0-wmf.5; 2021-10-19): Fatal error: Uncaught Error: Class 'MediaWiki\MediaWikiServices' not found - mediawiki_job_translationnotifications - https://phabricator.wikimedia.org/T293702 (MatthewVernon)
[15:13:12] serviceops, Internet-Archive: Improve download speed from archive.org on appservers - https://phabricator.wikimedia.org/T295009 (MatthewVernon)
[15:17:27] to follow up on the istio egress stuff discussed yesterday
[15:18:16] I created a cert signed by the puppet CA via cergen for the istio egress k8s svc, and configured it to proxy to api-ro.discovery.wmnet and thanos-swift.discovery.wmnet
[15:19:16] then I instructed the kserve pods for revscoring/ores to use the k8s svc endpoint, setting as host header something.wikipedia.org
[15:19:46] the pods contact the egress gw via TLS, that in turn creates a TLS connection to api-ro.discovery.wmnet
[15:20:23] I think that there is a way to use TLS passthrough, but for the moment the above seems enough
[15:20:34] in theory we could apply rate limiting etc.. at the egress gw level
[15:21:02] this is not exactly like having a mesh with istio taking care of routing/proxying for the pods, but could be enough for the moment
[15:24:47] <_joe_> elukey: most of the stuff you need to call is in another cluster, in a different mesh anyways
[15:27:04] _joe_ yep yep, it is convenient to have the istio mesh for the automagic istio proxy stuff (rather than doing it manually)
[15:27:35] ok the approach doesn't seem insane from what I gathered, I'll keep going and add some real helm config :)
[15:31:57] serviceops, Infrastructure-Foundations, SRE-tools, netops: Support services VIPs with not marked as VIP in Netbox - https://phabricator.wikimedia.org/T295793 (MatthewVernon)
[15:40:36] serviceops, Prod-Kubernetes, Kubernetes: Helm chart dependencies no longer in requirements.yaml - https://phabricator.wikimedia.org/T295750 (MatthewVernon)
[15:59:00] serviceops, SRE, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (Ladsgroup) I need to go to a meeting but after that, I'll run a rolling restart
[16:05:09] serviceops, Wikimedia-Developer-Portal, Service-deployment-requests: New Service Request: developer-portal - https://phabricator.wikimedia.org/T297140 (MatthewVernon)
[16:06:12] serviceops, Internet-Archive, InternetArchiveBot: Determine appropriate API request limits for InternetArchiveBot - https://phabricator.wikimedia.org/T296577 (MatthewVernon)
[17:11:24] is eventgate-main still only pooled in eqiad? https://phabricator.wikimedia.org/T296699
[17:12:18] ottomata: No, I see it also pooled in codfw
[17:12:29] per https://config-master.wikimedia.org/pybal/codfw/eventgate-main
[17:13:11] <_joe_> mutante: that's not what he meant
[17:13:16] <_joe_> he meant pooled in discovery
[17:13:26] oh, ok
[17:14:00] <_joe_> ottomata: https://config-master.wikimedia.org/discovery/discovery-basic.yaml says yes
[17:14:16] <_joe_> confctl confirms :)
[17:20:10] serviceops, SRE, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (thcipriani) Documenting my understanding of this problem after reading this task (along with T297669 and T2976...
[17:35:07] serviceops, SRE, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (Ladsgroup) I don't have strong opinions but I think wmf.12 issues are "mitigated" (but not resolved) and wmf.1...
[17:40:28] serviceops, SRE, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (Joe) FWIW, I wholeheartedly agree with @thcipriani's opinions above. As for the remaining work: we need to ru...
[17:43:49] serviceops, SRE, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (ssastry) >>! In T297517#7570257, @thcipriani wrote: > > - I would prefer we either (a) abandon wmf.12 and roll...
[18:03:56] serviceops, MW-on-K8s: On the kube-experimental mwdebug cluster, MediaWiki sees all edits as coming from localhost - https://phabricator.wikimedia.org/T297613 (Joe) I added a debug script that just dumps $_SERVER, and indeed REMOTE_ADDR is 127.0.0.1, while on mwdebug1001 it's set to the IP address of the...
[18:04:58] serviceops, SRE, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (Zabe) >>! In T297517#7570358, @ssastry wrote: > [...] does that mean if wmf.13 had to be rolled back, it will...
[18:13:03] _joe_: mutante ok, so can we resolve that ticket?
[18:13:38] oh 'yes' means only pooled in eqiad
[18:14:04] <_joe_> lol yes I was about to ask you if you didn't feel comfortable repooling it yourself
[18:14:17] well, i don't fully understand the reasons why it was depooled
[18:14:24] <_joe_> me neither!
[18:14:36] <_joe_> as in, I haven't looked
[18:14:48] ok, i'll ask david
[18:14:52] dcausse: o/
[18:14:56] can we do https://phabricator.wikimedia.org/T296699 ?
[18:15:00] o/
[18:15:04] your response on there seems...yes?
[18:15:18] ottomata: yes please go ahead
[18:15:23] <_joe_> yeah looks like it
[18:17:58] i do this rarely, am reading conftool docs
[18:21:41] confctl --quiet --object-type discovery select 'dnsdisc=eventgate-main,name=codfw' set/pooled=true
[18:21:45] _joe_: ^ that look right to you?
[18:24:18] <_joe_> ottomata: minus the quiet
[18:24:19] <_joe_> :)
[18:24:25] <_joe_> be noisy about it
[18:24:49] kay :)
[18:26:33] did
[18:26:35] it
[18:26:37] ty
[18:28:59] serviceops, SRE, Patch-For-Review: parsoid servers are not matched by mw* cumin aliases - https://phabricator.wikimedia.org/T294802 (Dzahn) Open→Stalled stalled on https://gerrit.wikimedia.org/r/c/operations/puppet/+/736596/5
[19:02:08] serviceops, Maps, Product-Infrastructure-Team-Backlog, Patch-For-Review, User-jijiki: Maps 2.0 roll-out plan - https://phabricator.wikimedia.org/T280767 (Jgiannelos)
[19:54:48] serviceops, SRE, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (hashar) We will promote testwikis to wmf.13 in a few minutes. Tomorrow evening we would had wmf.12 running on...
[21:23:41] serviceops, SRE, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (hashar) After some deployment issue, 1.38.0-wmf.13 has reached group 0 wikis.
[23:32:22] serviceops, Parsoid: Compare Parsoid perf on current production servers vs a newer test server - https://phabricator.wikimedia.org/T297259 (Legoktm) a:Legoktm Benchmarking a current Parsoid server is straightforward, just need to depool it and start the script. Since all the new servers are non-Parso...
[23:39:33] serviceops, SRE, Patch-For-Review: parsoid servers are not matched by mw* cumin aliases - https://phabricator.wikimedia.org/T294802 (Dzahn) the "all-mw-*" aliases now include parsoid servers: ` before: [cumin1001:~] $ sudo cumin A:all-mw-eqiad 'uptime' 157 hosts will be targeted: mw[1302-1456].eqi...
[23:41:44] serviceops, SRE, Patch-For-Review: parsoid servers are not matched by mw* cumin aliases - https://phabricator.wikimedia.org/T294802 (Dzahn) Stalled→Resolved I did add them to "all-mw" while not touching core "mw". Based on Gerrit comments etc. Hope this still resolves it!
[23:56:14] serviceops, Parsoid: Compare Parsoid perf on current production servers vs a newer test server - https://phabricator.wikimedia.org/T297259 (ssastry) >>! In T297259#7571273, @Legoktm wrote: > Benchmarking a current Parsoid server is straightforward, just need to depool it and start the script. Since all...