[09:26:14] serviceops, Data-Persistence, Prod-Kubernetes: Reevaluate the requirement for dedicated sessionstore/kask nodes in wikikube clusters - https://phabricator.wikimedia.org/T379599 (JMeybohm) NEW
[09:37:00] serviceops, MediaWiki-extensions-PropertySuggester, MW-on-K8s, Wikidata, and 2 others: [PS] Update PropertySuggester update process for mwscript-k8s - https://phabricator.wikimedia.org/T376604#10311418 (ArthurTaylor) I had a quick look at approach #1, and it looks doable. The `BasicImporter` ther...
[11:24:40] serviceops, MediaWiki-extensions-PropertySuggester, MW-on-K8s, Wikidata, and 2 others: [PS] Update PropertySuggester update process for mwscript-k8s - https://phabricator.wikimedia.org/T376604#10311764 (Lucas_Werkmeister_WMDE) I think if the script runs relatively fast in production (if the wikit...
[12:29:36] serviceops, Thumbor, Patch-For-Review: Thumbor haproxy readiness check isn't failing on unhealthy pods - https://phabricator.wikimedia.org/T379561#10311957 (hnowlan) Open→In progress
[13:29:13] serviceops, DC-Ops, ops-eqiad, Prod-Kubernetes, Kubernetes: wikikube-ctrl1001.eqiad.wmnet: The CMOS battery has reached the end of its usable life or has failed. - https://phabricator.wikimedia.org/T379622 (JMeybohm) NEW
[14:01:20] serviceops, SRE, Train Deployments: MW script "eval.php" failing for "testcommonswiki" during train operations - https://phabricator.wikimedia.org/T379628 (jnuche) NEW
[14:01:41] serviceops, SRE, Train Deployments: MW script "eval.php" failing for "testcommonswiki" during train operations - https://phabricator.wikimedia.org/T379628#10312343 (jnuche) p:Triage→Unbreak!
[14:01:59] serviceops, SRE, Train Deployments: MW script "eval.php" failing for "testcommonswiki" during train operations - https://phabricator.wikimedia.org/T379628#10312346 (jnuche)
[14:04:02] serviceops, DC-Ops, ops-eqiad, Prod-Kubernetes, Kubernetes: wikikube-ctrl1001.eqiad.wmnet fails PXE boot - https://phabricator.wikimedia.org/T379629 (JMeybohm) NEW
[14:06:45] serviceops, DC-Ops, ops-eqiad, Prod-Kubernetes, Kubernetes: wikikube-ctrl1001.eqiad.wmnet fails PXE boot - https://phabricator.wikimedia.org/T379629#10312375 (JMeybohm)
[14:08:12] serviceops, SRE, Train Deployments: MW script "eval.php" failing for "testcommonswiki" during train operations - https://phabricator.wikimedia.org/T379628#10312371 (Urbanecm_WMF) This also started to affect backports: ` 14:02:56 Started scap sync-world: Backport for [[gerrit:1090455|[CirrusSearch] t...
[14:09:48] serviceops, SRE, Train Deployments: MW script "eval.php" failing for "testcommonswiki" during train operations - https://phabricator.wikimedia.org/T379628#10312388 (jnuche) Script works outside of the container apparently: > Gergő Tisza running it by hand seems to work fine for me > 3:05...
[14:12:54] serviceops, DC-Ops, ops-eqiad, Prod-Kubernetes, and 2 others: wikikube-ctrl1001.eqiad.wmnet: The CMOS battery has reached the end of its usable life or has failed. - https://phabricator.wikimedia.org/T379622#10312403 (akosiaris) This is one host that is past the 5 year old mark for what is worth....
[14:33:26] serviceops, SRE, Train Deployments: MW script "eval.php" failing for "testcommonswiki" during train operations - https://phabricator.wikimedia.org/T379628#10312530 (Tgr) See also {T379589} which seems to have the same cause (using a mock DB config for offline operations) but occurred at a later scap...
[14:54:33] serviceops, SRE, Train Deployments: MW script "eval.php" failing for "testcommonswiki" during train operations - https://phabricator.wikimedia.org/T379628#10312636 (CDanis) @bvibber @aude @Jdlrobson @CCiufo-WMF @Seddon FYI
[15:09:45] serviceops, DC-Ops, ops-eqiad, Prod-Kubernetes, and 2 others: wikikube-ctrl1001.eqiad.wmnet fails PXE boot - https://phabricator.wikimedia.org/T379629#10312707 (JMeybohm) Open→Resolved a:JMeybohm This was fixed by removing the internal NIC(s) as well as the unused port of the 10G...
[15:13:36] serviceops, Prod-Kubernetes, Kubernetes: Migrate wikikube-eqiad to containerd - https://phabricator.wikimedia.org/T377876#10312730 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.reimage-stacked-control-plane started by jayme@cumin2002 Reimaging k8s control planes of cluster wikikube-eqiad: container...
[15:15:05] serviceops, Prod-Kubernetes, Kubernetes: Migrate wikikube-eqiad to containerd - https://phabricator.wikimedia.org/T377876#10312736 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.reimage-stacked-control-plane started by jayme@cumin2002 Reimaging k8s control planes of cluster wikikube-eqiad: container...
[16:08:56] serviceops, SRE, Release-Engineering-Team (Radar), Train Deployments: MW script "eval.php" failing for "testcommonswiki" during train operations - https://phabricator.wikimedia.org/T379628#10313092 (brennen)
[16:30:42] serviceops, Citoid, VisualEditor, VisualEditor-MediaWiki-References, and 2 others: Register Citoid as a "friendly bot" (or alternatively verified bot) with Cloudflare - https://phabricator.wikimedia.org/T370118#10313191 (Mvolz) Any news?
[16:36:07] serviceops, Wikipedia-Android-App-Backlog: Timeout errors when making requests to Firebase for push notifications - https://phabricator.wikimedia.org/T379647 (Dbrant) NEW
[16:41:46] serviceops, SRE, Release-Engineering-Team (Radar), Train Deployments: MW script "eval.php" failing for "testcommonswiki" during train operations - https://phabricator.wikimedia.org/T379628#10313250 (dancy) scap 4.123.0 has been deployed which should address this problem.
[16:54:24] nemo-yiannis: re: T379647 is it possible that there's a new code path where we aren't setting the httpproxy env var?
[16:55:03] i don't think so, but i will check
[16:55:38] serviceops, Wikipedia-Android-App-Backlog: Timeout errors when making requests to Firebase for push notifications - https://phabricator.wikimedia.org/T379647#10313383 (Jgiannelos) I am reverting production envs only and leave staging in case SREs need to debug the actual running pod. FYI we tried to repr...
[16:55:40] I was about to say, from the logs it seems that you are trying to hit directly the firebase's public endpoint
[16:55:47] I reverted but left staging the failing version of the image so we can debug it
[16:56:15] also out of curiosity, do you have a 'canary' release in production?
[16:56:56] we do but the healthcheck doesn't rely on the outgoing requests to firebase
[16:56:59] so it didn't fail
[16:57:29] sure, but any rpcs that got routed there would fail, right?
[16:57:51] yes
[16:58:43] actually no, the req wouldn't fail because for security reasons we queue up outgoing messages and send all of them together after some time
[16:58:57] ah
[16:59:28] Does the current http proxy setup support outgoing ipv6 requests ?
[16:59:29] so (hypothetically) you would have to deploy just to canary, watch for a successful result, and then continue
[17:01:14] nemo-yiannis: yes, and I can connect to '[2001:4860:4802:36::37]:443' just fine from e.g. install1004 which is one of the machines running the proxies
[17:01:23] ok
[17:01:48] i think one other change that was introduced that we saw in the tests was that firebase starting using http2
[17:01:55] but tests were passing on CI
[17:02:03] would that be a potential cause ?
[17:04:09] meanwhile i am checking if http proxy config changed in the new version
[17:05:15] serviceops, DC-Ops, ops-eqiad, Prod-Kubernetes, and 2 others: wikikube-ctrl1001.eqiad.wmnet fails PXE boot - https://phabricator.wikimedia.org/T379629#10313473 (JMeybohm) Resolved→Open Unfortunately this worked only once. Now the PXE boot hangs right after "All rights reserved." with no f...
[17:11:05] nemo-yiannis: I manually edited the eqiad staging deployment to set the https_proxy environment variable as well, can you check if staging works now
[17:12:07] it seems possible they added a new code path for which the existing `httpAgent` in the nodejs api didn't work
[17:13:08] i double checked the api docs and the httpAgent parameter is the same, but maybe the agent instance we use doesnt work well with the the latest sdk
[17:22:02] cdanis: i sent some requests on staging
[17:22:11] lets wait for the queue to send messages
[17:23:14] ugh, it failed for a completely different reason on staging which was kinda expected
[17:23:25] anyway i will try things locally with an http proxy
[17:23:44] i believe the problem is that firebase now uses http2 and our agent implementation doesn't support it
[17:24:37] yeah I was just about to say
[17:24:44] nemo-yiannis: so, I think you ought to be able to replicate this locally
[17:28:36] serviceops, SRE, Release-Engineering-Team (Radar), Train Deployments: MW script "eval.php" failing for "testcommonswiki" during train operations - https://phabricator.wikimedia.org/T379628#10313587 (brennen) a:brennen→dancy
[17:28:40] cdanis: do you think i can use the same `url-downloader.eqiad.wikimedia.org:8080` locally?
[17:32:22] serviceops, SRE, Release-Engineering-Team (Radar), Train Deployments: MW script "eval.php" failing for "testcommonswiki" during train operations - https://phabricator.wikimedia.org/T379628#10313583 (brennen) Open→Resolved a:brennen > scap 4.123.0 has been deployed which should add...
[17:39:54] nemo-yiannis: one minute
[17:41:43] nemo-yiannis: https://phabricator.wikimedia.org/P71024
[17:41:50] you should be able to hack up `app` there to test locally :)
[17:41:59] ok thanks!
[17:44:00] serviceops, DC-Ops, ops-eqiad, Prod-Kubernetes, and 2 others: wikikube-ctrl1001.eqiad.wmnet fails PXE boot - https://phabricator.wikimedia.org/T379629#10313650 (JMeybohm) I've asked @VRiley-WMF / @Jclark-ctr on IRC if they could switch the cable from Slot 2 to Slot 1 (our default) to maybe convin...
[17:50:12] serviceops, Wikipedia-Android-App-Backlog: Timeout errors when making requests to Firebase for push notifications - https://phabricator.wikimedia.org/T379647#10313678 (CDanis) Given the shape of the error message and stack trace, I suspect that there's some new code path in FCM or its dependencies for wh...
[18:02:17] serviceops, SRE, Wikimedia-Site-requests, Russian-Sites: Increase $wgMaxArticleSize to 4MB for ruwikisource - https://phabricator.wikimedia.org/T308893#10313765 (MaryMunyoki)
[18:03:19] serviceops, SRE, Wikimedia-Site-requests, Patch-For-Review: Change $wgMaxArticleSize limit from byte-based to character-based - https://phabricator.wikimedia.org/T275319#10313778 (MaryMunyoki)
[18:33:48] serviceops, MediaWiki-Platform-Team (Radar), MW-1.44-notes (1.44.0-wmf.4; 2024-11-19): Regenerate UcfirstOverrides.php for PHP 7.4 -> 8.1 transition - https://phabricator.wikimedia.org/T372603#10314041 (Scott_French) For the record, the "verified consistent 7.4-like title-case behavior" part of T3726...
[18:45:51] cdanis: i managed to reproduce the issue locally. Thanks for the docker compose snippet.
[18:46:05] nemo-yiannis: excellent, did setting http_proxy fix it?
[18:46:12] no
[18:46:15] interesting
[18:46:23] sounds like an upstream bug then
[18:48:42] maybe something changed in the http agent config that i am missing, i will take a look tomorrow now that i can reproduce it 👋
[18:49:11] cheers :)
[19:13:56] serviceops, DC-Ops, ops-eqiad, Prod-Kubernetes, and 2 others: wikikube-ctrl1001.eqiad.wmnet fails PXE boot - https://phabricator.wikimedia.org/T379629#10314144 (JMeybohm) @Jclark-ctr reseated the cable into Slot 1 and while the link did not immediately show up via LED or iDRAC web-ui, it was show...
[19:51:12] serviceops, MW-on-K8s: Support output files in mwscript-k8s - https://phabricator.wikimedia.org/T379675 (Scott_French) NEW
[20:13:23] serviceops, MW-on-K8s: Support output files in mwscript-k8s - https://phabricator.wikimedia.org/T379675#10314304 (Scott_French) p:Triage→Low Triaging this as low-priority initially, since no other critical use cases for this functionality have surfaced yet, and the title-case mapping use case is...
[22:21:31] serviceops, Data Products, Epic: SDS 2.1.1 Evaluations of 3rd part Experimentation Platform by SRE Service Ops - https://phabricator.wikimedia.org/T369174#10314737 (odimitrijevic) Hi @Legoktm, the linked document has been abandoned and is not longer under consideration.
[22:24:29] serviceops, Data Products, Epic: SDS 2.1.1 Evaluations of 3rd part Experimentation Platform by SRE Service Ops - https://phabricator.wikimedia.org/T369174#10314743 (odimitrijevic) Open→Resolved This task along with https://phabricator.wikimedia.org/T369178 can be closed since the POCs are...
[23:32:49] serviceops, MW-on-K8s, TimedMediaHandler, Patch-For-Review, Video: Move video transcoding to use Shellbox - https://phabricator.wikimedia.org/T356241#10315018 (Scott_French) Following up on T356241#10291014: * With `max.poll.interval.ms` now set to 1h around 16:07 UTC, we've seen only a singl...