[08:52:39] serviceops, MW-Interfaces-Team, RESTBase Sunsetting, Traffic: Switchover plan from RESTbase to REST Gateway for rest_v1/page/html and rest_v1/page/title endpoints - https://phabricator.wikimedia.org/T374683#10227914 (akosiaris) Hi, I just had enough time to review this. This can't be implemente...
[08:54:40] serviceops, Electron-PDFs: Download to PDF: HTTP 500 error on some wikis for some users - https://phabricator.wikimedia.org/T376438#10227922 (TheDJ) I also experience it for https://en.wikiversity.org/api/rest_v1/page/pdf/Motivation_and_emotion%2FBook%2F2024%2FDopamine_and_social_behaviour I don't see a...
[09:30:43] serviceops, Content-Transform-Team, Electron-PDFs: Download to PDF: HTTP 500 error on some wikis for some users - https://phabricator.wikimedia.org/T376438#10228049 (akosiaris) Adding content transform too.
[09:34:07] serviceops, Content-Transform-Team, Electron-PDFs: Download to PDF: HTTP 500 error on some wikis for some users - https://phabricator.wikimedia.org/T376438#10228068 (hnowlan) This appears to be a rerun of T375521 - temporary fix last time was a roll restart, but there's clearly a deeper issue.
[10:00:07] akosiaris: o/ qq - do we still use parsoid.svc.{eqiad,codfw}.wmnet ?
[10:01:45] there are some alerts for TLS certs expiring https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=parsoid, but for the Puppet CA
[10:02:06] I guess those are old and should be destroyed, but the records have IP addresses
[10:02:13] so not 100% sure where to look :D
[10:10:05] elukey: the cluster does not exist anymore, but I think we need to check if they're not used by the parsoid test servers
[10:11:02] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1042936 has been lingering for a while
[10:30:00] claime: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1072737
[10:30:07] I found it in the task
[10:30:32] so I think Alex already cleaned up everything, the last step should probably be to destroy the cert in the puppet CA
[10:30:53] yeah and remove the dns records
[10:31:09] https://gerrit.wikimedia.org/r/c/operations/dns/+/108025
[10:31:39] the link seems broken on my side :(
[10:31:49] https://gerrit.wikimedia.org/r/c/operations/dns/+/1080254
[10:31:56] bad copy x)
[10:34:25] +1ed :)
[10:36:24] ty
[10:39:44] shall I destroy the TLS certs?
[10:47:45] yeah I think you're good to go, akosiaris can maybe confirm?
[11:31:04] I ll have a look in a few
[11:41:38] elukey: yeah, those are old and not in use, remove them
[11:43:35] claime: should we merge this? I just rebased it. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1042936
[11:45:05] Have to check they're not used by mwdebug or some stuff like that
[11:45:17] I'll check in the afternoon
[12:49:55] akosiaris: both destroyed :)
[14:31:02] serviceops, MW-Interfaces-Team, RESTBase Sunsetting, Patch-For-Review: Switchover plan from RESTbase to REST Gateway for rest_v1/page/html and rest_v1/page/title endpoints - https://phabricator.wikimedia.org/T374683#10229418 (akosiaris) After discussing with @hnowlan, I think I was wrong. We alre...
[14:31:39] serviceops, Content-Transform-Team-WIP, Page Content Service, RESTBase Sunsetting: Change changeprops rules to pre-generate/invalidate cache directly to PCS rather than in restbase - https://phabricator.wikimedia.org/T348996#10229437 (MSantos) a:Jgiannelos
[14:44:09] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review, Sustainability (Incident Followup): Remove memory limits from critical cluster components (calico) - https://phabricator.wikimedia.org/T376976#10229555 (JMeybohm) Open→Resolved Alert rules have been deployed last week. The...
[14:47:02] serviceops, observability, Observability-Logging, Prod-Kubernetes, and 2 others: containerd logs are not properly parsed during ingestion to logstash - https://phabricator.wikimedia.org/T377132#10229550 (tappof) It sounds better to me than yesterday. We can use the grok pattern below to parse the...
[14:47:18] serviceops, observability, Observability-Logging, Prod-Kubernetes, and 2 others: containerd logs are not properly parsed during ingestion to logstash - https://phabricator.wikimedia.org/T377132#10229595 (JMeybohm) Sounds good, thanks. Alternatively, could we do the JSON transformation in rsyslog...
[14:57:02] serviceops, MW-Interfaces-Team, RESTBase Sunsetting, Patch-For-Review: Switchover plan from RESTbase to REST Gateway for rest_v1/page/html and rest_v1/page/title endpoints - https://phabricator.wikimedia.org/T374683#10229670 (akosiaris) @HCoplin-WMF The internal mappings on rest-gateway work, all...
[15:51:30] serviceops, MW-Interfaces-Team, RESTBase Sunsetting, Patch-For-Review: Switchover plan from RESTbase to REST Gateway for rest_v1/page/html and rest_v1/page/title endpoints - https://phabricator.wikimedia.org/T374683#10229927 (akosiaris)
[16:16:23] swfrench-wmf, akosiaris: lol, mwscript-cleanup crashed because it tried to destroy the prometheus release, I'll... do something about that 🤦
[16:17:35] actually kind of interesting why it failed! helmfile complained about `duplicate release "prometheus" found in namespace "mw-script": there were 2 releases named "prometheus" matching specified selector`
[16:20:06] * swfrench-wmf did not even realize we were using it there, heh
[16:21:15] we weren't until just now!
[16:23:40] ah, I see! but also ... wait, what is up with this error message
[16:26:38] great question
[16:26:58] (looking at the journal log for the unit and am puzzled)
[16:28:08] OH lol
[16:28:15] it's because we also passed RELEASE_NAME=prometheus
[16:28:31] so in the helmfile it's defined once directly and once via the template
[16:28:47] * swfrench-wmf facepalms
[16:28:52] this rules
[16:28:55] yup, that checks out
[16:28:59] nice find
[16:29:22] okay so actually hm, I mailed you a patch that excludes it at mwscript-cleanup, and that's the right thing to do, but we might also want to have the helmfile error out if you try to pass RELEASE_NAME=prometheus
[16:29:34] I guess it already does, but, error out more explicably
[16:29:45] nothing good would ever come of that
[16:31:32] +1 and yeah, that would definitely be nice to have
[16:53:36] that did it, SystemdUnitFailed alerts should clear shortly 👍
[17:52:44] LoL, missed that. Thanks for fixing
[20:18:44] serviceops, Patch-For-Review: echostore's TLS certificate expires on 2024-10-13 - https://phabricator.wikimedia.org/T376766#10231249 (Scott_French) Open→Resolved The keys associated with the now-expired certificate were removed from private in 603c0251c89dbfb8a0075d79b8244260497cf216 (yes, th...
[20:19:47] serviceops, Wikimedia-Site-requests, WMF-General-or-Unknown, Patch-For-Review: Setup missing.php layer redirects for wikipedia hosting the other projects too - https://phabricator.wikimedia.org/T376923#10231242 (Pppery) Tagging #serviceops to review the Puppet patch.
[20:48:42] serviceops, observability, Observability-Logging, Prod-Kubernetes, and 2 others: containerd logs are not properly parsed during ingestion to logstash - https://phabricator.wikimedia.org/T377132#10231388 (colewhite) RSyslog in this pipeline simply adds metadata and forwards it on - this is still w...
[21:20:43] serviceops, Data-Persistence, Patch-For-Review: Sessionstore's discovery TLS cert will expire before end of May 2024 - https://phabricator.wikimedia.org/T363996#10231579 (Eevans) >>! In T363996#10226099, @hnowlan wrote: >>>! In T363996#10220536, @elukey wrote: >> @hnowlan if echostore turns out to wo...
[21:57:42] serviceops, SRE, Wikimedia-Apache-configuration, Wikimedia-Portals, Patch-For-Review: www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work - https://phabricator.wikimedia.org/T318285#10231718 (Pppery)
[22:10:57] serviceops, Patch-For-Review: Prepare PHP 8.1 service images for Shellbox - https://phabricator.wikimedia.org/T374502#10231746 (jijiki) p:Triage→Medium