[00:04:11] serviceops, DNS, SRE, Traffic, Patch-For-Review: Create redirect from tj.*.org to tg.*.org - https://phabricator.wikimedia.org/T393803#10846254 (Scott_French) I just chatted with @jasmine_, who is interested in helping to deploy this change. Many thanks for preparing a patch, @Dzahn!
[09:19:05] serviceops: Ensure configcluster bootstraps cleanly - https://phabricator.wikimedia.org/T318699#10847262 (Clement_Goubert) Stalled→Declined
[09:20:03] serviceops, Prod-Kubernetes, Patch-For-Review: Helmfile !log messages do not indicate failed deployments - https://phabricator.wikimedia.org/T303900#10847264 (Clement_Goubert) a:Clement_Goubert→None
[09:20:50] serviceops, Datacenter-Switchover: imagecatalog_record.service fails due to read-only sqlite database - https://phabricator.wikimedia.org/T360652#10847266 (Clement_Goubert) a:Clement_Goubert→None
[09:21:02] serviceops, Add-Link, Growth-Team, Observability-Tracing, and 3 others: linkrecommendation-internal regularly uses more than 95% of its memory limit - https://phabricator.wikimedia.org/T357122#10847267 (Clement_Goubert) a:Clement_Goubert→None
[09:21:56] serviceops, Data-Persistence, SRE, Datacenter-Switchover: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907#10847269 (Clement_Goubert) a:Clement_Goubert→None
[09:22:25] serviceops, MediaWiki-Platform-Team (Radar): MediaWikiCronJobFailed - https://phabricator.wikimedia.org/T391574#10847270 (Clement_Goubert) In progress→Resolved
[09:28:31] serviceops, MW-on-K8s, Trust and Safety Product Team, MediaModeration (MediaModeration 2.1): Migrate MediaModeration jobs to mw-cron - https://phabricator.wikimedia.org/T385799#10847278 (Clement_Goubert) The failed job is because of the timeout in the command. It makes the container exit with cod...
[10:01:12] serviceops, MW-on-K8s, Trust and Safety Product Team, MediaModeration (MediaModeration 2.1), Patch-For-Review: Migrate MediaModeration jobs to mw-cron - https://phabricator.wikimedia.org/T385799#10847424 (Clement_Goubert) Removed the timeout, the current run will end up in error as well, and...
[10:18:09] serviceops, Growth-Team, GrowthExperiments, MW-on-K8s, Patch-For-Review: Migrate GrowthExperiments maintenance jobs to mw-cron - https://phabricator.wikimedia.org/T385782#10847459 (Michael) Hey @hnowlan or @Clement_Goubert, we are getting an alert about the `growthexperiments-refreshlinkrecom...
[10:22:40] serviceops, Growth-Team, GrowthExperiments, MW-on-K8s, Patch-For-Review: Migrate GrowthExperiments maintenance jobs to mw-cron - https://phabricator.wikimedia.org/T385782#10847468 (Clement_Goubert) If you look at `kubectl describe job` for these two: ` cgoubert@deploy1003:/srv/deployment-char...
[10:23:19] serviceops, Growth-Team, GrowthExperiments, MW-on-K8s, Patch-For-Review: Migrate GrowthExperiments maintenance jobs to mw-cron - https://phabricator.wikimedia.org/T385782#10847474 (hnowlan) It looks like one of the runs did fail - the logs can be seen in `kubectl logs growthexperiments-refres...
[10:23:30] serviceops, Growth-Team, GrowthExperiments, MW-on-K8s, Patch-For-Review: Migrate GrowthExperiments maintenance jobs to mw-cron - https://phabricator.wikimedia.org/T385782#10847476 (Clement_Goubert) Before I do delete it, here's the error ` xmfwiki fetching 500 tasks... RuntimeException f...
[10:30:30] hi, stupid question, where are the mw-cron jobs living? I want to make a change to one of them but I want to make sure to use the right Hosts:
[10:30:43] (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1149344)
[10:31:16] Amir1: That lives in k8s now
[10:31:23] so deploy1003.eqiad.wmnet
[10:31:34] deploy1003, thanks!
[10:32:07] Once merged, run puppet there, cd /srv/deployment-charts/helmfile.d/services/mw-cron; helmfile -e eqiad -i apply --context 5
[10:32:17] (or we can do it, your call)
[10:32:35] oh I didn't know we should do that
[10:32:45] though diff should be noop afaict?
[10:32:57] this one is noop but I'm adding a new cluster in the next patch
[10:33:01] ack
[10:33:14] serviceops, Growth-Team, GrowthExperiments, MW-on-K8s, Patch-For-Review: Migrate GrowthExperiments maintenance jobs to mw-cron - https://phabricator.wikimedia.org/T385782#10847497 (hnowlan) In progress→Resolved a:Clement_Goubert→hnowlan
[10:33:42] I need to add a "Deployment" section to https://wikitech.wikimedia.org/wiki/Mw-cron_jobs
[10:34:50] yeah, a heads up to the rest of SREs would be nice too. I wouldn't even have looked for docs (thinking puppet agent will take care of it)
[10:47:17] Amir1: On it.
[10:47:28] Thank you <3
[10:59:21] Amir1: https://wikitech.wikimedia.org/wiki/Mw-cron_jobs#Deploying_periodic_jobs does that make sense?
[10:59:45] amazing. Thanks!
[11:12:11] serviceops, MW-on-K8s, Patch-For-Review: Implement periodic maintenance scripts for mw-on-k8s - https://phabricator.wikimedia.org/T341555#10847643 (A_smart_kitten) >>! In T341555#10834838, @Clement_Goubert wrote: >>>! In T341555#10832816, @A_smart_kitten wrote: >> Thank you for the detailed response...
[11:14:39] Hey! I have a patch for removing a bunch of changeprop rules for PCS: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1148273 and PCS has been in prod for a week now. Is there anyone available to take a look?
[11:29:42] Amir1: did it go ok?
[11:29:57] still running puppet agent
[11:30:26] ack
[11:30:42] We'll have to dig into why it's *so* long to run puppet on the deployment server someday
[11:31:20] yeah...
[11:31:25] it's really slow
[11:31:41] it is
[11:31:48] believe me I know x)
[11:34:35] claime: deployed, all went well. It's a massive template for a cronjob 🙀
[11:34:50] Amir1: yeah
[11:35:07] It's got all the pod spec and all that jazz
[11:41:58] serviceops, Data-Persistence, MW-on-K8s: Migrate ParserCachePurging jobs to mw-cron - https://phabricator.wikimedia.org/T385800#10847725 (hnowlan) db_stats_lag migrated and looks like execution times shouldn't be a problem. See a few successful runs, and run times look like: ` Start Time: Thu, 22...
[11:42:20] serviceops, Data-Persistence, MW-on-K8s: Migrate ParserCachePurging jobs to mw-cron - https://phabricator.wikimedia.org/T385800#10847727 (hnowlan) In progress→Resolved a:Clement_Goubert→hnowlan
[13:07:14] serviceops, DBA, Editing-team (Tracking), MW-1.45-notes (1.45.0-wmf.1; 2025-05-13), and 2 others: Fatal exception of type "DBUnexpectedError: Database servers in extension1 are overloaded." affecting page views - https://phabricator.wikimedia.org/T393513#10848029 (Ladsgroup) Open→Resol...
[13:18:27] serviceops, Prod-Kubernetes, Kubernetes: Migration to containerd and away from docker - https://phabricator.wikimedia.org/T362408#10848073 (JMeybohm) Open→Stalled Stalled by {T387854}
[13:20:21] serviceops, MW-on-K8s, Patch-For-Review: Allow members of restricted to run maintenance scripts - https://phabricator.wikimedia.org/T378429#10848082 (JMeybohm) Open→Resolved Resolving this again since the updated patch does, for some reason, no longer trigger the race.
[15:25:24] serviceops, Prod-Kubernetes, Traffic, Kubernetes, Patch-For-Review: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10848967 (akosiaris) I did run the simple one `lang=bash deploy1003:~# ping -M do -s 1433 10.64.65.59 PING 1...
[15:30:42] serviceops, Language and Product Localization, Patch-For-Review: Migrate language_and_product_localization jobs to mw-cron - https://phabricator.wikimedia.org/T388539#10849025 (Clement_Goubert)
[15:32:32] serviceops, MW-on-K8s, Trust and Safety Product Team, MediaModeration (MediaModeration 2.1): Migrate MediaModeration jobs to mw-cron - https://phabricator.wikimedia.org/T385799#10849051 (Clement_Goubert)
[15:32:48] serviceops, MW-on-K8s, Trust and Safety Product Team, MediaModeration (MediaModeration 2.1): Migrate MediaModeration jobs to mw-cron - https://phabricator.wikimedia.org/T385799#10849055 (Clement_Goubert) In progress→Resolved Deleted the failed job to clear the alert, jobs are being co...
[15:38:32] serviceops, MediaWiki-Special-pages: Migrate updatequerypages/update_special_pages/initsitestats jobs to mw-cron - https://phabricator.wikimedia.org/T388534#10849077 (Clement_Goubert) In progress→Resolved
[15:52:01] serviceops, Security-Team, Patch-For-Review, SecTeam-Processed: Migrate Security-Team jobs to mw-cron - https://phabricator.wikimedia.org/T388531#10849157 (Clement_Goubert) Migration trial run didn't work ` cgoubert@deploy1003:/srv/deployment-charts/helmfile.d/services/mw-cron$ kubectl logs gene...
[15:52:17] serviceops, Prod-Kubernetes, Traffic, Kubernetes, Patch-For-Review: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10849159 (cmooney) Thanks @akosiaris that's great. TIL about that nmap script, that's really useful. Also th...
[16:06:13] serviceops, observability: Stale labels applies upon terminated job pod IP reuse - https://phabricator.wikimedia.org/T395052 (Scott_French) NEW
[16:06:23] serviceops, Security-Team, Patch-For-Review, SecTeam-Processed: Migrate Security-Team jobs to mw-cron - https://phabricator.wikimedia.org/T388531#10849236 (Clement_Goubert) There's a couple issues: # Missing `/usr/share/fonts/truetype/freefont/FreeMonoBoldOblique.ttf` => install the font package...
[16:16:07] serviceops, Security-Team, Security: MediaWiki periodic job generatecaptcha failed - https://phabricator.wikimedia.org/T395051#10849255 (A_smart_kitten) Tagging owning team & #serviceops (as the team handling the k8s cron job migrations). (side note, should the automatic @phaultfinder task creations...
[16:25:40] serviceops, Security-Team, Security: MediaWiki periodic job generatecaptcha failed - https://phabricator.wikimedia.org/T395051#10849279 (A_smart_kitten) Ah wait, it looks like this failure is already known about (xref T388531#10849157). Will leave it to the teams involved whether this should be merge...
[16:27:24] serviceops, Security-Team, Security: MediaWiki periodic job generatecaptcha failed - https://phabricator.wikimedia.org/T395051#10849284 (sbassett) →Duplicate dup:T388531
[16:27:30] serviceops, Security-Team, Patch-For-Review, SecTeam-Processed: Migrate Security-Team jobs to mw-cron - https://phabricator.wikimedia.org/T388531#10849286 (sbassett)
[16:48:19] serviceops, Security-Team, Patch-For-Review, SecTeam-Processed: Migrate Security-Team jobs to mw-cron - https://phabricator.wikimedia.org/T388531#10849409 (Clement_Goubert) Reverting for now, the font issue can be solved quickly, but fixing the shellout is going to be a little more complex. We ba...
[16:48:22] serviceops, Security-Team, Patch-For-Review, SecTeam-Processed: Migrate Security-Team jobs to mw-cron - https://phabricator.wikimedia.org/T388531#10849410 (Clement_Goubert) Open→Stalled
[17:06:04] serviceops, Trust and Safety Product Team, Patch-For-Review: Migrate trust_and_safety_product_team jobs to mw-cron - https://phabricator.wikimedia.org/T388542#10849482 (hnowlan) In progress→Resolved a:hnowlan
[17:08:16] serviceops, MediaWiki-Engineering: Clean up UcfirstOverrides.php following PHP 7.4 -> 8.1 transition - https://phabricator.wikimedia.org/T394556#10849493 (MSantos)
[17:15:58] serviceops, Discovery-Search (2025.05.02 - 2025.05.23), Patch-For-Review: Migrate discovery-search jobs to mw-cron - https://phabricator.wikimedia.org/T388538#10849515 (Clement_Goubert) `wikidata-updateQueryServiceLag` needs an egress networkpolicy to reach prometheus, reverted to `mwmaint` for now.
[19:14:59] serviceops, observability: Stale labels applied when the pod IP of a terminated k8s job is reused - https://phabricator.wikimedia.org/T395052#10850345 (Scott_French)
[21:12:57] serviceops, observability: Stale labels applied when the pod IP of a terminated k8s job is reused - https://phabricator.wikimedia.org/T395052#10850728 (Scott_French)
[21:23:00] serviceops, observability: Stale labels applied when the pod IP of a terminated k8s job is reused - https://phabricator.wikimedia.org/T395052#10850759 (Scott_French) Just came across https://github.com/prometheus/prometheus/issues/10755, which indeed recommends using `__meta_kubernetes_pod_phase` to skip...
[21:34:04] serviceops: Turn down MediaWiki image builds for PHP 7.4 - https://phabricator.wikimedia.org/T391057#10850789 (Scott_French) Open→Resolved Since a bit before 17:30 UTC today, we are no longer building PHP 7.4 ("publish" flavour) MediaWiki images during scap deployments. No further work is tracked...
[21:34:41] serviceops, MediaWiki-Engineering: Clean up UcfirstOverrides.php following PHP 7.4 -> 8.1 transition - https://phabricator.wikimedia.org/T394556#10850794 (Scott_French)
[22:39:27] serviceops, observability, Patch-For-Review: Stale labels applied when the pod IP of a terminated k8s job is reused - https://phabricator.wikimedia.org/T395052#10850903 (Scott_French)
[22:47:41] serviceops, SRE Observability, Patch-For-Review: Stale labels applied when the pod IP of a terminated k8s job is reused - https://phabricator.wikimedia.org/T395052#10850908 (Scott_French) Alright, I think https://gerrit.wikimedia.org/r/1149505 should do what's needed, in the three "sufficiently broad...