[06:41:46] serviceops, SRE, Traffic-Icebox, Trust and Safety Product Team: Add IP Info (ASN & Geolocation) to requests to MediaWiki - https://phabricator.wikimedia.org/T251933#10790816 (kostajh) Open→Declined
[07:08:55] Can anyone look at why machinetranslation (mint) is experiencing some pods not being able to come up after 6 or 7 hours?
[07:50:16] kart_: machinetranslation on eqiad and codfw is deployed with 2 replicas each - all of them ready afaict. What is it you are seeing/expecting?
[08:01:33] jayme: interesting. I see a traffic drop and an increase in translation time after a particular time. Let me paste logs.
[08:08:46] [2025-05-05 06:58:42 +0000] [540868] [INFO] Booting worker with pid: 540868
[08:08:46] and then,
[08:08:46] [2025-05-05 07:49:27 +0000] [1] [ERROR] Worker (pid:540868) was sent SIGKILL! Perhaps out of memory?
[08:09:18] Usually, the worker comes back after booting, but that's not happening here.
[08:12:21] kart_: how do you tell how many workers are running/active per pod?
[08:14:29] We've: GUNICORN_WORKERS: 4 # Match available CPUs
[08:15:52] that's config I suppose...I mean right now - as you're saying the worker does not come up again
[08:16:24] No, just log entries.
[08:16:24] from the service dashboards it seems like it's pretty normal to constantly hit its memory limit fwiw: https://grafana-rw.wikimedia.org/d/qrMmIJy4z/machinetranslation?orgId=1&refresh=1m&from=now-7d&to=now&viewPanel=79
[08:20:24] or at least more frequent since ~04:00Z this morning
[08:21:18] Yes.
[08:22:45] given the CPU and memory usage seemed to have increased while request rate dropped I would assume more expensive requests being served if nothing changed in code
[08:25:57] but I think I'm not able to properly interpret the service specific metrics like character count and translation time
[08:25:58] Yes. No change in the code. I'll take a look at that POV.
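The question of how many workers are actually alive per pod was answered only from log entries. A minimal sketch of that approach follows, assuming the gunicorn log format shown in the paste above. The function name `tally_workers` and the sample input are illustrative and not part of the machinetranslation service. The sketch only tracks the "Booting worker" and "was sent SIGKILL" lines, so other exit paths are not counted.

```python
import re

# Patterns match the two gunicorn log lines pasted in the channel.
BOOT_RE = re.compile(r"\[INFO\] Booting worker with pid: (\d+)")
KILL_RE = re.compile(r"\[ERROR\] Worker \(pid:(\d+)\) was sent SIGKILL!")


def tally_workers(lines):
    """Return (alive_pids, killed_pids) inferred from gunicorn log lines.

    Hypothetical helper: a worker counts as alive once it is booted and
    stops counting once the arbiter reports it was sent SIGKILL.
    """
    alive = set()
    killed = []
    for line in lines:
        match = BOOT_RE.search(line)
        if match:
            alive.add(int(match.group(1)))
            continue
        match = KILL_RE.search(line)
        if match:
            pid = int(match.group(1))
            alive.discard(pid)
            killed.append(pid)
    return alive, killed


# The two log lines from the conversation above.
sample = [
    "[2025-05-05 06:58:42 +0000] [540868] [INFO] Booting worker with pid: 540868",
    "[2025-05-05 07:49:27 +0000] [1] [ERROR] Worker (pid:540868) was sent SIGKILL! Perhaps out of memory?",
]
alive, killed = tally_workers(sample)
print(len(alive), killed)
```

Comparing `len(alive)` against the configured `GUNICORN_WORKERS: 4` for one pod's log would show whether a killed worker was ever replaced. This is a rough cross-check from logs, not a substitute for inspecting the running processes.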
[12:53:47] serviceops, Discovery-Search, Analytics-Data-Problem, Data-Platform-SRE (2025-05-02 - 2025-05-23): Search Update Pipeline requests to Action API are logged as coming from 127.0.0.1 - https://phabricator.wikimedia.org/T388855#10791953 (Gehel)
[12:58:45] serviceops, Continuous-Integration-Config, Patch-For-Review: Several recent slow (>15 minute) helm-lint job runs - https://phabricator.wikimedia.org/T387781#10792001 (hashar) Stalled→Declined I have made long explanation on https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1...
[13:23:33] Hello ServiceOps, is anyone able to review https://gerrit.wikimedia.org/r/c/operations/puppet/+/838182 ? It's a change to envoy that'll let us enable/disable Elastic at the DC level without having to make a mwconfig change
[13:27:04] serviceops, Traffic, Content-Transform-Team (Work In Progress), Essential-Work, Patch-For-Review: Purging edge caches doesn't work for articles with ":" in their title - https://phabricator.wikimedia.org/T392849#10792134 (MSantos)
[13:27:21] serviceops, Traffic, Content-Transform-Team (Work In Progress), Essential-Work, Patch-For-Review: Purging edge caches doesn't work for articles with ":" in their title - https://phabricator.wikimedia.org/T392849#10792135 (MSantos) a:Jgiannelos
[14:01:12] serviceops, Traffic, Content-Transform-Team (Work In Progress), Essential-Work, Patch-For-Review: Purging edge caches doesn't work for articles with ":" in their title - https://phabricator.wikimedia.org/T392849#10792370 (Jgiannelos) After a bit of investigation here is where I am at: * For a...
[14:01:37] serviceops, Traffic, Content-Transform-Team (Work In Progress), Essential-Work, Patch-For-Review: Purging edge caches doesn't work for articles with ":" in their title - https://phabricator.wikimedia.org/T392849#10792385 (Jgiannelos) cc @hnowlan
[14:05:02] serviceops, Infrastructure-Foundations, Prod-Kubernetes, Patch-For-Review: Kubernetes dependency upgrade in spicerack - https://phabricator.wikimedia.org/T390857#10792408 (elukey) Open→Resolved a:elukey
[14:20:59] serviceops, Discovery-Search (2025.05.02 - 2025.05.23): Migrate discovery-search jobs to mw-cron - https://phabricator.wikimedia.org/T388538#10792531 (Gehel)
[15:45:34] serviceops, Traffic, Content-Transform-Team (Work In Progress), Essential-Work, Patch-For-Review: Purging edge caches doesn't work for articles with ":" in their title - https://phabricator.wikimedia.org/T392849#10793002 (Jgiannelos) Open→Resolved I just verified this change in pr...
[15:49:34] serviceops, Moderator-Tools-Team: Migrate moderator-tools jobs to mw-cron - https://phabricator.wikimedia.org/T393395 (Scott_French) NEW
[15:49:54] serviceops, Moderator-Tools-Team: Migrate moderator-tools jobs to mw-cron - https://phabricator.wikimedia.org/T393395#10793042 (Scott_French)
[15:49:58] serviceops, MW-on-K8s, Patch-For-Review: Implement periodic maintenance scripts for mw-on-k8s - https://phabricator.wikimedia.org/T341555#10793043 (Scott_French)
[15:52:32] serviceops, Community-Tech, Patch-For-Review: Migrate community-tech jobs to mw-cron - https://phabricator.wikimedia.org/T388536#10793054 (Scott_French)
[17:24:02] serviceops, Observability-Alerting, Kubernetes, SRE Observability (FY2024/2025-Q4): Alert on unscrapable pods - https://phabricator.wikimedia.org/T372242#10793378 (JMeybohm) The problem I see here is that we configure the `k8s-pods` job to scrape all configured `containerPort`s of a pod if more t...
[17:36:16] serviceops, Community-Tech: Migrate community-tech jobs to mw-cron - https://phabricator.wikimedia.org/T388536#10793410 (Scott_French) The LoginNotify and PageAssessments jobs have both been migrated. I'll follow up later today to confirm their first scheduled runs succeed (23:00 and 20:42 UTC respective...
[18:07:47] serviceops, Moderator-Tools-Team, Patch-For-Review: Migrate moderator-tools jobs to mw-cron - https://phabricator.wikimedia.org/T393395#10793535 (Scott_French) Per discussion in #talk-to-moderator-tools, the desired phabricator tag for notifications is #moderator-tools-team. The pending patches will...
[20:50:18] serviceops, Community-Tech, Patch-For-Review: Migrate community-tech jobs to mw-cron - https://phabricator.wikimedia.org/T388536#10793948 (Scott_French)
[23:11:29] serviceops, Community-Tech, Patch-For-Review: Migrate community-tech jobs to mw-cron - https://phabricator.wikimedia.org/T388536#10794266 (Scott_French)
[23:11:46] serviceops, Community-Tech, Patch-For-Review: Migrate community-tech jobs to mw-cron - https://phabricator.wikimedia.org/T388536#10794267 (Scott_French) Open→Resolved a:Scott_French Both jobs have now had a successful first run: ` $ kubectl describe jobs/pageassessments-cleanup-29107...