[07:43:24] Machine-Learning-Team, SRE, SRE-Access-Requests: Promote dpogorzelski from ops-limited to ops - https://phabricator.wikimedia.org/T408702#11326220 (elukey) Correct this needs an approval from Mark afaik :) @mark Hi! Looping you in to approve the ops membership for Dawid (new Staff SRE in ML).
[08:35:04] good morning
[09:26:43] morning
[09:55:50] Dzień dobry!
[10:05:50] Lift-Wing, Machine-Learning-Team, Wikidata, Wikimedia Enterprise, Wikimedia Enterprise - Content Integrity: Request to host Wikidata Revert Risk on Lift Wing - https://phabricator.wikimedia.org/T406179#11326593 (Miriam) @kevinbazira based on @Trokhymovych 's feedback and after re-reading the...
[10:12:17] Lift-Wing, Machine-Learning-Team, Wikidata, Wikimedia Enterprise, Wikimedia Enterprise - Content Integrity: Request to host Wikidata Revert Risk on Lift Wing - https://phabricator.wikimedia.org/T406179#11326603 (kevinbazira) >>! In T406179#11326592, @Miriam wrote: > What would be the extra wo...
[10:26:04] :)
[10:40:23] Lift-Wing, Machine-Learning-Team, Wikidata, Wikimedia Enterprise, Wikimedia Enterprise - Content Integrity: Request to host Wikidata Revert Risk on Lift Wing - https://phabricator.wikimedia.org/T406179#11326692 (Miriam) Oh wonderful @kevinbazira sorry I misunderstood from your msg that you wa...
[11:04:33] Lift-Wing, Machine-Learning-Team, Wikidata, Wikimedia Enterprise, Wikimedia Enterprise - Content Integrity: Request to host Wikidata Revert Risk on Lift Wing - https://phabricator.wikimedia.org/T406179#11326742 (Trokhymovych) Hi @Miriam I am working on collecting a binary of all components...
[11:29:08] Machine-Learning-Team, SRE, SRE-Access-Requests: Promote dpogorzelski from ops-limited to ops - https://phabricator.wikimedia.org/T408702#11326775 (DPogorzelski-WMF) a:mark
[11:31:02] Machine-Learning-Team, Goal: Q2 FY2025-26 Goal: Deploy Add-a-link v2 models to production - https://phabricator.wikimedia.org/T408790 (OKarakaya-WMF) NEW
[11:38:29] Machine-Learning-Team, Goal: Q2 FY2025-26 Goal: Deploy Add-a-link v2 models to production - https://phabricator.wikimedia.org/T408790#11326800 (OKarakaya-WMF)
[11:40:18] Machine-Learning-Team, Goal: Q2 FY2025-26 Goal: Deploy Add-a-link v2 models to production - https://phabricator.wikimedia.org/T408790#11326811 (OKarakaya-WMF)
[11:41:14] Machine-Learning-Team, Goal: Q1 FY2025-26 Goal: Scaling Add-a-link to more wikis via production (airflow) pipelines - https://phabricator.wikimedia.org/T398950#11326820 (OKarakaya-WMF) As discussed, I'm creating a [new goal](https://phabricator.wikimedia.org/T408790) for deployments. and I'm closing thi...
[11:47:34] Machine-Learning-Team, Goal: Q2 FY2025-26 Goal: Deploy Add-a-link v2 models to production - https://phabricator.wikimedia.org/T408790#11326871 (OKarakaya-WMF)
[12:27:22] Machine-Learning-Team, Goal: Q2 FY2025-26 Goal: Deploy Add-a-link v2 models to production - https://phabricator.wikimedia.org/T408790#11326948 (OKarakaya-WMF)
[12:27:52] Machine-Learning-Team, Goal: Q2 FY2025-26 Goal: Deploy Add-a-link v2 models to production - https://phabricator.wikimedia.org/T408790#11326950 (OKarakaya-WMF)
[12:29:17] Machine-Learning-Team, Goal: Q2 FY2025-26 Goal: Deploy Add-a-link v2 models to production - https://phabricator.wikimedia.org/T408790#11326952 (OKarakaya-WMF)
[12:32:23] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, PersonalDashboard: WE 1.3.4 Roll out Revert Risk Filters to Wikis that don't have damaging/goodfaith Edit Models - https://phabricator.wikimedia.org/T408388#11326955 (DMburugu)
[12:33:45] Machine-Learning-Team, Goal: Q2 FY2025-26 Goal: Deploy Add-a-link v2 models to production - https://phabricator.wikimedia.org/T408790#11326961 (OKarakaya-WMF)
[12:46:55] Hello,
[12:46:55] Can you take a look at this patch?: https://gerrit.wikimedia.org/r/c/research/mwaddlink/+/1199815
[12:46:55] It enables deploying models from the new location. It will deploy the new model for a wiki when that wiki appears in this list: https://analytics.wikimedia.org/published/wmf-ml-models/addalink/v2/wikis.txt Otherwise, it will use the v1 model.
I think Martin is off today and our team is committed to the rest of the goal https://phabricator.wikimedia.org/T408790 @kevinbazira
[12:52:16] Machine-Learning-Team, Goal: Q2 FY2025-26 Goal: Deploy Add-a-link v2 models to production - https://phabricator.wikimedia.org/T408790#11327000 (OKarakaya-WMF)
[12:52:50] Machine-Learning-Team, Goal: Q2 FY2025-26 Goal: Deploy Add-a-link v2 models to production - https://phabricator.wikimedia.org/T408790#11327002 (OKarakaya-WMF)
[13:15:30] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: DIMM_A2 errors for ml-serve2001 - https://phabricator.wikimedia.org/T408516#11327064 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by elukey@cumin1003 pool for host ml-serve2001.codfw.wmnet completed: - ml-serve2001.codfw.w...
[13:15:38] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: DIMM_A2 errors for ml-serve2001 - https://phabricator.wikimedia.org/T408516#11327067 (elukey) Open→Resolved a:elukey Host repooled!
[13:53:44] FIRING: LiftWingServiceErrorRate: ...
[13:53:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=svwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[14:28:51] RESOLVED: LiftWingServiceErrorRate: ...
[14:28:51] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=svwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[14:29:20] --^ looking
[14:30:36] Looks like this was svwiki, with timeouts (resp code 0, client disconnect)
[14:31:04] Started having an elevated count of those around 13:20 UTC
[14:41:07] Looks like the predict step was slow, e.g.:
[14:41:22] 2025-10-30 13:26:28.953 kserve.trace kserve.io.kserve.protocol.rest.v1_endpoints.predict: 23.59904289245605
[14:41:24]
[14:52:24] saw this on logstash: error reverse proxying request; sockstat: sockets: used 178
[14:52:24] TCP: inuse 153 orphan 3 tw 57 alloc 9077 mem 688
[14:52:24] UDP: inuse 0 mem 8106
[14:52:24] UDPLITE: inuse 0
[14:52:24] RAW: inuse 0
[14:52:25] FRAG: inuse 0 memory 0
[14:52:38] in the queue-proxy container
[14:53:13] Machine-Learning-Team, Goal: Q2 FY2025-26 Goal: Deploy Add-a-link v2 models to production - https://phabricator.wikimedia.org/T408790#11327645 (OKarakaya-WMF)
[14:54:05] what's that?
[15:01:37] aiko: it may be the Knative queue proxy container, it sits between istio and kserve IIRC
[15:02:54] the svwiki's kserve pod CPU was maxed out https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&var-datasource=000000026&var-site=codfw&var-cluster=k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-pod=svwiki-damaging-predictor-default-00029-deployment-68bbb56k9hlx&var-container=$__all&from=now-6h&to=now&timezone=utc
[15:03:19] it follows what Tobias found, I think it is the usual problem of the heavy rev-ids
[15:03:24] cc: aiko --^
[16:02:14] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review, Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11328139 (Michael) @AikoChou, @BWojtowicz-WMF and I have met and discussed thi...
[16:13:44] FIRING: LiftWingServiceErrorRate: ...
[16:13:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[16:28:44] RESOLVED: LiftWingServiceErrorRate: ...
[16:28:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[17:55:46] Machine-Learning-Team, LPL Hypothesis, Recommendation-API: Collection data unavailable in several rec-api hosts - https://phabricator.wikimedia.org/T406854#11328912 (SBisson) Open→In progress
[19:01:40] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review, Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11329271 (achou) It’s good that we’re discussing this! I've learned a lot :)...
[19:52:08] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review, Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11329384 (Ottomata) > We won't use a different source unit, so I think includi...
[21:48:56] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review, Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11329865 (Eevans) Ok, I've updated https://gitlab.wikimedia.org/repos/sre/data...