[06:20:53] Machine-Learning-Team, Wikimedia Enterprise: Test liftwing wikidata revert risk API for scale and latency - https://phabricator.wikimedia.org/T409388#11422598 (kevinbazira) @prabhat, has the WME team had a chance to run scale and latency tests on the revertrisk-wikidata inference service? Does this serv...
[08:00:21] hey folks!
[08:00:52] I have noticed the recommendation API alerts firing a lot recently; from a quick look it seems to be a client that often times out
[08:01:02] the number of requests is not high, but we should check anyway
[08:01:11] who's oncall this week? I can help if needed
[08:24:53] Machine-Learning-Team: Setup & experiments for MI300x GPUs used for LiftWing - https://phabricator.wikimedia.org/T403599#11422678 (DPogorzelski-WMF) hmmm doesn't seem like it. Allocated resources: (Total limits may be over 100 percent, i.e., overcommitted.) Resource Requests Limits ----...
[08:27:44] FIRING: LiftWingServiceErrorRate: ...
[08:27:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[08:35:45] Machine-Learning-Team: Setup & experiments for MI300x GPUs used for LiftWing - https://phabricator.wikimedia.org/T403599#11422694 (elukey) yeah I think `amd.com/gpu: 1` wasn't added when deploying aya, only tolerations, that would explain the result..
[08:39:50] Machine-Learning-Team: Setup & experiments for MI300x GPUs used for LiftWing - https://phabricator.wikimedia.org/T403599#11422701 (elukey) My bad, the GPU is there: ` root@deploy2002:~# kubectl exec aya-llm-predictor-00015-deployment-65b4577748-6wh2c -n llm -- ls /dev/dri card1 renderD128 ` And the limits...
[08:47:44] RESOLVED: LiftWingServiceErrorRate: ...
[08:47:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[09:07:12] o/ it's me on call this week
[09:07:21] the issue's been happening since last week
[09:08:49] Bartosz looked into it.. "a lot of examples of 503 Service Unavailable logs when trying to call `http://localhost:6015/v2/suggest/sections/...` Those also happened during times of high-throughput and resolved automatically after a few minutes."
[09:09:06] and it might be closely related to https://phabricator.wikimedia.org/T406854
[09:09:14] and https://phabricator.wikimedia.org/T381438
[09:09:20] I'll investigate further
[09:13:57] aiko: o/ ack! From the istio dashboard it also seems that the "0" response code is logged, which is envoy/istio for the client giving up and aborting the connection (maybe because of a timeout or similar)
[09:28:26] (CR) AikoChou: [C:+1] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1212555 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[09:30:37] elukey: ack!
[09:31:41] (CR) Bartosz Wójtowicz: [C:+2] revise-tone-task-generator: Use BatchQuery to optimise Cass writes. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1212555 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[09:41:32] (Merged) jenkins-bot: revise-tone-task-generator: Use BatchQuery to optimise Cass writes. 
[machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1212555 (https://phabricator.wikimedia.org/T408538) (owner: Bartosz Wójtowicz)
[10:01:27] Machine-Learning-Team: Build and push images to the docker registry from ml-lab - https://phabricator.wikimedia.org/T394778#11422864 (DPogorzelski-WMF) >>! In T394778#11419023, @akosiaris wrote: > Thanks for the nice discussion everyone. Overall, I think with the suggestion of building images on a dedicated...
[11:36:53] (PS3) Nik Gkountas: collection recommender: split into two classes [research/recommendation-api] - https://gerrit.wikimedia.org/r/1212135
[12:43:17] Machine-Learning-Team: Setup & experiments for MI300x GPUs used for LiftWing - https://phabricator.wikimedia.org/T403599#11423340 (DPogorzelski-WMF) Most likely, I'm currently looking at the builder machine so will come back to this
[12:49:01] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, OKR-Work: Run the backfill scripts for first batch of wikis: < 1000 monthly edits - https://phabricator.wikimedia.org/T411485 (DMburugu) NEW
[12:50:04] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, OKR-Work: Run the backfill scripts for first batch of wikis: < 1000 monthly edits - https://phabricator.wikimedia.org/T411485#11423389 (DMburugu)
[12:50:05] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, and 2 others: WE 1.3.4 Roll out Revert Risk Filters to Wikis that don't have damaging/goodfaith Edit Models - https://phabricator.wikimedia.org/T408388#11423390 (DMburugu)
[12:55:38] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, OKR-Work: Run the backfill scripts for second batch of wikis: > 1000 AND <= 2000 monthly edits - https://phabricator.wikimedia.org/T411487 (DMburugu) NEW
[12:56:18] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, OKR-Work: Run the backfill scripts for second batch of wikis: > 1000 AND <= 2000 monthly edits - https://phabricator.wikimedia.org/T411487#11423440 (DMburugu)
[12:56:21] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, and 2 others: WE 1.3.4 Roll out Revert Risk Filters to Wikis that don't have damaging/goodfaith Edit Models - https://phabricator.wikimedia.org/T408388#11423441 (DMburugu)
[12:59:20] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, OKR-Work: Run the backfill scripts for third batch of wikis: > 2000 AND <= 5000 monthly edits - https://phabricator.wikimedia.org/T411489 (DMburugu) NEW
[13:00:16] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, and 2 others: WE 1.3.4 Roll out Revert Risk Filters to Wikis that don't have damaging/goodfaith Edit Models - https://phabricator.wikimedia.org/T408388#11423463 (DMburugu)
[13:00:18] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, OKR-Work: Run the backfill scripts for third batch of wikis: > 2000 AND <= 5000 monthly edits - https://phabricator.wikimedia.org/T411489#11423462 (DMburugu)
[13:02:49] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, OKR-Work: Run the backfill scripts for the fourth batch of wikis: > 5000 AND <= 10000 monthly edits - https://phabricator.wikimedia.org/T411490 (DMburugu) NEW
[13:03:11] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, OKR-Work: Run the backfill scripts for the fourth batch of wikis: > 5000 AND <= 10000 monthly edits - https://phabricator.wikimedia.org/T411490#11423489 (DMburugu)
[13:03:14] 
Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, and 2 others: WE 1.3.4 Roll out Revert Risk Filters to Wikis that don't have damaging/goodfaith Edit Models - https://phabricator.wikimedia.org/T408388#11423490 (DMburugu)
[13:09:06] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, OKR-Work: Run the backfill scripts for the fourth batch of wikis: > 10000 AND <= 30000 monthly edits - https://phabricator.wikimedia.org/T411492 (DMburugu) NEW
[13:09:28] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, OKR-Work: Run the backfill scripts for the fourth batch of wikis: > 10000 AND <= 30000 monthly edits - https://phabricator.wikimedia.org/T411492#11423525 (DMburugu)
[13:09:30] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, and 2 others: WE 1.3.4 Roll out Revert Risk Filters to Wikis that don't have damaging/goodfaith Edit Models - https://phabricator.wikimedia.org/T408388#11423526 (DMburugu)
[13:12:36] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, OKR-Work: Run the backfill scripts for the fifth batch of wikis: > 30000 AND <= 70000 monthly edits - https://phabricator.wikimedia.org/T411493 (DMburugu) NEW
[13:12:45] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, OKR-Work: Run the backfill scripts for the fifth batch of wikis: > 30000 AND <= 70000 monthly edits - https://phabricator.wikimedia.org/T411493#11423552 (DMburugu)
[13:12:50] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, and 2 others: WE 1.3.4 Roll out Revert Risk Filters to Wikis that don't have damaging/goodfaith Edit Models - https://phabricator.wikimedia.org/T408388#11423553 (DMburugu)
[13:16:21] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, OKR-Work: Run the backfill scripts for the sixth batch of wikis: > 70000 AND <= 150000 monthly edits - https://phabricator.wikimedia.org/T411494 (DMburugu) NEW
[13:16:37] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, OKR-Work: Run the backfill scripts for the sixth batch of wikis: > 70000 AND <= 150000 monthly edits - https://phabricator.wikimedia.org/T411494#11423574 (DMburugu)
[13:16:39] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, and 2 others: WE 1.3.4 Roll out Revert Risk Filters to Wikis that don't have damaging/goodfaith Edit Models - https://phabricator.wikimedia.org/T408388#11423575 (DMburugu)
[13:19:47] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, OKR-Work: Run the backfill scripts for the seventh batch of wikis: > 150000 monthly edits - https://phabricator.wikimedia.org/T411495 (DMburugu) NEW
[13:20:03] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, OKR-Work: Run the backfill scripts for the seventh batch of wikis: > 150000 monthly edits - https://phabricator.wikimedia.org/T411495#11423592 (DMburugu)
[13:20:05] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, and 2 others: WE 1.3.4 Roll out Revert Risk Filters to Wikis that don't have damaging/goodfaith Edit Models - https://phabricator.wikimedia.org/T408388#11423593 (DMburugu)
[13:21:12] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, OKR-Work: Run the backfill scripts for the third batch of wikis: > 2000 AND <= 5000 monthly edits - https://phabricator.wikimedia.org/T411489#11423594 (DMburugu)
[13:21:33] 
Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, OKR-Work: Run the backfill scripts for the third batch of wikis: > 2000 AND <= 5000 monthly edits - https://phabricator.wikimedia.org/T411489#11423595 (DMburugu)
[13:21:55] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, OKR-Work: Run the backfill scripts for the second batch of wikis: > 1000 AND <= 2000 monthly edits - https://phabricator.wikimedia.org/T411487#11423596 (DMburugu)
[13:22:44] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, and 2 others: WE 1.3.4 Roll out Revert Risk Filters to Wikis that don't have damaging/goodfaith Edit Models - https://phabricator.wikimedia.org/T408388#11423606 (DMburugu)
[13:23:05] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, OKR-Work: Run the backfill scripts for the fourth batch of wikis: > 5000 AND <= 10000 monthly edits - https://phabricator.wikimedia.org/T411490#11423607 (DMburugu)
[13:23:17] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, OKR-Work: Run the backfill scripts for the fourth batch of wikis: > 10000 AND <= 30000 monthly edits - https://phabricator.wikimedia.org/T411492#11423610 (DMburugu)
[13:23:32] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, OKR-Work: Run the backfill scripts for the fifth batch of wikis: > 30000 AND <= 70000 monthly edits - https://phabricator.wikimedia.org/T411493#11423614 (DMburugu)
[13:23:47] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, OKR-Work: Run the backfill scripts for the sixth batch of wikis: > 70000 AND <= 150000 monthly edits - https://phabricator.wikimedia.org/T411494#11423619 (DMburugu)
[13:24:08] 
Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, OKR-Work: Run the backfill scripts for the seventh batch of wikis: > 150000 monthly edits - https://phabricator.wikimedia.org/T411495#11423621 (DMburugu)
[13:46:58] Machine-Learning-Team: Setup & experiments for MI300x GPUs used for LiftWing - https://phabricator.wikimedia.org/T403599#11423698 (DPogorzelski-WMF) perhaps this is relevant: `journalctl -u kubelet --since "10 days ago" | grep -i "amd.com/gpu\|allocate\|device"` `Dec 02 13:38:38 ml-serve1012 kubelet[252692...
[13:47:44] FIRING: LiftWingServiceErrorRate: ...
[13:47:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[13:52:44] RESOLVED: LiftWingServiceErrorRate: ...
[13:52:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[13:58:09] Machine-Learning-Team: Setup & experiments for MI300x GPUs used for LiftWing - https://phabricator.wikimedia.org/T403599#11423748 (DPogorzelski-WMF) The above could be a false positive, might be happening when the plugin is restarted
[15:06:39] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, OKR-Work: Run the backfill scripts for the fifth batch of wikis: > 10000 AND <= 30000 monthly edits - https://phabricator.wikimedia.org/T411492#11424070 (Samwalton9-WMF)
[15:07:59] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, OKR-Work: Run the backfill scripts for the sixth batch of wikis: > 30000 AND <= 70000 monthly edits - https://phabricator.wikimedia.org/T411493#11424074 (Samwalton9-WMF)
[15:08:13] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, OKR-Work: Run the backfill scripts for the seventh batch of wikis: > 70000 AND <= 150000 monthly edits - https://phabricator.wikimedia.org/T411494#11424080 (Samwalton9-WMF)
[15:09:02] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, OKR-Work: Run the backfill scripts for the eighth batch of wikis: > 150000 monthly edits - https://phabricator.wikimedia.org/T411495#11424089 (Samwalton9-WMF) Open→Stalled
[15:45:53] Machine-Learning-Team: Setup & experiments for MI300x GPUs used for LiftWing - 
https://phabricator.wikimedia.org/T403599#11424287 (elukey) I checked the Allocated resources for ml-serve1009, where we run the revise-tone-task pod on a GPU, and I see the following: ` Allocated resources: (Total limits may b...
[15:47:34] Machine-Learning-Team: Setup & experiments for MI300x GPUs used for LiftWing - https://phabricator.wikimedia.org/T403599#11424302 (elukey) mmm but is allocatable something that varies dynamically? Probably not, if so everything seems to be working fine. Or am I missing anything?
[16:06:23] Machine-Learning-Team, Abstract Wikipedia team, Wikifunctions, Wikilabels, WikiLambda Front-end: Discrepancy between Wikifunctions:Introduction and the actual display: "Select Function" - https://phabricator.wikimedia.org/T371487#11424398 (DSmit-WMF)
[17:24:04] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, OKR-Work: Run the backfill scripts for the fifth batch of wikis: > 10000 AND <= 30000 monthly edits - https://phabricator.wikimedia.org/T411492#11425047 (DMburugu)
[18:47:44] FIRING: LiftWingServiceErrorRate: ...
[18:47:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[18:57:44] RESOLVED: LiftWingServiceErrorRate: ...
[18:57:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[20:05:35] (CR) Eamedina: [C:+2] "LGTM" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1212135 (owner: Nik Gkountas)
[20:07:06] (Merged) jenkins-bot: collection recommender: split into two classes [research/recommendation-api] - https://gerrit.wikimedia.org/r/1212135 (owner: Nik Gkountas)
[22:32:55] (Abandoned) Sbisson: Page collection validation script [research/recommendation-api] - https://gerrit.wikimedia.org/r/1203077 (owner: Sbisson)