[08:00:21] Machine-Learning-Team, Shared-Data-Infrastructure, Epic, Patch-For-Review: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=74cbf082-10f9-46b9-9315-de46465fbfba) set by elukey@cumin1001 for 8:00:00 on 8 host(s) a...
[08:43:54] Machine-Learning-Team, Growth-Team, PageTriage: Detection and flagging of articles that are AI/LLM-generated - https://phabricator.wikimedia.org/T330346 (kostajh) Tagging #machine-learning-team for awareness, or maybe they have something like this on their roadmap?
[09:10:51] Machine-Learning-Team, Shared-Data-Infrastructure, Epic, Patch-For-Review: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1002.eqiad.wmnet with OS bullseye
[09:11:17] Machine-Learning-Team, Shared-Data-Infrastructure, Epic, Patch-For-Review: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1003.eqiad.wmnet with OS bullseye
[09:11:45] Machine-Learning-Team, Shared-Data-Infrastructure, Epic, Patch-For-Review: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1004.eqiad.wmnet with OS bullseye
[09:12:58] Machine-Learning-Team, Shared-Data-Infrastructure, Epic, Patch-For-Review: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1005.eqiad.wmnet with OS bullseye
[09:13:13] Machine-Learning-Team, Shared-Data-Infrastructure, Epic, Patch-For-Review: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[09:13:29] Machine-Learning-Team, Shared-Data-Infrastructure, Epic, Patch-For-Review: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye
[09:13:57] Machine-Learning-Team, Shared-Data-Infrastructure, Epic, Patch-For-Review: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye
[09:15:44] https://kserve.github.io/website/developer/debug/#debug-kserve-request-flow
[09:15:50] this is really nice --^
[09:16:01] it summarizes the request flow between istio/knative/kserve
[09:26:52] Niiice, thanks for sharing
[09:27:41] Machine-Learning-Team, Shared-Data-Infrastructure, Epic, Patch-For-Review: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye executed wi...
[09:45:01] aiko: \o/
[09:46:35] good that u mentioned the gerrit patch for the AIX explainer, hadn't noticed it
[09:49:20] however it is much simpler to run it as far as I understand because the LIME method is integrated in kserve
[09:50:14] I am debugging this here https://github.com/kserve/kserve/blob/master/docs/samples/explanation/aix/Fetch_20newsgroups/README.md
[09:50:14] cause it doesn't play well out of the box (because of scikit-learn incompatibilities) so I could try with one of our models instead
[09:51:25] sorry for not updating the ticket.. I will upload new instructions to run kserve locally with minikube as some of the official instructions don't work anymore with 0.10
[10:00:07] Lift-Wing, Machine-Learning-Team: Support the Revert-Review API/tool on Toolforge - https://phabricator.wikimedia.org/T330148 (diego) @achou the main requestor will be the aforementioned API, for evaluating the model. I don't expect high traffic. Let's say a couple of thousands per week.
[10:07:41] Machine-Learning-Team, Shared-Data-Infrastructure, Epic, Patch-For-Review: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1003.eqiad.wmnet with OS bullseye completed:...
[10:10:40] Machine-Learning-Team, Shared-Data-Infrastructure, Epic, Patch-For-Review: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1002.eqiad.wmnet with OS bullseye completed:...
[10:12:28] Machine-Learning-Team, Shared-Data-Infrastructure, Epic, Patch-For-Review: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1004.eqiad.wmnet with OS bullseye completed:...
[10:19:10] isaranto: o/
[10:19:41] my plan is to try if I can simply run the explainer in the gerrit patch with docker locally (similar to https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/KServe#Example_1_-_Testing_enwiki-goodfaith).
[10:20:08] and see if I can reproduce the result in
[10:20:09] https://phabricator.wikimedia.org/T301378#7698176, which was tested on ml-sandbox.
[10:23:47] I was checking the ml-sandbox today and found the minikube on ml-sandbox was gone. don't know why. I tried to install it back but couldn't because of no space left on device..
[10:24:23] Last time I checked, minikube was what took the most space
[10:24:59] IIRC it was also due to the docker images
[10:25:10] /srv/docker/volumes/minikube is ~35G
[10:25:37] yeah but minikube itself holds its docker + images IIRC
[10:25:50] if you minikube ssh + docker image list you should see it
[10:27:03] maybe we need to wipe the docker stuff and do a refetch to get it to work? But I dunno if cleaning out all that would destroy anything we can't easily get back
[10:27:04] tried minikube ssh but seems the docker daemon is not running
[10:27:37] I restarted it
[10:28:01] But with /srv being full, it likely won't be able to do much
[10:28:58] docker-registry.wikimedia.org/buster 20220123 bca4b63acefd 13 months ago 69.3MB
[10:29:00] Ouch.
[10:29:17] oh wait, MB, not GB. Nvm :)
[10:29:27] Machine-Learning-Team, Shared-Data-Infrastructure, Epic, Patch-For-Review: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye
[10:33:47] tried minikube ssh again but apparently there is no minikube, so we need to run minikube start
[10:34:27] I can't run minikube because it can't even set up its main dir because ENOSPC
[10:35:51] same
[10:37:15] elukey: 23G /srv/docker/volumes/minikube/_data/lib/docker/containers/e92a0b634d25d9cf60590a61900cd13bb14bd61e29e9ec458a603dba6ed279b3/e92a0b634d25d9cf60590a61900cd13bb14bd61e29e9ec458a603dba6ed279b3-json.log
[10:37:34] ^^^ Something tells me this log is useless, mostly. Do you think it's safe to truncate it?
[10:38:13] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 10th round of wikis - https://phabricator.wikimedia.org/T308135 (kostajh) >>! In T308135#8639664, @kevinbazira wrote: > @kostajh, we published datasets for all 17/19 models that passed the evaluation in this round....
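Editor's sketch of the hunt for the oversized container log and the truncation being discussed. This is not the command that was run on ml-sandbox; `largest_files` and `truncate_in_place` are hypothetical helper names, and whether truncating a log that a daemon still holds open is safe on a given host is an assumption that would need checking there.

```python
import os


def largest_files(root, limit=5):
    """Return up to `limit` (size_in_bytes, path) pairs under `root`, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    return sorted(sizes, reverse=True)[:limit]


def truncate_in_place(path):
    """Empty a file without unlinking it, so the path and inode stay the same."""
    with open(path, "r+b") as handle:
        handle.truncate(0)


if __name__ == "__main__":
    # Hypothetical usage: list the biggest files below the current directory.
    for size, path in largest_files(".", limit=3):
        print(f"{size:>12} {path}")
```

Truncating rather than deleting keeps the file in place for whatever process is writing to it. For the longer term, Docker's `json-file` log driver accepts `max-size` and `max-file` options that cap log growth.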
[10:38:52] klausman: +1 yes
[10:39:23] alright, will do that
[10:44:19] thanks 🙏
[10:44:57] aiko: try again, please
[10:47:34] I made a compressed copy of the file to my laptop, just in case
[10:48:40] yes I can run minikube start now
[10:52:27] Machine-Learning-Team, Shared-Data-Infrastructure, Epic, Patch-For-Review: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1005.eqiad.wmnet with OS bullseye executed wi...
[10:52:54] excellent!
[10:54:21] docker images in minikube didn't occupy much space, only the revertrisk one uses ~2GB and I'm gonna delete it
[10:55:59] ack, thank you.
[10:56:19] I think the disk-fillup was just one minikube running for a long time with no log rotation
[10:59:57] Machine-Learning-Team, Shared-Data-Infrastructure, Epic, Patch-For-Review: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye executed wi...
[11:00:01] Machine-Learning-Team, Shared-Data-Infrastructure, Epic, Patch-For-Review: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye executed wi...
[11:02:22] Machine-Learning-Team, Shared-Data-Infrastructure, Epic, Patch-For-Review: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye executed wi...
[11:03:26] There likely isn't much interesting info in the log file: I compressed it and it went from 22.8G to 722.6M
[11:08:39] got it! I think it would be nice to note the reason for the disk-fillup somewhere for future reference.. maybe in https://phabricator.wikimedia.org/T305447
[11:19:52] Machine-Learning-Team: Automate the procedure to bootstrap minikube on the ML-Sandbox and to share it by multiple users - https://phabricator.wikimedia.org/T305447 (klausman) The current (now resolved) reason for the disk fillup was a 22G logfile: `/srv/docker/volumes/minikube/_data/lib/docker/containers/e9...
[11:33:05] * elukey lunch
[11:37:45] isaranto: o/ I'll try to run the explainer you linked to on ml-sandbox
[11:38:19] cool, feel free to try it either way
[11:41:04] for the moment I am getting a `requests.exceptions.HTTPError: 501 Server Error: Not Implemented for url: http://127.0.0.1:80/v1/models/aix-explainer:explain` for the explain call while predict works fine
[11:54:32] * klausman lunch
[11:58:02] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 7th round of wikis - https://phabricator.wikimedia.org/T304551 (Sgs) I ran this script for adding the link-recommendation task type and populating the excluded sections: `lang=bash PHAB=T304551...
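Editor's sketch of how the predict and explain calls mentioned at 11:41:04 differ only in the verb suffix of the V1-style URL. The helper names `v1_endpoint` and `build_request` and the `{"instances": [...]}` payload are illustrative assumptions; only the URL shape is taken from the error message in the log. Nothing is sent over the network here.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def v1_endpoint(base_url, model_name, verb):
    """Build a V1-style model URL such as .../v1/models/<name>:explain."""
    if verb not in ("predict", "explain"):
        raise ValueError(f"unsupported verb: {verb!r}")
    return f"{base_url.rstrip('/')}/v1/models/{quote(model_name, safe='')}:{verb}"


def build_request(base_url, model_name, verb, instances):
    """Prepare (but do not send) a JSON POST request for the given verb."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    return Request(
        v1_endpoint(base_url, model_name, verb),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Host and model name are the ones visible in the error message above.
    for verb in ("predict", "explain"):
        print(v1_endpoint("http://127.0.0.1:80", "aix-explainer", verb))
```

Since both requests go to the same host and model, a 501 on `:explain` alone suggests the explain route is not being served by whatever answers the request, which is one thing to check before suspecting the payload.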
[11:58:11] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 7th round of wikis - https://phabricator.wikimedia.org/T304551 (Sgs)
[11:58:51] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 7th round of wikis - https://phabricator.wikimedia.org/T304551 (Sgs) Open→In progress
[11:59:38] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 8th round of wikis - https://phabricator.wikimedia.org/T308133 (Sgs) Open→In progress
[11:59:55] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 9th round of wikis - https://phabricator.wikimedia.org/T308134 (Sgs) Open→In progress
[12:08:19] * isaranto lunch
[12:14:58] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 8th round of wikis - https://phabricator.wikimedia.org/T308133 (Sgs) I ran this script for adding the link-recommendation task type and populating the excluded sections: `lang=bash PHAB=T304551...
[12:15:04] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 8th round of wikis - https://phabricator.wikimedia.org/T308133 (Sgs)
[12:39:38] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 9th round of wikis - https://phabricator.wikimedia.org/T308134 (Sgs) I ran this script for adding the link-recommendation task type and populating the excluded sections: `lang=bash PHAB=T308134...
[12:44:58] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 9th round of wikis - https://phabricator.wikimedia.org/T308134 (Sgs)
[14:10:25] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1005.eqiad.wmnet with OS bullseye
[14:18:01] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[14:22:06] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye
[14:23:13] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye
[14:23:55] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1005.eqiad.wmnet with OS bullseye executed with errors: - dse-k8s-w...
[14:31:31] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye executed with errors: - dse-k8s-w...
[14:33:10] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (elukey) Full exception: ` Exception raised while executing cookbook sre.hosts.reimage: Traceback (most recent call last): File "/usr/lib/python3/dist-packages/spicerack/_me...
[14:36:25] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye executed with errors: - dse-k8s-w...
[14:49:35] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (elukey) Open→Stalled There is a problem/bug triggered while reimaging nodes in row E/F in eqiad, tracked in T306421. Until it is fixed we cannot really complete the re...
[14:50:12] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b1fe3d5f-2fa2-4c9e-92d5-c7b84f294e1e) set by elukey@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their servi...
[14:50:40] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye executed with errors: - dse-k8s-w...
[15:20:59] so with the switch to FastAPI there is a SwaggerUI built in https://kserve.github.io/website/0.10/get_started/swagger_ui/ \o/
[15:21:54] all one needs is an argument in the predictor in the InferenceService ```args: ["--enable_docs_url=True"]```
[15:22:22] Machine-Learning-Team: Upgrade the ml-staging-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T327767 (elukey) ` { "method": "POST", "authority": "enwiki-goodfaith-predictor-default.revscoring-editquality-goodfaith.svc.cluster.local", "downstream_local_address": "10.194.61.225:8443", "u...
[16:05:59] Lift-Wing, Machine-Learning-Team: Investigate Explainer for Revert-Risk model - https://phabricator.wikimedia.org/T330131 (achou) @isarantopoulos I was able to run the kserve [[ https://github.com/kserve/kserve/tree/master/docs/samples/explanation/aix/Fetch_20newsgroups | AIX explainer example ]] on ml-s...
[16:14:57] aiko: nice work!
[16:15:43] which kserve version did u have? the same stuff fails on kserve 0.10 for me
[16:16:12] btw I created this guide to setup kserve with minikube https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/KServe/DeployLocal
[16:16:18] and run an inference service
[16:17:51] it has stuff from the kserve documentation and bits and pieces from here and there. main difference is that the official documentation doesn't work as I needed `networking.k8s.io/v1` instead of `networking.k8s.io/v1beta1` for the IngressClass (thank u Pycharm for the hint)
[16:19:46] my future plan would be to have a LiftWing model instead of sklearn with iris dataset
[16:20:19] also couldn't get the API docs (swagger) to work yet 🤷
[16:24:07] Heading out now. Enjoy your weekends, folks (and enjoy the extra day off, Ilias!)
[16:31:02] going afk folks, have a good weekend!
[16:50:20] Thanks, enjoy your weekend, I'm heading out too!
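Editor's sketch of the two manifest details mentioned at 15:21:54 and 16:17:51: the `--enable_docs_url=True` predictor argument and the `networking.k8s.io/v1` apiVersion for the IngressClass. The manifests are built as plain Python dicts and printed as JSON. The resource names, the storage URI and the exact field layout are illustrative assumptions based on the KServe v1beta1 and Kubernetes APIs as the editor understands them, not the content of the wikitech guide.

```python
import json


def inference_service(name, storage_uri, enable_docs=True):
    """Return an InferenceService manifest for an sklearn predictor."""
    model = {
        "modelFormat": {"name": "sklearn"},
        "storageUri": storage_uri,
    }
    if enable_docs:
        # The argument quoted in the chat; it is meant to turn on the Swagger UI.
        model["args"] = ["--enable_docs_url=True"]
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {"predictor": {"model": model}},
    }


def ingress_class(name="istio", controller="istio.io/ingress-controller"):
    """Return an IngressClass manifest using the v1 API rather than v1beta1."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "IngressClass",
        "metadata": {"name": name},
        "spec": {"controller": controller},
    }


if __name__ == "__main__":
    # Hypothetical name and storage location, for illustration only.
    service = inference_service("sklearn-iris", "gs://example-bucket/models/sklearn/iris")
    print(json.dumps(service, indent=2))
    print(json.dumps(ingress_class(), indent=2))
```

The JSON output can be saved to a file and passed to `kubectl apply -f`, which accepts JSON as well as YAML.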