[05:37:38] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 11th round of wikis - https://phabricator.wikimedia.org/T308136 (kevinbazira) Model evaluation has been completed and below are the backtesting results: | | Precision@0.5 | Recall@0.5 |lawiki | 0.89 | 0.47 |ladwik...
[06:36:07] artificial-intelligence, WMF-Inspiration-Week-2022-ML-Collab: Deploy Image content filtration model for Wikimedia Commons - https://phabricator.wikimedia.org/T279416 (Metin6201)
[07:48:02] hello folks
[08:02:56] Hey Luca!
[08:08:15] Kalimera :)
[08:08:26] I am prepping to upgrade ml-serve-eqiad to 1.23
[08:10:06] artificial-intelligence, WMF-Inspiration-Week-2022-ML-Collab: Deploy Image content filtration model for Wikimedia Commons - https://phabricator.wikimedia.org/T279416 (Peachey88)
[08:11:39] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by elukey@cumin1001 for host ml-etcd1003.eqiad.wmnet with OS bullseye
[08:11:53] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by elukey@cumin1001 for host ml-etcd1002.eqiad.wmnet with OS bullseye
[08:12:01] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by elukey@cumin1001 for host ml-etcd1001.eqiad.wmnet with OS bullseye
[08:12:31] ok so etcd nodes under reimage
[08:40:25] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by elukey@cumin1001 for host ml-etcd1001.eqiad.wmnet with OS bullseye completed: - ml-etcd1001 (**PASS**)...
[08:41:42] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by elukey@cumin1001 for host ml-etcd1002.eqiad.wmnet with OS bullseye completed: - ml-etcd1002 (**PASS**)...
[08:41:54] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by elukey@cumin1001 for host ml-etcd1003.eqiad.wmnet with OS bullseye completed: - ml-etcd1003 (**PASS**)...
[08:46:01] etcd done, reimaging the cluster (root tmux session on cumin1001 called T330758)
[08:50:57] * elukey relocating
[09:21:22] isaranto: o/
[09:21:27] very interesting data https://w.wiki/6Pbp
[09:22:30] so WME calls us :D
[09:22:37] something that I didn't expect..
[09:24:18] https://w.wiki/6Pc3
[09:24:21] o/ thanks for the link, great graph
[09:24:43] they call us using the /v3/scores endpoint, not good (it is the one returning multiple scores at once IIRC)
[09:25:11] I think that we should open a task to them, asking what their use case is and if they would be ok with using/testing LiftWing when ready
[09:25:54] do you want to do it or should I?
[09:32:12] https://github.com/wikimedia/OKAPI/blob/master/service/streams/revisionscore/revisionscore.go
[09:32:16] * elukey cries in a corner
[09:32:23] so they seem also to use revision-score
[09:33:59] ok we need to figure out what they are doing, otherwise we can't deprecate
[09:35:40] Lemme check it and I'll do it. Thanks for the heads up
[09:36:35] * isaranto afk for 30' commuting to co-working
[09:37:04] Machine-Learning-Team: Review ORES traffic to better understand Lift Wing's requirements - https://phabricator.wikimedia.org/T325763 (elukey) https://w.wiki/6Pbp shows the UAs calling the ores.wikimedia.org endpoint (so all external clients). We have also Wikimedia Enterprise, that is a surprise, apparently...
[09:38:43] klausman: o/ https://www.kubeflow.org/docs/components/pipelines/v1/sdk/manipulate-resources/#persistent-volume-claims-pvcs sigh
[09:38:49] not sure if it can be optional or not
[09:56:49] \o
[09:58:47] you mean whether allowing PVs is optional or not?
[09:58:57] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-serve1002.eqiad.wmnet with OS bullseye
[09:59:27] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-serve1003.eqiad.wmnet with OS bullseye
[09:59:59] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-serve1004.eqiad.wmnet with OS bullseye
[10:00:11] klausman: yes correct
[10:01:17] my gut tells me it would be fine until something requests such a volume, and then fail in mysterious ways
[10:02:33] but maybe there could be a permission/acl that makes it more obvious. or a 0-sized pseudo storage pool
[10:03:09] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-serve1005.eqiad.wmnet with OS bullseye
[10:03:40] klausman: we should try to experiment with kubeflow sooner rather than later, to have some ideas about Train Wing
[10:05:45] ok so I kicked off the reimages of ml-serve100[1-5]
[10:06:00] the extra caveat today is that hosts in row E/F have issues with reimages
[10:06:06] (like we found with DSE)
[10:06:19] ack
[10:06:22] so ml-serve1005 is the first of the E/F nodes, let's see how it goes
[10:06:35] are 6-8 in those rows?
[10:06:59] yep
[10:07:28] * isaranto is back
[10:08:38] https://netbox.wikimedia.org/search/?q=ml-serve1&obj_type=
[10:08:40] klausman: --^
[10:08:57] the usual symptom is that the first puppet run fails
[10:10:11] I see
[10:10:20] Here's hoping
[10:10:44] My left arm is extra spicy today for some reason, but I'll try to keep up
[10:20:11] klausman: is there a task/wikitech-page/etc.. with instructions about how to use lift wing from api.wikimedia.org? (Like bearer token etc..?)
[10:21:15] nothing that's quite ready. I have been chipping away at it.
[10:22:10] Overall, it's not super complicated. The bearer token stuff is in the base APIGW docs, and after that it's just a matter of constructing the query, e.g. curl -s "https://api.wikimedia.org/service/lw/inference/v1/models/enwiki-articletopic:predict" -X POST -d '{ "rev_id": 123555 }' -H "Authorization: Bearer $ACCESSTOKEN"
[10:23:47] okok
[10:24:06] I opened a task to the api platform folks to add documentation in api.w.o
[10:24:21] so hopefully we can add docs in there soon-ish
[10:24:34] The bulk of what would have to be on https://api.wikimedia.org/wiki/API_reference is what we have on wikitech already, edited for external users.
[10:25:34] I meant something like https://api.wikimedia.org/wiki/API_reference/Service/Link_recommendation/Get_link_recommendations
[10:26:31] Yeah, I was trying to write at least one example like that, but it fell off the todo pile :(
[10:27:32] sure sure, I will try to write some docs when they give us the green light to reach the MVP state
[10:27:44] so we can think about the soft launch
[10:27:48] thank you <3
[10:33:36] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ml-serve1004.eqiad.wmnet with OS bullseye completed: - ml-serve1004 (**WARN**)...
[10:33:52] unrelatedly, I came across this yesterday: https://github.com/jabbalaci/go-jsonpath very neat
[10:34:50] has a link to a python version at the bottom (the link I pasted is the Go version)
[10:35:35] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ml-serve1002.eqiad.wmnet with OS bullseye completed: - ml-serve1002 (**PASS**)...
[10:37:21] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ml-serve1003.eqiad.wmnet with OS bullseye completed: - ml-serve1003 (**PASS**)...
[10:42:52] klausman: if you have time, can you sanity check https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/892996 for me plis?
[10:43:05] it should be ok but better safe
[10:43:16] we have 4 nodes now so I can run admin_ng stuff
[10:43:30] checking
[10:44:02] lgtm
[10:44:36] thanks!
[11:04:08] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ml-serve1005.eqiad.wmnet with OS bullseye completed: - ml-serve1005 (**PASS**)...
[11:04:39] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-serve1006.eqiad.wmnet with OS bullseye
[11:05:01] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-serve1007.eqiad.wmnet with OS bullseye
[11:05:15] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-serve1008.eqiad.wmnet with OS bullseye
[11:06:22] klausman: 1005 reimaged fine, I kicked off the rest
[11:06:50] Ack
[11:09:40] ok so the cluster is up again with 1.23, the model servers are missing
[11:14:34] created https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/893417/ so we can tune the kserve's controller instances
[11:16:31] one comment :)
[11:18:58] Machine-Learning-Team, ORES, Wikimedia Enterprise: ores.wikimedia.org endpoint deprecation - https://phabricator.wikimedia.org/T330854 (isarantopoulos)
[11:22:41] elukey: I opened the above task. feel free to add/change anything
[11:32:55] super thanks!
[11:35:47] isaranto: I realized that the ORES dashboard for logstash catches only uncached traffic
[11:36:47] so this is why we didn't see most of the bots
[11:37:43] okok now the picture seems more complete
[11:37:46] I'll dig more into it
[11:37:47] * elukey lunch!
[11:38:10] klausman: FYI I left 3 reimages in progress, I hope to find them finished when I am back :D
[11:38:44] ack
[11:51:26] I'll keep an eye out for alerts
[12:33:45] * isaranto lunch
[13:51:44] Machine-Learning-Team: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ml-serve1007.eqiad.wmnet with OS bullseye executed with errors: - ml-serve1007 (**FAIL**) - Downti...
[13:52:21] Machine-Learning-Team: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-serve1007.eqiad.wmnet with OS bullseye
[13:56:12] Machine-Learning-Team: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ml-serve1007.eqiad.wmnet with OS bullseye executed with errors: - ml-serve1007 (**FAIL**) - Remove...
[13:57:25] elukey: what happened with 1007?
[14:02:35] klausman: so all nodes failed to get to d-i, of course I retried 1007 and realized it was switchover time, so I cancelled
[14:02:39] best timing ever
[14:03:09] oooh, I had not realized they were stuck. I should've checked on idrac
[14:30:33] Machine-Learning-Team: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ml-serve1006.eqiad.wmnet with OS bullseye executed with errors: - ml-serve1006 (**FAIL**) - Downti...
[14:30:40] Machine-Learning-Team: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ml-serve1008.eqiad.wmnet with OS bullseye executed with errors: - ml-serve1008 (**FAIL**) - Downti...
[14:33:34] Machine-Learning-Team: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin2002 for host ml-serve1006.eqiad.wmnet with OS bullseye
[15:09:12] Machine-Learning-Team: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin2002 for host ml-serve1006.eqiad.wmnet with OS bullseye executed with errors: - ml-serve1006 (**FAIL**) - Remove...
[15:09:22] Machine-Learning-Team: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin2002 for host ml-serve1006.eqiad.wmnet with OS bullseye
[15:55:21] Machine-Learning-Team: Migrate ORES clients to LiftWing - https://phabricator.wikimedia.org/T312518 (isarantopoulos)
[15:55:24] Machine-Learning-Team, Patch-For-Review: Create ORES migration endpoint (ORES/Liftwing translation) - https://phabricator.wikimedia.org/T330414 (isarantopoulos) Open→In progress
[16:27:53] Machine-Learning-Team, Edit-Review-Improvements-Integrated-Filters, Growth-Team, Research: Integration of Revert Risk Scores to Recent Changes as a filter - https://phabricator.wikimedia.org/T329071 (Ladsgroup) Hi, my sincere apologies for late answer, we are understaffed even more than the usual...
[17:23:11] klausman: so I was able to make ml-serve1006 PXE boot, I had to flip a setting in the BIOS to privilege one NIC over the other one when booting
[17:23:37] it didn't work a couple of times, then Arzhel started to check tcpdump logs for DHCP on install1004 and it PXE-booted
[17:23:51] not sure what kind of planet-alignment it was waiting for
[17:24:05] anyway, tomorrow I'll continue with 1007/1008
[17:24:22] very weird
[17:32:19] going afk for today, have a nice rest of the day folks!
[17:37:10] Weird stuff indeed.
[20:21:21] Machine-Learning-Team, Add-Link, Growth-Team: Northern Luri Wikipedia model training pipeline failed - https://phabricator.wikimedia.org/T330616 (kostajh) Open→Invalid
[20:21:24] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 11th round of wikis - https://phabricator.wikimedia.org/T308136 (kostajh)
[20:27:45] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), Patch-For-Review, User-notice-archive: Deploy "add a link" to 5th round of wikis - https://phabricator.wikimedia.org/T304549 (kostajh)