[08:09:00] Good day, folks o/
[09:53:36] Machine-Learning-Team, Wikipedia-Android-App-Backlog (Android Release - FY2023-24): Migrate Machine-generated Article Descriptions from toolforge to liftwing. - https://phabricator.wikimedia.org/T343123 (isarantopoulos) As part of the Lift Wing expansion we are on path to have GPUs installed in the next...
[10:37:04] Morning \o
[10:37:39] isaranto: I have just pushed a temp change to ml-staging, dropping the version #s from the apigroups as we discussed. When you have some time, can you see if deleting revisions now works for you?
[10:37:54] hey Tobias! on it now
[10:38:07] ty!
[10:39:43] it works \o/
[10:39:49] a-ha!
[10:39:55] I'll make a patch
[10:39:55] I deleted the revertrisk isvc
[10:41:59] And I presume it automatically came back?
[10:42:30] nope I synced
[10:42:34] ah, right
[10:42:36] I'm doing it again
[10:43:47] if I delete the pod it will come back cause it is defined in the isvc. iirc the isvc is the top level object that defines everything
[10:44:35] right, brian fart on my end
[10:44:42] er. brain. Poor brian.
[10:49:10] The download failure is being tracked in T356792. It seems like we know what config causes it (it's a 65s timeout on the connection). Not sure yet what the fix will be.
[10:54:10] * klausman lunch
[10:56:47] klausman: for when you're back: I'd roll out the runc updates to the remaining ML clusters? no need to stagger these further from my PoV, all other k8s clusters have been updated without issues at this point
[11:20:52] yes, go ahead
[11:22:09] moritzm: ^^^
[11:22:36] on it
[11:24:14] all updated
[11:29:15] (PS5) Ilias Sarantopoulos: WIP - locust: add article_descriptions load tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/995039
[11:32:32] (PS6) Ilias Sarantopoulos: locust: add article_descriptions load tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/995039 (https://phabricator.wikimedia.org/T353952)
[11:33:01] (PS7) Ilias Sarantopoulos: locust: add article_descriptions load tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/995039 (https://phabricator.wikimedia.org/T353952)
[11:43:33] TIL https://github.com/abiosoft/colima
[12:15:11] * isaranto afk lunch!
[13:10:25] (CR) Kevin Bazira: "Following instructions from `/test/locust/README.md`, I ran:" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/995039 (https://phabricator.wikimedia.org/T353952) (owner: Ilias Sarantopoulos)
[13:49:51] isaranto: kevinbazira: any objections to me doing a rolling drain/undrain cycle on serve-eqiad for the runc update?
[13:52:49] klausman: go ahead!
[13:57:10] (CR) Ilias Sarantopoulos: "That is the expected behavior. The specified argument (`results/article_descriptions`) is the filename (actually the prefix) for the resul" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/995039 (https://phabricator.wikimedia.org/T353952) (owner: Ilias Sarantopoulos)
[13:59:47] Machine-Learning-Team, SRE, Patch-For-Review: Requesting write access to ml-staging-codfw for ML team - https://phabricator.wikimedia.org/T354516 (klausman) After dropping the version specifiers (`/v...`) at the end of the `apiGroups` directives, this is now working properly.
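Editorial sketch of the `apiGroups` fix mentioned above: Kubernetes RBAC rules match on the API group name only, so an entry that carries a trailing `/version` will not match the intended resources. The helper below is a minimal illustration, not the actual patch; the function name `strip_api_versions`, the group names, the resources, and the verbs are all assumptions chosen for the example, since the log does not show the real rule.

```python
def strip_api_versions(rule: dict) -> dict:
    """Return a copy of an RBAC rule whose apiGroups entries carry no '/version' suffix."""
    cleaned = dict(rule)
    # Keep only the part before the first '/', i.e. the bare API group name.
    # The core group is the empty string and passes through unchanged.
    cleaned["apiGroups"] = [g.split("/", 1)[0] for g in rule.get("apiGroups", [])]
    return cleaned


# Hypothetical rule for illustration only; not taken from the ml-staging config.
rule = {
    "apiGroups": ["serving.kserve.io/v1beta1", "serving.knative.dev/v1", ""],
    "resources": ["inferenceservices", "revisions"],
    "verbs": ["get", "list", "delete"],
}

fixed = strip_api_versions(rule)
print(fixed["apiGroups"])
```

The input rule is left untouched, so the before and after versions can be compared when reviewing such a change.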
[14:14:44] klausman: o/ no objection from me either :)
[14:15:28] Machine-Learning-Team: Do a rolling drain/undrain of LW k8s nodes in eqiad and codfw - https://phabricator.wikimedia.org/T356867 (klausman) a:klausman
[14:32:09] All done for eqiad. I will do codfw tomorrow
[14:32:40] Machine-Learning-Team: Do a rolling drain/undrain of LW k8s nodes in eqiad and codfw - https://phabricator.wikimedia.org/T356867 (klausman)
[14:35:50] Machine-Learning-Team: Do a rolling drain/undrain of LW k8s nodes in eqiad and codfw - https://phabricator.wikimedia.org/T356867 (klausman)
[14:40:26] (CR) Kevin Bazira: [C: +1] "Great! Thank you for the clarification." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/995039 (https://phabricator.wikimedia.org/T353952) (owner: Ilias Sarantopoulos)
[14:42:57] Machine-Learning-Team: Downtime ml-cache2001 for network link move - https://phabricator.wikimedia.org/T356873 (klausman)
[14:43:17] Machine-Learning-Team: Downtime ml-cache2001 for network link move - https://phabricator.wikimedia.org/T356873 (klausman)
[14:45:04] Machine-Learning-Team: Downtime ml-cache2001 for network link move - https://phabricator.wikimedia.org/T356873 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3787bbda-c63c-44e0-8670-cc787f18b203) set by klausman@cumin2002 for 3:00:00 on 1 host(s) and their services with reason: Machine ne...
[14:45:29] Machine-Learning-Team: Downtime ml-cache2001 for network link move - https://phabricator.wikimedia.org/T356873 (klausman) Open→Resolved Downtime has been added.
[16:19:09] logging off folks! have a nice rest of day/evening
[16:43:36] \o