[06:45:39] Good morning o/
[08:40:44] FIRING: LiftWingServiceErrorRate: ...
[08:40:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=hewiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[08:45:44] RESOLVED: LiftWingServiceErrorRate: ...
[08:45:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=hewiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[08:59:31] same pattern above. Keeping a list of the model servers that fire at least one alert
[09:10:46] * isaranto lunch!
[09:36:25] o/
[09:36:57] Good day
[09:56:37] hello everyone.
[09:57:17] isaranto: one thing about those alerts though: not one of the services that alerted in the past (like vi), but he and ar instead.
[10:19:59] ack
[10:20:14] * klausman lunch
[12:09:51] Machine-Learning-Team, Goal: 2024 Q4: Users can "pip install liftwing" and access 20% of models - https://phabricator.wikimedia.org/T359140#9903038 (isarantopoulos) `liftwing` package version 0.1.0 has been released on PyPI - https://pypi.org/project/liftwing/ Just released version 0.1.0 of the liftwing...
[12:16:14] (PS2) AikoChou: articlequality: add feature preprocess [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1046720 (https://phabricator.wikimedia.org/T360455)
[12:17:02] (CR) AikoChou: articlequality: add feature preprocess (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1046720 (https://phabricator.wikimedia.org/T360455) (owner: AikoChou)
[12:43:06] (CR) Ilias Sarantopoulos: [C:+1] articlequality: add feature preprocess [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1046720 (https://phabricator.wikimedia.org/T360455) (owner: AikoChou)
[12:48:32] Machine-Learning-Team, Cloud-VPS (Debian Buster Deprecation): Cloud VPS "machine-learning" project Buster deprecation - https://phabricator.wikimedia.org/T367537#9903162 (elukey)
[13:09:38] Good morning all
[13:19:28] Good morning o/
[13:23:04] Machine-Learning-Team: Reimage all ml-serve machines with Bookworm - https://phabricator.wikimedia.org/T367875 (klausman) NEW
[13:23:21] Morning Chris!
[13:34:13] (CR) Kevin Bazira: articlequality: add feature preprocess (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1046720 (https://phabricator.wikimedia.org/T360455) (owner: AikoChou)
[13:57:24] I have a surprise for the team meeting!
[13:57:29] :)
[13:57:31] uh oh :)
[13:57:36] good one!
[13:57:39] always
[14:25:14] Machine-Learning-Team, DC-Ops, ops-codfw, SRE, Patch-For-Review: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9903575 (Jhancock.wm)
[14:25:42] Machine-Learning-Team, DC-Ops, ops-codfw, SRE, Patch-For-Review: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9903577 (Jhancock.wm)
[14:30:48] Machine-Learning-Team, Goal: 2024 Q4 Goal: Revert Risk models are supported by caching in production - https://phabricator.wikimedia.org/T362672#9903587 (klausman) Current state: Three changes open/wip: - https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/995001 -- Actual co...
[14:34:57] Machine-Learning-Team, DC-Ops, ops-codfw, SRE, Patch-For-Review: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9903608 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ml-staging2003.codfw.wmnet with OS boo...
[14:39:00] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: hw troubleshooting: memory errors during boot for ml-staging2001.codfw.wmnet - https://phabricator.wikimedia.org/T366670#9903617 (klausman) Machine is drained and off, so you're free to reseat memory etc. Let me know when it's back (and what we might...
[14:41:03] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: hw troubleshooting: memory errors during boot for ml-staging2001.codfw.wmnet - https://phabricator.wikimedia.org/T366670#9903626 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ebd7c06d-d85d-4a91-a22b-6101091bac81) set by klausman@c...
[14:50:41] Machine-Learning-Team, Cloud-VPS (Debian Buster Deprecation): Cloud VPS "machine-learning" project Buster deprecation - https://phabricator.wikimedia.org/T367537#9903667 (calbon) a:klausman
[14:53:37] Machine-Learning-Team: Update blubber version in docker images - https://phabricator.wikimedia.org/T367293#9903676 (calbon) a:klausman
[14:54:15] Machine-Learning-Team: Update blubber version in docker images - https://phabricator.wikimedia.org/T367293#9903678 (calbon) a:klausman→isarantopoulos
[14:55:22] Machine-Learning-Team: Solve revscoring models increased latencies for big revision sizes - https://phabricator.wikimedia.org/T366772#9903686 (calbon) a:AikoChou
[15:04:21] could I get a review here https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1047106?
[15:04:30] thanks :)
[15:15:32] +1'd
[15:26:05] Machine-Learning-Team, DC-Ops, ops-codfw, SRE, Patch-For-Review: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9903816 (klausman) It looks like the primary interface can't see the network device (the console shows "media test failure, check cable". {F55438869}
[15:26:34] Machine-Learning-Team, Language-Team, Epic: Migrate Content Translation Recommendation API to Lift Wing - https://phabricator.wikimedia.org/T308164#9903817 (Isaac) @kevinbazira will we have a page on the API Gateway to link to as documentation purposes [[https://api.wikimedia.org/wiki/Lift_Wing_API/R...
[15:55:08] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9903947 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ml-staging2003.codfw.wmnet with OS bookworm executed with errors...
[16:05:45] added a new update for the alert earlier today - >https://phabricator.wikimedia.org/T363336#9904002
[16:06:14] going afk folks o/
[16:16:50] isaranto: thanks for checking it out!
[16:17:07] have a nice evening o/
[16:53:17] update: gpu host is recalcitrant regarding netboot (for imaging) but dcops and me are working on it
[16:54:18] Machine-Learning-Team, Observability-Metrics, SRE Observability (FY2023/2024-Q4): Gap in metrics rendered from Thanos Rules - https://phabricator.wikimedia.org/T352756#9904249 (elukey) I tried to add more granularity to the Thanos Rule Grafana UI, to be able to drill down each rule separately. [[ htt...
[17:16:25] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9904356 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ml-staging2003.codfw.wmnet with OS bookworm
[23:30:26] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9905564 (Papaul) @Jhancock.wm @RobH some information on this server. **Information1** The server came with 2 network add-on cards: - 1st card connected to slot A1 is...
[23:35:33] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9905575 (Papaul)