[06:39:49] (PS2) Kevin Bazira: Makefile: fix articlequality local-run [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1059298 (https://phabricator.wikimedia.org/T371677)
[06:48:06] (CR) Kevin Bazira: Makefile: fix articlequality local-run (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1059298 (https://phabricator.wikimedia.org/T371677) (owner: Kevin Bazira)
[07:31:19] hey folks!
[07:31:34] ml-serve2001 has been having trouble for the past two days
[07:33:22] a lot of OEM errors, plus "Multi-bit memory errors detected on a memory device at location(s) DIMM_B2."
[07:33:38] powercycling it
[07:34:51] seems to be something like https://phabricator.wikimedia.org/T313822 again
[07:59:58] ty for the powercycle. And yes, it's been doing that for a while :-/
[08:03:25] I vaguely remember there being a way to check its warranty status, trying to dig that up now
[08:07:06] Mh, looks like the link from the Phab ticket doesn't work (anymore). I'll poke DCOps about it
[08:16:57] Machine-Learning-Team, DC-Ops, ops-codfw: hw troubleshooting: ml-serve2001.codfw.wmnet: continued uncorrectable ECC errors - https://phabricator.wikimedia.org/T371872 (klausman) NEW
[08:17:40] elukey: I am a bit surprised there is no alerting for errors like that. Or am I missing something that I should have looked out for?
[08:32:13] klausman: I noticed calico pods not running on the node, but we also have host down alerts IIRC
[08:35:49] How did you miss the missing pods?
[08:36:11] er, how did you *notice* them
[08:40:40] 04:58 + FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:40:45] plus alerts.w.o
[08:40:58] they were firing yesterday too IIRC
[08:42:17] weird that I didn't notice the alerts.w.o alert. I have that in a pinned tab. Probably Okta logging me out or sth
[08:58:26] klausman: are the ml-cache nodes used? I'd need to upgrade and roll restart java on those
[08:58:37] Machine-Learning-Team, MW-1.43-notes (1.43.0-wmf.17; 2024-08-06), OKR-Work: Deploy Modernized Recommendation API to LiftWing - https://phabricator.wikimedia.org/T371465#10044355 (kevinbazira) Thank you for the update @santhosh. As we were testing the modernized recommendation-api endpoints to make su...
[08:58:53] elukey: go ahead, not currently in use
[08:59:18] okok
[10:03:19] * klausman lunch
[11:53:20] (PS3) AikoChou: readability: updates according to the new TRank model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1059032 (https://phabricator.wikimedia.org/T369712)
[11:54:48] (CR) AikoChou: readability: updates according to the new TRank model (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1059032 (https://phabricator.wikimedia.org/T369712) (owner: AikoChou)
[11:54:56] o/
[12:05:35] (CR) AikoChou: [C:+1] "Solved" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1059298 (https://phabricator.wikimedia.org/T371677) (owner: Kevin Bazira)
[12:36:08] (CR) Kevin Bazira: [C:+2] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1059298 (https://phabricator.wikimedia.org/T371677) (owner: Kevin Bazira)
[12:36:53] (Merged) jenkins-bot: Makefile: fix articlequality local-run [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1059298 (https://phabricator.wikimedia.org/T371677) (owner: Kevin Bazira)
[13:05:43] Morning all
[13:06:19] morning Chris!
[13:37:41] (CR) Kevin Bazira: "Thank you for working on this Aiko. Have you encountered this dependency conflict: https://phabricator.wikimedia.org/P67233" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1059032 (https://phabricator.wikimedia.org/T369712) (owner: AikoChou)
[14:45:59] Lift-Wing, Machine-Learning-Team: Request to host article-country model on Lift Wing - https://phabricator.wikimedia.org/T371897 (Isaac) NEW
[14:48:56] Lift-Wing, Machine-Learning-Team: Request to host article-country model on Lift Wing - https://phabricator.wikimedia.org/T371897#10045352 (Isaac)
[15:14:08] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: hw troubleshooting: ml-serve2001.codfw.wmnet: continued uncorrectable ECC errors - https://phabricator.wikimedia.org/T371872#10045443 (Papaul) a:Papaul→None
[15:16:15] Lift-Wing, Machine-Learning-Team: Request to host Reference Quality Model on Lift Wing - https://phabricator.wikimedia.org/T371902 (XiaoXiao-WMF) NEW
[15:39:54] (CR) AikoChou: "No, I didn't experience this issue when testing it locally with docker. How did you build the image 509ae6d52e92? I would suggest building" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1059032 (https://phabricator.wikimedia.org/T369712) (owner: AikoChou)
[17:21:07] (CR) Kevin Bazira: [C:+1] "I rebuilt the image and the model-server was able to run locally after updating the model path to `*/20240805140437/model.bin` as shown in" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1059032 (https://phabricator.wikimedia.org/T369712) (owner: AikoChou)
[18:06:17] Machine-Learning-Team, DC-Ops, ops-codfw: Q1:rack/setup/install ml-serve20[09-11] - https://phabricator.wikimedia.org/T371920 (RobH) NEW
[18:06:40] Machine-Learning-Team, DC-Ops, ops-codfw: Q1:rack/setup/install ml-serve20[09-11] - https://phabricator.wikimedia.org/T371920#10046349 (RobH)
[18:09:02] Machine-Learning-Team, DC-Ops, ops-codfw: Q1:rack/setup/install ml-serve20[09-11] - https://phabricator.wikimedia.org/T371920#10046361 (RobH)
[18:09:03] Machine-Learning-Team, DC-Ops, ops-codfw: Q1:rack/setup/install ml-serve20[09-11] - https://phabricator.wikimedia.org/T371920#10046352 (RobH) a:Jhancock.wm→klausman >>! In T366521#10045581, @Jhancock.wm wrote: > these servers are racked. and I'll have them all pingable on the mgmt network in...
[18:09:42] chrisalbon: ^^^ that means I can start setting up the codfw GPU hosts tomorrow, probably
[21:11:13] sweet