[05:41:50] (Abandoned) Santhosh: Initialize the cache on application startup [research/recommendation-api] - https://gerrit.wikimedia.org/r/1070481 (owner: Santhosh)
[05:41:56] (Abandoned) Santhosh: refactoring: Reorganize the code to initialize the cache [research/recommendation-api] - https://gerrit.wikimedia.org/r/1072169 (owner: Santhosh)
[06:14:16] Machine-Learning-Team, SRE, SRE-Access-Requests, LPL Essential (LPL Essential 2024 Jul-Sep): Access to deploy recommendation API ML service for kartik - https://phabricator.wikimedia.org/T376585#10209446 (santhosh) @isarantopoulos Agreed, let us recheck after two weeks. From our team perspective,...
[06:52:04] Machine-Learning-Team, SRE, SRE-Access-Requests, LPL Essential (LPL Essential 2024 Jul-Sep): Access to deploy recommendation API ML service for kartik - https://phabricator.wikimedia.org/T376585#10209471 (MoritzMuehlenhoff) @isarantopoulos Can you please ping this task once team-based permissions...
[06:53:15] Machine-Learning-Team, SRE, SRE-Access-Requests, LPL Essential (LPL Essential 2024 Jul-Sep): Access to deploy recommendation API ML service for kartik - https://phabricator.wikimedia.org/T376585#10209472 (MoritzMuehlenhoff) Open→Stalled p:Triage→Medium
[07:31:12] hello folks!
[08:09:23] (PS1) Ilias Sarantopoulos: langid: bump kserve to 0.13.1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1078605 (https://phabricator.wikimedia.org/T367048)
[09:20:21] Morning!
[09:21:13] Guten Tag o/
[09:36:19] klausman: ml-serve2001 seems down since yesterday - are you aware?
[09:53:19] yes, it's likely a hw failure, I am filing a dcops bug in a moment
[10:00:41] Machine-Learning-Team, DC-Ops, ops-codfw: hw troubleshooting: Likely memory issue on ml-serv2001.codfw.wmnet - https://phabricator.wikimedia.org/T376706 (klausman) NEW
[11:00:40] (PS1) AikoChou: locust: add reference-risk model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1077310 (https://phabricator.wikimedia.org/T372405)
[11:08:23] (PS2) AikoChou: locust: entry for reference-risk model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1077310 (https://phabricator.wikimedia.org/T372405)
[11:09:46] (PS3) AikoChou: locust: entry for reference-risk model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1077310 (https://phabricator.wikimedia.org/T372405)
[11:17:23] (CR) AikoChou: locust: entry for reference-risk model (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1077310 (https://phabricator.wikimedia.org/T372405) (owner: AikoChou)
[11:26:20] * klausman lunch
[11:26:28] (PS4) AikoChou: locust: entry for reference-risk model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1077310 (https://phabricator.wikimedia.org/T372405)
[11:28:49] (CR) AikoChou: locust: entry for reference-risk model (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1077310 (https://phabricator.wikimedia.org/T372405) (owner: AikoChou)
[11:38:22] * aiko lunch!
[11:45:05] * isaranto lunch!
[12:54:19] (CR) Ilias Sarantopoulos: [C:+1] "Thanks for taking care of the comments. I left a note of a potential issue. Other than that it LGTM" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1075033 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[13:03:31] hey folks!
[13:03:46] hope everything is fine in the ml world :)
[13:04:11] I am reaching out for https://phabricator.wikimedia.org/T376121, I'd need to run the provision cookbook for ml-serve20[09,10,11] and ml-staging2003
[13:04:30] the cookbook reboots the host, and IIUC those are already serving prod traffic
[13:05:02] I'd need to do it to apply the "canonical" config, some options may be wrong
[13:05:15] so I'd need, when you have time, to get those depooled so I can run the cookbook
[13:06:23] Machine-Learning-Team, Goal: ml-lab: create puppet role to install ROCm packages and make the machine accessible to people outside ML Team - https://phabricator.wikimedia.org/T376380#10210491 (klausman) Open→In progress p:Triage→High
[13:07:10] elukey: ack! do you have any concrete timeframe that would be best?
[13:07:44] klausman: anytime that you prefer, the cookbook should take 3/5 mins for each host
[13:08:00] Alright, we can do that now. I'll depool the staging one first
[13:08:16] (for example, /dev/kvm is available now because of virt options etc..)
[13:08:20] thanks!
[13:10:59] elukey: are the ml-serve1xxx Supermicro hosts unaffected?
[13:11:45] klausman: probably not, but on those we don't have the redfish license yet so I cannot check :D
[13:12:08] elukey@ml-serve1011:~$ ls /dev/kvm
[13:12:08] /dev/kvm
[13:12:11] yeah they are
[13:13:23] huh, neat
[13:14:05] elukey: ml-staging2003 is ready for you
[13:15:00] <3
[13:15:32] I'll also start draining the serve hosts
[13:21:21] of course I found a bug!
[13:21:51] Lubarski's law: there is always one more bug :)
[13:24:17] filed a change, turns out that the amd-related virt option wants "Enabled/Disabled", meanwhile the intel one "Enable/Disable"
[13:24:31] Ah, lovely.
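(Editor's note) The vendor mismatch mentioned at 13:24:17 can be illustrated with a small lookup table. This is only a sketch under stated assumptions: the names `VIRT_VALUES` and `virt_option_value` are hypothetical and are not taken from the provision cookbook or from the change that was filed; only the two value spellings come from the log.

```python
# Hypothetical illustration; not the actual cookbook code.
# Per the log, the AMD-related virt option expects "Enabled"/"Disabled",
# while the Intel one expects "Enable"/"Disable".
VIRT_VALUES = {
    "amd": {True: "Enabled", False: "Disabled"},
    "intel": {True: "Enable", False: "Disable"},
}


def virt_option_value(cpu_vendor: str, enabled: bool) -> str:
    """Return the value string a given vendor's virt option expects."""
    try:
        return VIRT_VALUES[cpu_vendor.lower()][enabled]
    except KeyError:
        raise ValueError(f"unknown CPU vendor: {cpu_vendor!r}") from None
```

Keeping the spellings in one table means a wrong value for an unknown vendor fails loudly instead of being written to the BIOS settings silently.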
[13:24:45] I'm available for review if you want
[13:26:37] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: hw troubleshooting: Likely memory issue on ml-serv2001.codfw.wmnet - https://phabricator.wikimedia.org/T376706#10210578 (Papaul) @klausman thank you for opening the task. Will it be possible for us to have the info on what DIMM(s) is having issues? T...
[13:26:51] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: hw troubleshooting: Likely memory issue on ml-serv2001.codfw.wmnet - https://phabricator.wikimedia.org/T376706#10210561 (Papaul) a:Papaul→None
[13:27:33] waiting for CI to finish :)
[13:36:39] ml-serve2009 is also drained. I'll wait with the other two for now
[13:38:03] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: hw troubleshooting: Likely memory issue on ml-serv2001.codfw.wmnet - https://phabricator.wikimedia.org/T376706#10210612 (klausman) These are the most recent entries from ipmi SEL: `115 | Sep-30-2024 | 01:30:13 | ECC Uncorr Err | Memory...
[13:55:22] elukey: I see the machine rebooted, but no /dev/kvm
[13:56:01] yep finally!
[13:56:10] BIOS: BootModeSelect is set to Dual, while we want Legacy
[13:56:10] BIOS: QuietBoot is set to True, while we want False
[13:56:10] BIOS: SVMMode is set to Enabled, while we want Disabled
[13:57:12] good to be repooled
[13:57:19] klausman: ok if I proceed with 2009?
[13:59:09] o/ Luca, thanks for taking care of this!
[14:02:33] o/
[14:02:46] proceeding with 2009, I saw only now that it is already depooled
[14:02:53] elukey: yes
[14:03:33] should be up in a bit
[14:06:03] klausman: we can do 2010/11 tomorrow, no rush
[14:06:10] these two seem to work fine
[14:10:47] :+1: I've uncordoned the 2009 now
[14:12:17] thanks!
[14:12:27] sadly we may have to do it for all the new ml nodes
[14:12:40] So the 1xxx as well?
[14:12:51] the newer ones, yes
[14:13:45] ack
[14:33:08] (CR) Hashar: "recheck after having deployed https://gerrit.wikimedia.org/r/c/integration/config/+/1078538" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1075033 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[14:33:24] (CR) CI reject: [V:-1] article-country: initial commit [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1075033 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[15:31:17] * klausman afk
[16:29:36] alright folks, I'm going afk, cu tomorrow!
[19:16:40] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: hw troubleshooting: Likely memory issue on ml-serv2001.codfw.wmnet - https://phabricator.wikimedia.org/T376706#10212178 (Jhancock.wm) looks like B1 is the problem. I do have a stick we can replace it with. we can do this first thing in the morning on...