[07:05:03] good morning o/
[08:13:10] Morning!
[08:13:38] \o
[08:14:46] I'll do the rolling reboots of the codfw workers today. Aside from the GPU host, there should not be any disruption since we have enough capacity
[08:17:56] ack. thank you!
[08:45:57] Machine-Learning-Team, DC-Ops: hw troubleshooting: memory errors during boot for ml-staging2001.codfw.wmnet - https://phabricator.wikimedia.org/T366670 (klausman) NEW
[08:47:08] Looks like 2001 is having memory issues. I'll file a ticket with dcops and take it out of service once the roll-reboot of codfw is done
[08:47:24] Machine-Learning-Team, DC-Ops: hw troubleshooting: memory errors during boot for ml-staging2001.codfw.wmnet - https://phabricator.wikimedia.org/T366670#9862654 (klausman) {F54950041} {F54950046}
[08:48:02] Machine-Learning-Team, Goal: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services - https://phabricator.wikimedia.org/T362674#9862658 (isarantopoulos) - Identified that the latency issues are caused by revscoring preprocessing code when scoring large...
[08:59:15] Machine-Learning-Team, DC-Ops: hw troubleshooting: memory errors during boot for ml-staging2001.codfw.wmnet - https://phabricator.wikimedia.org/T366670#9862709 (klausman) Can't upload the ASR since it's too large. Anywhere that I should upload it to?
[09:07:01] FYI, ml-etcd1003 is ticked off at https://phabricator.wikimedia.org/T366555, but it's actually still running 5.10.191, might have been a typo?
[09:07:07] (CR) Ilias Sarantopoulos: locust: use multiple payloads for load testing (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1038346 (https://phabricator.wikimedia.org/T365554) (owner: Kevin Bazira)
[09:18:50] moritzm: yeah, my bad, I rebooted 1002 twice :) will deal with it today
[09:23:22] Machine-Learning-Team, DC-Ops: hw troubleshooting: memory errors during boot for ml-staging2001.codfw.wmnet - https://phabricator.wikimedia.org/T366670#9862766 (klausman) {F54950327} {F54950326}
[09:25:16] ack, thx
[09:37:20] hello - I'm running sre.dns.netbox and ml-staging2001 is being set `profile::netbox::host::status: failed`
[09:37:24] is that okay to proceed with?
[09:38:14] yes
[09:38:25] thanks
[09:38:37] (I think, I did that because wikitech told me so)
[09:38:53] The host has dropped a DIMM, so I am taking it out of service.
[10:19:09] (CR) Ilias Sarantopoulos: [C:+1] revscoring_model: inspect mw-api-cache for MP preprocess [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1038766 (https://phabricator.wikimedia.org/T363336) (owner: Elukey)
[10:19:25] I'm ready to merge the above patch and try it out on staging
[10:19:34] after lunch that is
[10:19:36] * isaranto lunch
[10:20:22] (PS3) Kevin Bazira: locust: use multiple payloads for load testing [research/recommendation-api] - https://gerrit.wikimedia.org/r/1038346 (https://phabricator.wikimedia.org/T365554)
[10:24:12] (CR) Kevin Bazira: locust: use multiple payloads for load testing (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1038346 (https://phabricator.wikimedia.org/T365554) (owner: Kevin Bazira)
[10:32:24] Machine-Learning-Team, DC-Ops: hw troubleshooting: memory errors for ml-serve2007.codfw.wmnet - https://phabricator.wikimedia.org/T366688 (klausman) NEW
[10:32:47] So we're now sorta-down (reduced RAM) two machines in codfw.
[10:41:43] I'll try and expedite a fix with dcops
[10:41:46] * klausman lunch
[10:54:01] (CR) Elukey: "Left a couple of comments for the more recent version, but overall I think we can proceed!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1038766 (https://phabricator.wikimedia.org/T363336) (owner: Elukey)
[11:11:36] (PS5) Ilias Sarantopoulos: revscoring_model: inspect mw-api-cache for MP preprocess [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1038766 (https://phabricator.wikimedia.org/T363336) (owner: Elukey)
[11:17:45] (CR) Ilias Sarantopoulos: revscoring_model: inspect mw-api-cache for MP preprocess (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1038766 (https://phabricator.wikimedia.org/T363336) (owner: Elukey)
[11:34:14] (CR) AikoChou: [C:+1] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1038766 (https://phabricator.wikimedia.org/T363336) (owner: Elukey)
[12:33:37] (CR) Elukey: [C:+1] "I left a comment to turn debug logging to info, but +1 when you are done!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1038766 (https://phabricator.wikimedia.org/T363336) (owner: Elukey)
[12:34:28] isaranto: o/ I left a small comment for the heavy rev-id logging, but proceed afterwards!
[12:34:42] I know that I proposed debug in the first place, but I think info is way better for our use case
[12:34:48] less hassle etc..
[12:35:01] especially during the next weeks that we'll have to monitor how things go
[12:35:07] thanks a lot for working on it!
[12:35:59] no worries it is ok! it makes sense for it to be info
[12:36:37] kevinbazira: o/ have you tested the patch for rec api load tests from an internal host (e.g. statbox)?
[12:37:01] yes, I have tested it on stat1008
[12:38:32] (PS6) Ilias Sarantopoulos: revscoring_model: inspect mw-api-cache for MP preprocess [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1038766 (https://phabricator.wikimedia.org/T363336) (owner: Elukey)
[12:38:51] ok, thanks!
[12:43:39] (CR) Ilias Sarantopoulos: "LGTM! Nice work" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1038346 (https://phabricator.wikimedia.org/T365554) (owner: Kevin Bazira)
[12:44:45] (CR) Kevin Bazira: [C:+2] "Thanks for the review :)" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1038346 (https://phabricator.wikimedia.org/T365554) (owner: Kevin Bazira)
[12:45:59] (Merged) jenkins-bot: locust: use multiple payloads for load testing [research/recommendation-api] - https://gerrit.wikimedia.org/r/1038346 (https://phabricator.wikimedia.org/T365554) (owner: Kevin Bazira)
[12:48:05] (CR) Ilias Sarantopoulos: [C:+2] "Let's give this a try!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1038766 (https://phabricator.wikimedia.org/T363336) (owner: Elukey)
[12:48:48] (Merged) jenkins-bot: revscoring_model: inspect mw-api-cache for MP preprocess [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1038766 (https://phabricator.wikimedia.org/T363336) (owner: Elukey)
[13:15:24] I updated the patch with the new image for the mw-api-cache change https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1038765
[13:18:02] +1!
[13:18:18] once deployed we can use the ores-legacy batch request stated in the task to check
[13:18:34] ah no it is staging
[13:18:48] do we need two replicas in staging isaranto ?
[13:19:49] elukey: my bad was leftover from before. we just need 3 cpus to start with
[13:22:48] perfect :)
[13:41:06] https://github.com/kserve/kserve/releases/tag/v0.13.0 :O
[13:41:39] released 2 minutes ago!
so fresh :)
[13:42:09] there are two CVEs listed, in theory they shouldn't affect us but better to double check
[13:42:43] also https://github.com/kserve/kserve/pull/3258 is another thing to check :(
[13:44:10] and also https://github.com/kserve/kserve/pull/3362
[13:44:11] uff
[13:45:19] and also 3 more CVEs
[13:47:18] o/ I'm going to deploy new revertrisk images to prod. the one for RRLA supports batch requests
[13:51:52] Oh that istio change looks hairy
[13:56:10] a lot of changes indeed
[13:58:15] Of the CVEs, the one I think most likely to be relevant is the http2 excessive headers one.
[13:58:44] It's DoS, not breakin, so at least there's that
[14:10:58] we shouldn't be receiving http2 conns from the api-gateway but better to check
[14:47:23] moritzm: can you confirm ml-etcd1003 is now in the correct state?
[14:49:41] confirmed!
[14:49:58] https://debmonitor.wikimedia.org/kernels/1_1-smp-debian-510218-1-2024-06-01 prints all systems with the fixed kernel, BTW
[14:51:29] merci!
[16:03:27] o/ I'm going afk for the evening, have a nice rest of day/evening!
[16:03:27] I owe to update/create the tasks we discussed and will do it first thing in the morning. Also regarding the multiprocessing I plan to test it with an instance of ores-legacy that uses liftwing-staging (will write it on the task as well)
[16:15:57] bye Ilias!
[16:26:17] Machine-Learning-Team: Deploy RR-language-agnostic batch version to prod - https://phabricator.wikimedia.org/T358744#9864478 (achou) The new revertrisk images have been deployed to production. Next steps: - Update [[ https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_reverted_risk_language_agnostic...
[16:34:11] isaranto: o/ for tomorrow - I think that the mp changes will likely not cause any benefit in the ores-legacy calls (doing batches and/or containing at least one heavy rev-id) but it should allow other requests to be served
[16:34:25] hopefully this will give some relief to other clients
[16:35:31] * elukey afk as well o/
[16:42:00] Machine-Learning-Team: Test Revert Risk model with the transparent config - https://phabricator.wikimedia.org/T366250#9864620 (achou) Update: I tested the Revert Risk models with the transparent config in staging. It worked without any issues. Notably, it seems that the transparent config somehow increases...