[07:05:03] <isaranto>	 good morning o/
[08:13:10] <klausman>	 Morning!
[08:13:38] <isaranto>	 \o
[08:14:46] <klausman>	 I'll do the rolling reboots of the codfw workers today. Aside from the GPU host, there should not be any disruption since we have enough capacity
[08:17:56] <isaranto>	 ack. thank you!
[08:45:57] <wikibugs>	 Machine-Learning-Team, DC-Ops: hw troubleshooting: memory errors during boot for ml-staging2001.codfw.wmnet - https://phabricator.wikimedia.org/T366670 (klausman) NEW
[08:47:08] <klausman>	 Looks like 2001 is having memory issues. I'll file a ticket with dcops and take it out of service once the roll-reboot of codfw is done
[08:47:24] <wikibugs>	 Machine-Learning-Team, DC-Ops: hw troubleshooting: memory errors during boot for ml-staging2001.codfw.wmnet - https://phabricator.wikimedia.org/T366670#9862654 (klausman) {F54950041} {F54950046}
[08:48:02] <wikibugs>	 Machine-Learning-Team, Goal: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services - https://phabricator.wikimedia.org/T362674#9862658 (isarantopoulos) - Identified that the latency issues are caused by revscoring preprocessing code when scoring large...
[08:59:15] <wikibugs>	 Machine-Learning-Team, DC-Ops: hw troubleshooting: memory errors during boot for ml-staging2001.codfw.wmnet - https://phabricator.wikimedia.org/T366670#9862709 (klausman) Can't upload the ASR since it's too large. Anywhere that I should upload it to?
[09:07:01] <moritzm>	 FYI, ml-etcd1003 is ticked off at https://phabricator.wikimedia.org/T366555, but it's actually still running 5.10.191, might have been a typo?
[09:07:07] <wikibugs>	 (CR) Ilias Sarantopoulos: locust: use multiple payloads for load testing (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1038346 (https://phabricator.wikimedia.org/T365554) (owner: Kevin Bazira)
[09:18:50] <klausman>	 moritzm: yeah, my bad, I rebooted 1002 twice :) will deal with it today
[09:23:22] <wikibugs>	 Machine-Learning-Team, DC-Ops: hw troubleshooting: memory errors during boot for ml-staging2001.codfw.wmnet - https://phabricator.wikimedia.org/T366670#9862766 (klausman) {F54950327} {F54950326}
[09:25:16] <moritzm>	 ack, thx
[09:37:20] <hnowlan>	 hello - I'm running sre.dns.netbox and ml-staging2001 is being set `profile::netbox::host::status: failed`
[09:37:24] <hnowlan>	 is that okay to proceed with? 
[09:38:14] <klausman>	 yes
[09:38:25] <hnowlan>	 thanks
[09:38:37] <klausman>	 (I think, I did that because wikitech told me so)
[09:38:53] <klausman>	 The host has dropped a DIMM, so I am taking it out of service.
[10:19:09] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+1] revscoring_model: inspect mw-api-cache for MP preprocess [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1038766 (https://phabricator.wikimedia.org/T363336) (owner: Elukey)
[10:19:25] <isaranto>	 I'm ready to merge the above patch and try it out on staging
[10:19:34] <isaranto>	 after lunch that is
[10:19:36] * isaranto lunch
[10:20:22] <wikibugs>	 (PS3) Kevin Bazira: locust: use multiple payloads for load testing [research/recommendation-api] - https://gerrit.wikimedia.org/r/1038346 (https://phabricator.wikimedia.org/T365554)
[10:24:12] <wikibugs>	 (CR) Kevin Bazira: locust: use multiple payloads for load testing (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1038346 (https://phabricator.wikimedia.org/T365554) (owner: Kevin Bazira)
[10:32:24] <wikibugs>	 Machine-Learning-Team, DC-Ops: hw troubleshooting: memory errors for ml-serve2007.codfw.wmnet - https://phabricator.wikimedia.org/T366688 (klausman) NEW
[10:32:47] <klausman>	 So we're now sorta-down (reduced RAM) two machines in codfw.
[10:41:43] <klausman>	 I'll try and expedite a fix with dcops
[10:41:46] * klausman lunch
[10:54:01] <wikibugs>	 (CR) Elukey: "Left a couple of comments for the more recent version, but overall I think we can proceed!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1038766 (https://phabricator.wikimedia.org/T363336) (owner: Elukey)
[11:11:36] <wikibugs>	 (PS5) Ilias Sarantopoulos: revscoring_model: inspect mw-api-cache for MP preprocess [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1038766 (https://phabricator.wikimedia.org/T363336) (owner: Elukey)
[11:17:45] <wikibugs>	 (CR) Ilias Sarantopoulos: revscoring_model: inspect mw-api-cache for MP preprocess (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1038766 (https://phabricator.wikimedia.org/T363336) (owner: Elukey)
[11:34:14] <wikibugs>	 (CR) AikoChou: [C:+1] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1038766 (https://phabricator.wikimedia.org/T363336) (owner: Elukey)
[12:33:37] <wikibugs>	 (CR) Elukey: [C:+1] "I left a comment to turn debug logging to info, but +1 when you are done!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1038766 (https://phabricator.wikimedia.org/T363336) (owner: Elukey)
[12:34:28] <elukey>	 isaranto: o/ I left a small comment for the heavy rev-id logging, but proceed afterwards!
[12:34:42] <elukey>	 I know that I proposed debug in the first place, but I think info is way better for our use case
[12:34:48] <elukey>	 less hassle etc..
[12:35:01] <elukey>	 especially during the next weeks, when we'll have to monitor how things go
[12:35:07] <elukey>	 thanks a lot for working on it!
[12:35:59] <isaranto>	 no worries it is ok! it makes sense for it to be info
[12:36:37] <isaranto>	 kevinbazira: o/ have you tested the patch for rec api load tests from an internal host (e.g. statbox)?
[12:37:01] <kevinbazira>	 yes, I have tested it on stat1008
[12:38:32] <wikibugs>	 (PS6) Ilias Sarantopoulos: revscoring_model: inspect mw-api-cache for MP preprocess [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1038766 (https://phabricator.wikimedia.org/T363336) (owner: Elukey)
[12:38:51] <isaranto>	 ok, thanks!
[12:43:39] <wikibugs>	 (CR) Ilias Sarantopoulos: "LGTM! Nice work" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1038346 (https://phabricator.wikimedia.org/T365554) (owner: Kevin Bazira)
[12:44:45] <wikibugs>	 (CR) Kevin Bazira: [C:+2] "Thanks for the review :)" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1038346 (https://phabricator.wikimedia.org/T365554) (owner: Kevin Bazira)
[12:45:59] <wikibugs>	 (Merged) jenkins-bot: locust: use multiple payloads for load testing [research/recommendation-api] - https://gerrit.wikimedia.org/r/1038346 (https://phabricator.wikimedia.org/T365554) (owner: Kevin Bazira)
[12:48:05] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+2] "Let's give this a try!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1038766 (https://phabricator.wikimedia.org/T363336) (owner: Elukey)
[12:48:48] <wikibugs>	 (Merged) jenkins-bot: revscoring_model: inspect mw-api-cache for MP preprocess [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1038766 (https://phabricator.wikimedia.org/T363336) (owner: Elukey)
[13:15:24] <isaranto>	 I updated the patch with the new image for the mw-api-cache change https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1038765
[13:18:02] <elukey>	 +1!
[13:18:18] <elukey>	 once deployed we can use the ores-legacy batch request stated in the task to check
[13:18:34] <elukey>	 ah no it is staging
[13:18:48] <elukey>	 do we need two replicas in staging isaranto ?
[13:19:49] <isaranto>	 elukey: my bad, that was a leftover from before. we just need 3 cpus to start with
[13:22:48] <elukey>	 perfect :)
[13:41:06] <elukey>	 https://github.com/kserve/kserve/releases/tag/v0.13.0 :O
[13:41:39] <isaranto>	 released 2 minutes ago! so fresh :)
[13:42:09] <elukey>	 there are two CVEs listed, in theory they shouldn't affect us but better to double check
[13:42:43] <elukey>	 also https://github.com/kserve/kserve/pull/3258 is another thing to check :(
[13:44:10] <elukey>	 and also https://github.com/kserve/kserve/pull/3362
[13:44:11] <elukey>	 uff
[13:45:19] <elukey>	 and also 3 more CVEs
[13:47:18] <aiko>	 o/ I'm going to deploy new revertrisk images to prod. the one for RRLA supports batch requests
[13:51:52] <klausman>	 Oh that istio change looks hairy
[13:56:10] <isaranto>	 a lot of changes indeed
[13:58:15] <klausman>	 Of the CVEs, the one I think most likely to be relevant is the http2 excessive headers one.
[13:58:44] <klausman>	 It's DoS, not breakin, so at least there's that
[14:10:58] <elukey>	 we shouldn't be receiving http2 conns from the api-gateway but better to check
[14:47:23] <klausman>	 moritzm: can you confirm ml-etcd1003 is now in the correct state?
[14:49:41] <moritzm>	 confirmed!
[14:49:58] <moritzm>	 https://debmonitor.wikimedia.org/kernels/1_1-smp-debian-510218-1-2024-06-01 prints all systems with the fixed kernel, BTW
[14:51:29] <klausman>	 merci!
[16:03:27] <isaranto>	 o/ I'm going afk for the evening, have a nice rest of day/evening!
[16:03:27] <isaranto>	 I still need to update/create the tasks we discussed and will do it first thing in the morning. Also, regarding the multiprocessing, I plan to test it with an instance of ores-legacy that uses liftwing-staging (will write it on the task as well)
[16:15:57] <aiko>	 bye Ilias! 
[16:26:17] <wikibugs>	 Machine-Learning-Team: Deploy RR-language-agnostic batch version to prod - https://phabricator.wikimedia.org/T358744#9864478 (achou) The new revertrisk images have been deployed to production.   Next steps: - Update [[ https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_reverted_risk_language_agnostic...
[16:34:11] <elukey>	 isaranto: o/ for tomorrow - I think that the mp changes will likely not cause any benefit in the ores-legacy calls (doing batches and/or containing at least one heavy rev-id) but it should allow other requests to be served
[16:34:25] <elukey>	 hopefully this will give some relief to other clients
[16:35:31] * elukey afk as well o/
[16:42:00] <wikibugs>	 Machine-Learning-Team: Test Revert Risk model with the transparent config - https://phabricator.wikimedia.org/T366250#9864620 (achou) Update:  I tested the Revert Risk models with the transparent config in staging. It worked without any issues. Notably, it seems that the transparent config somehow increases...