[03:36:44] FIRING: LiftWingServiceErrorRate: ...
[03:36:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-reverted&var-backend=viwiki-reverted-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[04:01:44] RESOLVED: LiftWingServiceErrorRate: ...
[04:01:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-reverted&var-backend=viwiki-reverted-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[08:49:54] morning o/
[09:19:11] \o
[09:22:12] Machine-Learning-Team: Patch Location headers of HTTP redirects coming from the MW API in Lift Wing services - https://phabricator.wikimedia.org/T363725#9841160 (achou) a: achou
[09:39:05] Machine-Learning-Team, Language-Team, Epic: Migrate Content Translation Recommendation API to Lift Wing - https://phabricator.wikimedia.org/T308164#9841200 (KCVelaga_WMF) @Pginer-WMF is right. For the past 90 days, here are the numbers from [[ https://gerrit.wikimedia.org/r/plugins/gitiles/schemas/e...
[10:27:10] * klausman lunch
[10:34:57] Machine-Learning-Team, Patch-For-Review: Configure the logo-detection model-server hosted on LiftWing to process images from Wikimedia Commons - https://phabricator.wikimedia.org/T363449#9841433 (kevinbazira) In T363506#9794170, it was agreed that the logo-detection model-server should process base64 enc...
[11:44:35] aiko, kevinbazira I am going to re-image ml-staging2002 to bookworm. As a result, I'll disrupt some of the services, and it's likely some will be unable to schedule with just one node. Lmk if that would disrupt any of your work unduly.
[11:48:49] (I'll start work on the hour, so you still have time to stop me :))
[11:49:09] klausman: o/ sure sure, the structured content team reached out this morning letting me know that they were going to be testing the logo-detection model-server that was deployed on staging. I'll let you know in case they face any disruptions. Other than that, I have no objections.
[11:49:26] Roger!
[11:50:07] ack, not a problem
[12:03:33] (CR) AikoChou: "Hi Kevin, thanks for working on this! I ran the test locally and it worked without any issues. I have a question out of curiosity. The tes" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1035868 (https://phabricator.wikimedia.org/T365554) (owner: Kevin Bazira)
[12:42:58] Host is reinstalled and pods are scheduling again (some are still in PodInit phase)
[12:54:05] And everything's running again
[12:54:22] kevinbazira: there shouldn't be any (further) disruption for the logo-d model
[12:56:27] klausman: yep, the logo-detection service is up and running. thanks!
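For context on the logo-detection testing mentioned above, a client exercising the staging model-server might look roughly like the sketch below. This is a minimal illustration only: the Commons file URL, the staging endpoint, and the "instances"/"b64" payload schema (a common KServe-style convention) are assumptions and may not match what the logo-detection service actually expects.

```python
import base64
import requests

# Hypothetical placeholders -- the real Commons file, staging host, and
# payload schema may differ from what the logo-detection service expects.
COMMONS_FILE_URL = "https://upload.wikimedia.org/example/Some_logo.png"  # placeholder
PREDICT_URL = "https://inference-staging.example/v1/models/logo-detection:predict"  # placeholder

# Fetch the image bytes from Commons and base64-encode them, since the task
# above says the model-server should accept base64-encoded images.
image_bytes = requests.get(COMMONS_FILE_URL, timeout=30).content
payload = {
    "instances": [
        {"image": {"b64": base64.b64encode(image_bytes).decode("utf-8")}}
    ]
}

resp = requests.post(PREDICT_URL, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json())
```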
[12:56:57] I'll have a think about reimaging the prod machines, but that is the opposite of urgent :)
[12:57:39] klausman: o/ please hold off on reimaging the prod hosts, there is still an open question about what to do with the dragonfly packages
[12:57:44] only staging please :)
[12:58:01] nono, I am not doing it this week, this month, or this quarter.
[12:58:34] But I want to file it as a mid-term todo, at least unless we magically replace all the hw before I get to do so.
[13:00:42] Bullseye enters "Deprecate" status in mid-2024; I definitely won't do it before then, but probably in late 2024/early 2025, unless we have a strong reason not to.
[13:00:44] okok
[13:01:14] The "think" bit was about how to do it and what other criteria (beyond "it works in staging") might be relevant
[13:03:23] Oh, and the Supermicro machines, when they arrive, will be Bookworm, just by virtue of having GPUs. Fortunately, there seems to be no issue with running a mixed cluster (also thanks to your work on proving that GPUs need nothing host-side but kernel support)
[13:12:12] (CR) Kevin Bazira: "Thank you for testing this, Aiko. Yes, the test logic was borrowed from what we use in the LW isvcs. I rewrote some bits for a couple of r" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1035868 (https://phabricator.wikimedia.org/T365554) (owner: Kevin Bazira)
[13:18:18] so I think that when the dragonfly packages issue is sorted, we can safely use Bookworm for prod
[13:19:03] Alex will migrate to containerd over the next few weeks IIUC, but it uses runc behind the scenes
[13:19:29] already present in bookworm etc.
[13:20:02] so we should be really good, so far nothing has popped up other than the kubelet partition size
[13:54:15] Aye. And I had meant to do the partman bits ever since we did the manual bump, but then I (of course) forgot, and got reminded with the bookworm bump. Given that there are new machines on the horizon, I thought "now's the time" ;)
[15:09:37] (PS2) AikoChou: revertrisk: modify the response to dict type in batch model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1035012 (https://phabricator.wikimedia.org/T358744)
[15:10:37] (PS3) AikoChou: revertrisk: modify the response to dict type in batch model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1035012 (https://phabricator.wikimedia.org/T358744)
[15:13:02] (PS2) AikoChou: outlink: move test_transformer to unit test directory [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1031393
[15:26:32] folks, qq - has anybody checked the errors for eswiki and viwiki that fired recently? Are they always the same, or something different?
[15:27:03] the viwiki ones seem related to ores-legacy batch calls to Lift Wing, maybe we can do something about it. Not sure if it is all related to heavy rev-ids
[15:27:14] maybe we should open a task for viwiki and investigate
[15:43:23] elukey: I checked the kserve logs for eswiki, but haven't checked viwiki
[15:43:42] It looks like the same issue related to heavy rev-ids
[15:44:29] how do you know it is related to ores-legacy batch calls? From the dashboard?
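For reference, the kind of ores-legacy batch call discussed above (one HTTP request scoring many revisions at once) looks roughly like the sketch below. The revision IDs are invented for illustration; only the ORES v3 URL shape and the pipe-separated revids parameter reflect the public API that ores-legacy serves.

```python
import requests

# Illustrative ORES v3 batch call. The revision IDs below are made up; the URL
# shape and the pipe-separated "revids" parameter follow the public ORES API.
# Each revid in a batch presumably becomes a separate viwiki-reverted
# prediction behind ores-legacy, which is why large batches can end in 504s.
ORES_URL = "https://ores.wikimedia.org/v3/scores/viwiki/"
params = {
    "models": "reverted",
    "revids": "71000001|71000002|71000003",  # hypothetical revision IDs
}
headers = {"User-Agent": "ml-team-example/0.1 (debugging illustration)"}

resp = requests.get(ORES_URL, params=params, headers=headers, timeout=60)
resp.raise_for_status()
print(resp.json())
```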
[15:46:20] exactly yes
[15:46:40] I checked the istio gateway dashboard on logstash: https://logstash.wikimedia.org/goto/746c79239b2eefecc80dceef3c91b031
[15:47:10] I filtered for ml-serve codfw and response 0 (that is basically a timeout for the user in istio's response codes)
[15:47:37] then you can also filter for viwiki-reverted in the upstream cluster breakdown
[15:48:25] if you select the time window with most of the errors, you'll see that the user_agent is often ORES Legacy
[15:49:15] but then if you want more info about the actual ores calls, you need to backtrack a little and do the same for ores-legacy
[15:49:50] and you end up with something like https://logstash.wikimedia.org/goto/746c79239b2eefecc80dceef3c91b031
[15:50:14] (because we have client -> api-gateway -> ores-legacy -> viwiki-reverted)
[15:51:18] sorry the last link is not correct
[15:52:33] yeah they are the same link :D
[15:52:43] https://logstash.wikimedia.org/goto/072094bd179f0148723e9da2ada6cecb
[15:53:29] in here --^ I added a filter for ores.wikimedia.org (in the requested domain panel) and then HTTP 504 (gateway timeout)
[15:53:37] you can see some viwiki batch calls in there
[15:53:48] and most of them come from a single IP
[15:55:09] unsurprisingly, it is a Vietnamese ISP that owns the IP, so probably there is some client there that used to make those calls
[15:55:32] the user_agent is not something that points to a bot (we shouldn't paste potentially private info in here)
[15:56:03] but at least we have an example of a request that triggers the issue
[15:56:15] not sure if some rev-ids are heavy
[15:56:34] or if, in general, we are not really good (at least with reverted) at processing many requests
[15:56:51] aiko: does it make sense?
[15:58:11] yes it makes sense! thanks Luca
[15:59:48] but we should follow up on the alerts every time, and create separate tasks if they are not the ones that we expect
[15:59:58] otherwise some other errors/issues may fall through the cracks
[16:02:53] (PS6) Rockingpenny4: Adds article topic model to ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/1035044 (https://phabricator.wikimedia.org/T218132)
[16:04:33] (CR) CI reject: [V: -1] Adds article topic model to ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/1035044 (https://phabricator.wikimedia.org/T218132) (owner: Rockingpenny4)
[16:09:07] agree
[16:40:11] checked the dashboard for the eswiki alert that fired yesterday. I don't see many records here: https://logstash.wikimedia.org/app/dashboards#/view/138271f0-40ce-11ed-bb3e-0bc9ce387d88?_g=h@bfd3cfa&_a=h@928648d
[19:22:34] (CR) Rockingpenny4: Adds article topic model to ORES (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/1035044 (https://phabricator.wikimedia.org/T218132) (owner: Rockingpenny4)
[20:32:33] (CR) SD0001: Adds article topic model to ORES (2 comments) [extensions/ORES] - https://gerrit.wikimedia.org/r/1035044 (https://phabricator.wikimedia.org/T218132) (owner: Rockingpenny4)
[20:56:18] (CR) Sohom Datta: Adds article topic model to ORES (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/1035044 (https://phabricator.wikimedia.org/T218132) (owner: Rockingpenny4)
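The triage that Luca walks through above (istio gateway logs filtered to ml-serve codfw, response 0, the viwiki-reverted upstream cluster, then a breakdown by user agent) can also be reproduced outside the dashboards. The sketch below shows roughly what such a query might look like against a search backend; the endpoint, index pattern, and every field name are assumptions, not the actual logstash schema, and authentication is omitted.

```python
import requests

# Sketch of the triage described above, assuming direct search access to the
# logging backend. Endpoint, index pattern, and field names are assumptions;
# the real istio/ores-legacy documents may use different fields.
SEARCH_URL = "https://logstash-backend.example/istio-logs-*/_search"  # hypothetical

query = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                # Step 1: istio gateway logs for ml-serve codfw with response
                # code 0 (effectively a client-facing timeout in istio terms)...
                {"term": {"kubernetes_cluster": "ml-serve-codfw"}},
                {"term": {"response_code": 0}},
                # ...narrowed down to the viwiki-reverted upstream cluster.
                {"wildcard": {"upstream_cluster": "*viwiki-reverted*"}},
            ]
        }
    },
    # Step 2: break the matching requests down by user agent, which is how the
    # "ORES Legacy" client shows up in the dashboard breakdown.
    "aggs": {"by_user_agent": {"terms": {"field": "user_agent.keyword", "size": 10}}},
}

resp = requests.post(SEARCH_URL, json=query, timeout=30)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["by_user_agent"]["buckets"]:
    print(bucket["doc_count"], bucket["key"])
```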