[03:36:44] FIRING: LiftWingServiceErrorRate: ...
[03:36:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-reverted&var-backend=viwiki-reverted-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[04:01:44] RESOLVED: LiftWingServiceErrorRate: ...
[04:01:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-reverted&var-backend=viwiki-reverted-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[08:49:54] morning o/
[09:19:11] \o
[09:22:12] Machine-Learning-Team: Patch Location headers of HTTP redirects coming from the MW API in Lift Wing services - https://phabricator.wikimedia.org/T363725#9841160 (achou) a: achou
[09:39:05] Machine-Learning-Team, Language-Team, Epic: Migrate Content Translation Recommendation API to Lift Wing - https://phabricator.wikimedia.org/T308164#9841200 (KCVelaga_WMF) @Pginer-WMF is right. For the past 90 days, here are the numbers from [[ https://gerrit.wikimedia.org/r/plugins/gitiles/schemas/e...
[10:27:10] * klausman lunch
[10:34:57] Machine-Learning-Team, Patch-For-Review: Configure the logo-detection model-server hosted on LiftWing to process images from Wikimedia Commons - https://phabricator.wikimedia.org/T363449#9841433 (kevinbazira) In T363506#9794170, it was agreed that the logo-detection model-server should process base64 enc...
[11:44:35] aiko, kevinbazira I am going to re-image ml-staging2002 to bookworm. As a result, I'll disrupt some of the services, and it's likely some will be unable to schedule with just one node. Lmk if that would disrupt any of your work unduly.
[11:48:49] (I'll start work on the hour, so you still have time to stop me :))
[11:49:09] klausman: o/ sure sure, the structured content team reached out this morning letting me know that they were going to be testing the logo-detection model-server that was deployed on staging. I'll let you know in case they face any disruptions. Other than that, I have no objections.
[11:49:26] Roger!
[11:50:07] ack, not a problem
[12:03:33] (CR) AikoChou: "Hi Kevin, thanks for working on this! I ran the test locally and it worked without any issues. I have a question out of curiosity. The tes" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1035868 (https://phabricator.wikimedia.org/T365554) (owner: Kevin Bazira)
[12:42:58] Host is reinstalled and pods are scheduling again (some are still in PodInit phase)
[12:54:05] And everything's running again
[12:54:22] kevinbazira: there shouldn't be any (further) disruption for the logo-d model
[12:56:27] klausman: yep, the logo-detection service is up and running. thanks!
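For context on the logo-detection testing mentioned above, a client exercising the staging model-server might look roughly like the sketch below. This is a minimal illustration only: the Commons file URL, the staging endpoint, and the "instances"/"b64" payload schema (a common KServe-style convention) are assumptions and may not match what the logo-detection service actually expects.

```python
import base64
import requests

# Hypothetical placeholders -- the real Commons file, staging host, and
# payload schema may differ from what the logo-detection service expects.
COMMONS_FILE_URL = "https://upload.wikimedia.org/example/Some_logo.png"  # placeholder
PREDICT_URL = "https://inference-staging.example/v1/models/logo-detection:predict"  # placeholder

# Fetch the image bytes from Commons and base64-encode them, since the task
# above says the model-server should accept base64-encoded images.
image_bytes = requests.get(COMMONS_FILE_URL, timeout=30).content
payload = {
    "instances": [
        {"image": {"b64": base64.b64encode(image_bytes).decode("utf-8")}}
    ]
}

resp = requests.post(PREDICT_URL, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json())
```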
[12:56:57] I'll have a think about reimaging the prod machines, but that is the opposite of urgent :)
[12:57:39] klausman: o/ please hold off on reimaging the prod hosts, there is still an open question about what to do with the dragonfly packages
[12:57:44] only staging please :)
[12:58:01] nono, I am not doing it this week, this month, or this quarter.
[12:58:34] But I want to file it as a mid-term todo, at least unless we magically replace all the hw before I get to do so.
[13:00:42] Bullseye enters "Deprecate" status in mid-2024; I definitely won't do it before then, but probably in late 2024/early 2025, unless we have a strong reason not to.
[13:00:44] okok
[13:01:14] The "think" bit was about how to do it and what other criteria (beyond "it works in staging") might be relevant
[13:03:23] Oh, and the Supermicro machines, when they arrive, will be Bookworm, just by virtue of having GPUs. Fortunately, there seems to be no issue with running a mixed cluster (also thanks to your work on proving that GPUs need nothing host-side but kernel support)
[13:12:12] (CR) Kevin Bazira: "Thank you for testing this, Aiko. Yes, the test logic was borrowed from what we use in the LW isvcs. I rewrote some bits for a couple of r" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1035868 (https://phabricator.wikimedia.org/T365554) (owner: Kevin Bazira)
[13:18:18] so I think that when the dragonfly packages issue is sorted, we can safely use Bookworm for prod
[13:19:03] Alex will migrate to containerd over the next few weeks IIUC, but it uses runc behind the scenes
[13:19:29] already present in bookworm etc.
[13:20:02] so we should be really good, so far nothing has popped up other than the kubelet partition size
[13:54:15] Aye. And I had meant to do the partman bits ever since we did the manual bump, but then I (of course) forgot, and got reminded with the bookworm bump. Given that there are new machines on the horizon, I thought "now's the time" ;)
[15:09:37] (PS2) AikoChou: revertrisk: modify the response to dict type in batch model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1035012 (https://phabricator.wikimedia.org/T358744)
[15:10:37] (PS3) AikoChou: revertrisk: modify the response to dict type in batch model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1035012 (https://phabricator.wikimedia.org/T358744)
[15:13:02] (PS2) AikoChou: outlink: move test_transformer to unit test directory [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1031393
[15:26:32] folks, qq - has anybody checked the errors for eswiki and viwiki that fired recently? Are they always the same, or something different?
[15:27:03] the viwiki ones seem related to ores-legacy batch calls to Lift Wing, maybe we can do something about it. Not sure if it is all related to heavy rev-ids
[15:27:14] maybe we should open a task for viwiki and investigate
[15:43:23] elukey: I checked the kserve logs for eswiki, but haven't checked viwiki
[15:43:42] It looks like the same issue related to heavy rev-ids
[15:44:29] how do you know it is related to ores-legacy batch calls? From the dashboard?
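For reference, the kind of ores-legacy batch call discussed above (one HTTP request scoring many revisions at once) looks roughly like the sketch below. The revision IDs are invented for illustration; only the ORES v3 URL shape and the pipe-separated revids parameter reflect the public API that ores-legacy serves.

```python
import requests

# Illustrative ORES v3 batch call. The revision IDs below are made up; the URL
# shape and the pipe-separated "revids" parameter follow the public ORES API.
# Each revid in a batch presumably becomes a separate viwiki-reverted
# prediction behind ores-legacy, which is why large batches can end in 504s.
ORES_URL = "https://ores.wikimedia.org/v3/scores/viwiki/"
params = {
    "models": "reverted",
    "revids": "71000001|71000002|71000003",  # hypothetical revision IDs
}
headers = {"User-Agent": "ml-team-example/0.1 (debugging illustration)"}

resp = requests.get(ORES_URL, params=params, headers=headers, timeout=60)
resp.raise_for_status()
print(resp.json())
```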
[15:46:20] exactly yes
[15:46:40] I checked the istio gateway dashboard on logstash: https://logstash.wikimedia.org/goto/746c79239b2eefecc80dceef3c91b031
[15:47:10] I filtered for ml-serve codfw and response 0 (that is basically a timeout for the user in istio's response codes)
[15:47:37] then you can also filter for viwiki-reverted in the upstream cluster breakdown
[15:48:25] if you select the time window with most of the errors, you'll see that the user_agent is often ORES Legacy
[15:49:15] but then if you want more info about the actual ores calls, you need to backtrack a little and do the same for ores-legacy
[15:49:50] and you end up with something like https://logstash.wikimedia.org/goto/746c79239b2eefecc80dceef3c91b031
[15:50:14] (because we have client -> api-gateway -> ores-legacy -> viwiki-reverted)
[15:51:18] sorry the last link is not correct
[15:52:33] yeah they are the same link :D
[15:52:43] https://logstash.wikimedia.org/goto/072094bd179f0148723e9da2ada6cecb
[15:53:29] in here --^ I added a filter for ores.wikimedia.org (in the requested domain panel) and then HTTP 504 (gateway timeout)
[15:53:37] you can see some viwiki batch calls in there
[15:53:48] and most of them come from a single IP
[15:55:09] unsurprisingly, it is a Vietnamese ISP that owns the IP, so probably there is some client there that used to make those calls
[15:55:32] the user_agent is not something that points to a bot (we shouldn't paste potentially private info in here)
[15:56:03] but at least we have an example of a request that triggers the issue
[15:56:15] not sure if some rev-ids are heavy
[15:56:34] or if, in general, we are not really good (at least with reverted) at processing many requests
[15:56:51] aiko: does it make sense?
[15:58:11] yes it makes sense! thanks Luca
[15:59:48] but we should follow up on the alerts every time, and create separate tasks if they are not the ones that we expect
[15:59:58] otherwise some other errors/issues may fall through the cracks
[16:02:53] (PS6) Rockingpenny4: Adds article topic model to ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/1035044 (https://phabricator.wikimedia.org/T218132)
[16:04:33] (CR) CI reject: [V: -1] Adds article topic model to ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/1035044 (https://phabricator.wikimedia.org/T218132) (owner: Rockingpenny4)
[16:09:07] agree
[16:40:11] checked the dashboard for the eswiki alert that fired yesterday. I don't see many records here: https://logstash.wikimedia.org/app/dashboards#/view/138271f0-40ce-11ed-bb3e-0bc9ce387d88?_g=h@bfd3cfa&_a=h@928648d
[19:22:34] (CR) Rockingpenny4: Adds article topic model to ORES (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/1035044 (https://phabricator.wikimedia.org/T218132) (owner: Rockingpenny4)
[20:32:33] (CR) SD0001: Adds article topic model to ORES (2 comments) [extensions/ORES] - https://gerrit.wikimedia.org/r/1035044 (https://phabricator.wikimedia.org/T218132) (owner: Rockingpenny4)
[20:56:18] (CR) Sohom Datta: Adds article topic model to ORES (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/1035044 (https://phabricator.wikimedia.org/T218132) (owner: Rockingpenny4)
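The triage that Luca walks through above (istio gateway logs filtered to ml-serve codfw, response 0, the viwiki-reverted upstream cluster, then a breakdown by user agent) can also be reproduced outside the dashboards. The sketch below shows roughly what such a query might look like against a search backend; the endpoint, index pattern, and every field name are assumptions, not the actual logstash schema, and authentication is omitted.

```python
import requests

# Sketch of the triage described above, assuming direct search access to the
# logging backend. Endpoint, index pattern, and field names are assumptions;
# the real istio/ores-legacy documents may use different fields.
SEARCH_URL = "https://logstash-backend.example/istio-logs-*/_search"  # hypothetical

query = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                # Step 1: istio gateway logs for ml-serve codfw with response
                # code 0 (effectively a client-facing timeout in istio terms)...
                {"term": {"kubernetes_cluster": "ml-serve-codfw"}},
                {"term": {"response_code": 0}},
                # ...narrowed down to the viwiki-reverted upstream cluster.
                {"wildcard": {"upstream_cluster": "*viwiki-reverted*"}},
            ]
        }
    },
    # Step 2: break the matching requests down by user agent, which is how the
    # "ORES Legacy" client shows up in the dashboard breakdown.
    "aggs": {"by_user_agent": {"terms": {"field": "user_agent.keyword", "size": 10}}},
}

resp = requests.post(SEARCH_URL, json=query, timeout=30)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["by_user_agent"]["buckets"]:
    print(bucket["doc_count"], bucket["key"])
```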