[06:57:01] Machine-Learning-Team: Solve revscoring models hanging isvcs for big revision sizes - https://phabricator.wikimedia.org/T366772 (isarantopoulos) NEW
[06:59:42] Machine-Learning-Team: Apply multi-processing to preprocess() in isvcs that suffer from high latency - https://phabricator.wikimedia.org/T349274#9866326 (isarantopoulos)
[06:59:43] Machine-Learning-Team: Solve revscoring models hanging isvcs for big revision sizes - https://phabricator.wikimedia.org/T366772#9866327 (isarantopoulos)
[07:00:13] Machine-Learning-Team, Goal: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services - https://phabricator.wikimedia.org/T362674#9866328 (isarantopoulos)
[07:00:13] Machine-Learning-Team: Solve revscoring models hanging isvcs for big revision sizes - https://phabricator.wikimedia.org/T366772#9866329 (isarantopoulos)
[07:04:18] Good morning o/
[08:27:42] Machine-Learning-Team: Solve revscoring models hanging isvcs for big revision sizes - https://phabricator.wikimedia.org/T366772#9866466 (isarantopoulos)
[08:28:03] Machine-Learning-Team: Solve revscoring models increased latencies for big revision sizes - https://phabricator.wikimedia.org/T366772#9866468 (isarantopoulos)
[09:30:05] morning!!!
[09:46:56] morning aiko!
[10:10:47] Machine-Learning-Team: Apply multi-processing to preprocess() in isvcs that suffer from high latency - https://phabricator.wikimedia.org/T349274#9866899 (isarantopoulos)
[10:10:48] Machine-Learning-Team: Solve revscoring models increased latencies for big revision sizes - https://phabricator.wikimedia.org/T366772#9866900 (isarantopoulos)
[10:11:25] Machine-Learning-Team: Apply multi-processing to preprocess() in isvcs that suffer from high latency - https://phabricator.wikimedia.org/T349274#9866902 (isarantopoulos)
[10:11:26] Machine-Learning-Team: Solve revscoring models increased latencies for big revision sizes - https://phabricator.wikimedia.org/T366772#9866903 (isarantopoulos)
[10:46:42] I hope this will work https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1039651
[10:46:49] so that ores-legacy uses liftwing staging
[10:56:14] lunch!
[10:58:58] o/ I remember there is a dashboard for knative autoscaling, where we can view changes in the number of revisions. does anyone have a link?
[11:04:31] \o
[11:05:18] I'll reboot the eqiad worker nodes for the microcode updates in a few moments. Like with codfw, there should be minimal disruption (beyond the GPU stuff, which we don't use in prod yet)
[11:08:22] aiko I'm not sure I remember which one you're referring to. Could this be it? https://grafana.wikimedia.org/d/Rvs1p4K7k/kserve?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-kubernetes_namespace_controller=kserve&var-kubernetes_namespace_queue_proxy=revertrisk&var-app=All
[11:09:01] you can see the number of requests for each revision
[11:19:52] Machine-Learning-Team: Run load tests for the rec-api-ng and update production resources to meet expected load - https://phabricator.wikimedia.org/T365554#9867039 (kevinbazira) I ran load tests for the rec-api-ng hosted on LiftWing using the locust configurations set in the [[ https://github.com/wikimedia/re...
[11:58:25] isaranto: no, but I found it!
:D https://grafana.wikimedia.org/d/c6GYmqdnz/knative-serving?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-knative_namespace=knative-serving&var-revisions_namespace=All&from=now-24h&to=now
[11:59:07] ah ok!
[12:48:49] can someone review this so I can try this out? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1039651
[12:49:03] I think it should work
[12:49:29] +1!
[12:51:58] sorry missed that before
[12:53:44] no prob, thank u!
[13:16:48] also sorry, I was staring at the roll-reboot (which is now done, successfully)
[13:23:51] no need to apologize, I re-requested it when I needed it!
[13:23:55] <3
[13:30:21] ok, it works as expected. So keep in mind that ores-legacy in staging is using Lift Wing staging now
[13:30:37] Roger!
[13:30:49] also: very late lunch for me
[13:31:11] I think we should have done this from the beginning. We may just get some failing requests since some of the services won't exist, but it is better for testing so that we don't interfere with prod traffic
[13:42:01] isaranto: one thing that may be good to pursue - with the current config for ores-legacy in staging we don't use the local tls proxy, so testing a new version may be different than prod. One way to keep it consistent could be to add liftwing staging in the envoy proxy config (via puppet), deploy it, and use a localhost:port address
[13:42:49] low priority, but it could be a small task for somebody that wants to test the workflow
[13:42:54] elukey: ack.
I'll open a task about it
[13:46:41] or I can try to do it now :)
[13:54:30] Machine-Learning-Team: Use local tls proxy for Lift Wing staging (inference-staging) - https://phabricator.wikimedia.org/T366801 (isarantopoulos) NEW
[14:00:47] isaranto: okok, so IIRC the procedure is to change envoy.yaml in puppet (look for "inference" in the repo), choosing a port for staging (basically creating a new entry)
[14:01:22] then serviceops needs to +1, and after the merge you should be able to add a new "discovery" option in ores-legacy's values.yaml in deployment-charts
[14:01:33] and then use the new localhost:port combination
[14:10:30] ack! this is what I saw as well
[15:00:56] I made an attempt https://gerrit.wikimedia.org/r/c/operations/puppet/+/1039741
[15:01:30] iiuc I don't need an entry in profile::services_proxy::envoy::enabled_listeners: as that is only for MW installations
[15:04:24] yep yep, reviewed the patch
[15:08:23] grazie!
[15:15:34] elukey: would I need to provide an entry for upstream: inference-staging.svc.codfw.wmnet?
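[editor's note] The two-step setup described above (a new non-discovery entry in puppet's envoy.yaml, then pointing ores-legacy's values.yaml at the local proxy) might look roughly like the sketch below. Every key name and value here is an assumption for illustration; the actual puppet hiera structure and chart schema may differ.

```yaml
# Hypothetical sketch only - real key names in puppet's services_proxy
# config and in the ores-legacy chart may differ.

# 1) puppet envoy.yaml: new non-discovery listener (codfw ports are 62xx),
#    placed next to the other *.svc.codfw.wmnet entries.
inference-staging:
  port: 6205                                   # next free codfw 62xx port
  timeout: "60s"
  upstream: inference-staging.svc.codfw.wmnet  # TLS-terminated upstream

# 2) deployment-charts, ores-legacy values.yaml: talk to the local envoy
#    proxy instead of hitting the staging endpoint directly.
inference:
  liftwing_url: "http://localhost:6205"        # hypothetical key name
```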
[15:16:02] I think so yes
[15:20:29] isaranto: afaics the config should go after search.svc.codfw.wmnet, at line 274
[15:21:07] IIUC you can pick port 6203
[15:22:06] ok, thanks I figured it was the same
[15:22:57] I am saying that because I saw
[15:22:59] # Non-discovery records # Eqiad ports are at 61xx # Codfw ports are at 62xx
[15:23:28] in theory 6231 is ok but it would leave a "hole" afaics and people may get confused
[15:23:49] yes, initially I put 6231 (to have the same suffix as prod) but I kept the same pattern, so I now put the next available for codfw, 6205 (6203 is already taken)
[15:25:01] thank you for reviewing
[15:25:11] perfect, +1ed, we can wait for serviceops and then I'll merge
[15:37:44] I'm making an attempt to enable mp for eswiki-damaging and viwiki-reverted https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1039776
[15:38:22] \o/
[15:38:36] If you folks agree I plan to deploy this sometime tomorrow so that it would help fight any alerts over the weekend
[15:39:13] nice thought Luca, it seems to work as expected
[15:39:29] the only thing that I am wondering is the maxReplicas, since it is 4 for eswiki and 8 for viwiki (with 3 CPUs each it is 4x3 and 8x3)
[15:40:11] isaranto: does it let other requests flow better?
Namely, I expected that the ores-legacy call with a heavy revid is slow like before, but not the other concurrent requests
[15:40:12] I don't think we even need the 8 maxReplicas since it is a cpu thing
[15:41:09] yes, the slow requests ofc remain slow but the other requests flow as expected
[15:41:10] https://phabricator.wikimedia.org/T363336#9867412
[15:41:11] maybe we could set it to say 3/4
[15:41:16] \o/
[15:42:32] \o/
[15:42:34] going afk folks, cu tomorrow o/
[15:54:04] \o/
[15:57:13] now this alleviates the issue, but we still need to understand what to do in general :(
[16:32:54] yes we discussed it yesterday in the meeting, we're going to do more profiling and maybe look at the option of cutting down the big revisions
[16:35:26] ah okok!
[16:35:38] there is also one option for mwparserfromhell that seems to cut down time
[16:35:42] did you see it?
[16:35:54] could be an easy win
[17:03:47] yes we'll try that as well
[23:26:44] FIRING: LiftWingServiceErrorRate: ...
[23:26:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=eswiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[23:31:44] RESOLVED: LiftWingServiceErrorRate: ...
[23:31:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=eswiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate