[06:19:20] o/ Good morning!
[06:57:05] hello Ilias!
[08:04:23] Machine-Learning-Team: Expand the Lift Wing workers' kubelet partition - https://phabricator.wikimedia.org/T339231 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3c9cc021-58b9-4756-9cf7-4880a033e42a) set by elukey@cumin1001 for 0:30:00 on 1 host(s) and their services with reason: Expand th...
[08:08:00] Machine-Learning-Team: Host the recommendation-api container on LiftWing - https://phabricator.wikimedia.org/T339890 (kevinbazira) After encountering issues with CI fetching files from the Wikimedia public datasets archive (T341582) and CI post-merge build failure due to an internal server error (T342084). F...
[08:16:03] \o morning
[08:16:48] elukey: something occurred to me: the only constant in the latency warnings we see is exec_sync on 1001, the one machine where we expanded the kubelet partition. Maybe we'll also see it in codfw. Or it's just coincidence.
[08:17:19] morning :)
[08:17:22] elukey: did you reboot the machine when you expanded the part on 1001?
[08:17:32] nope
[08:19:13] Alright. I'll keep an eye on the possible correlation
[08:19:35] If it looks like the two are related, I may try a reboot of 1001.
[08:20:43] it is weird since in https://phabricator.wikimedia.org/T343900 I thought I had found the issue
[08:21:04] I don't see ExecSync errors in the kubelet log anymore
[08:21:55] I'll see if I can find something that explains the latency.
[08:22:05] Will update the bug accordingly
[08:22:53] another thing to keep in mind - I expanded the kubelet partition on ml-serve2001, but of course it had to be online
[08:23:09] since even if we stop the kubelet, most of our pods do use its partition to store the models
[08:23:21] so a clean upgrade requires a drain
[08:25:07] aye
[08:26:10] Machine-Learning-Team: Expand the Lift Wing workers' kubelet partition - https://phabricator.wikimedia.org/T339231 (elukey) Just did ml-serve2001, but of course the resize had to be online since most of our pods store data on the kubelet partition (like the model binary). We should drain the nodes first as...
[08:31:17] on ml-serve1001 the kubelet hasn't logged anything since the 9th, but it seems to be running ok
[08:31:35] restarted the kubelet
[08:32:48] The part increase was way before the 9th, right?
[08:34:45] yes yes
[08:34:47] ah yes, Jun 15th. So at least the logging stopping wasn't caused by the part increase
[08:35:09] no no, I deployed an increased limit for the pod mentioned in the task
[08:35:16] since it was causing the execsync log entries etc..
[08:35:27] Wonder if we should monitor the base logging rate of the kubelets (and possibly other services). Either very high or 0 is a likely indication of problems
[08:36:09] _some_ daemons are very quiet beyond startup. But the kubelet is rather chatty in normal operation.
[08:36:11] ok so the latency dropped
[08:36:26] we needed a restart
[08:36:45] So drain, part increase, kubelet restart, back into service?
[08:37:28] drain, kubelet stop, part increase, kubelet start, back into service
[08:37:49] ack.
[08:38:14] but it is low priority, you can focus on all the rest, we don't really need that amount of storage right now
[08:38:19] it is more for LLMs
[08:38:26] aye. just a filler.
[08:40:27] Machine-Learning-Team, sre-alert-triage: Alert triage: overdue alert [warning] Kubelet exec_sync operations on ml-serve1001.eqiad.wmnet take 1.133s in p99 - https://phabricator.wikimedia.org/T343900 (elukey) Open→Resolved a: elukey After the kubelet restart the metric cleared!
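The per-node sequence described above (drain, kubelet stop, partition increase, kubelet start, back into service) would look roughly like the following. This is a minimal sketch, assuming an LVM-backed kubelet partition; the volume name and size are placeholders, not the actual Lift Wing values.

  # Hypothetical per-node procedure; kubectl steps run from a cluster admin host.
  kubectl drain ml-serve1001.eqiad.wmnet --ignore-daemonsets --delete-emptydir-data
  sudo systemctl stop kubelet
  sudo lvextend -r -L +200G /dev/vg0/kubelet   # -r grows the filesystem along with the LV (assumed layout)
  sudo systemctl start kubelet
  kubectl uncordon ml-serve1001.eqiad.wmnet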
[08:49:58] Machine-Learning-Team: Host the recommendation-api container on LiftWing - https://phabricator.wikimedia.org/T339890 (kevinbazira)
[08:50:11] Machine-Learning-Team, Release Pipeline, ci-test-error: Post-merge build failed due to Internal Server Error - https://phabricator.wikimedia.org/T342084 (kevinbazira) Open→Resolved a: kevinbazira
[08:51:20] filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/948088/ to trim some throttling in calico pods
[08:53:57] LGTM
[09:05:20] deployed to codfw, throttling way better for both pods
[09:05:27] doing it also for eqiad
[09:08:49] ooh, massive improvement
[09:08:56] >500ms -> 11ms
[09:09:15] s/500/250/
[09:11:21] Slightly less dramatic in eqiad, but still a factor of 4+.
[09:11:41] ok, nvm, dropping even further :D
[09:14:49] looks good now, it was very high before
[09:15:41] klausman: I was wondering if we could use HTTP caching at the CDN for our isvcs, but I think we can't
[09:15:54] we pass everything via POST, so caching URLs is not really an option
[09:16:09] it is a pity since we lose a big caching layer
[09:16:17] Ah, and I presume the CDN sees all POSTs as mutating and therefore won't cache them?
[09:16:26] Yeah I think so
[09:16:41] I can ask but I am pretty sure the CDN doesn't cache variations in the POST
[09:16:52] I wonder if kserve will ever support GET/URL query requests, but I doubt it.
[09:17:03] me too
[09:17:31] Which _then_ makes me wonder if the API GW or a homebrew service could act as a translator.
[09:17:58] But that is very far-future thinking
[09:18:03] probably yes, but we should have thought about it earlier :)
[09:18:28] in some cases passing POST payloads in the URL may be tricky
[09:18:29] the only thing you can be sure of in these things is that you'll miss _something_ :)
[09:18:56] but this may bite us heavily when we ramp up traffic
[09:18:58] sigh
[09:19:11] anyway, we should fall back to the cassandra score cache
[09:19:43] The upside of doing that is that we're more in control of cache keying and invalidation
[09:20:21] (the downside of "in control" is "more work", but you can't have everything)
[09:31:55] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 11th round of wikis - https://phabricator.wikimedia.org/T308136 (Sgs) >>! In T308136#9084142, @Trizek-WMF wrote: > @Sgs, I have the same results: > * [[ https://lo.wikipedia.org/w/index.php?search=h...
[10:29:47] Machine-Learning-Team: Caching strategies for scores in Lift Wing - https://phabricator.wikimedia.org/T344051 (elukey)
[10:31:14] created some notes for what we discussed --^
[10:33:23] Will give it a read and add comments if I have any after lunch
[10:33:31] thanks!
[10:39:40] (PS2) AikoChou: revert-risk: upgrade knowledge_integrity to v0.3.0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/947875 (https://phabricator.wikimedia.org/T340813)
[10:39:40] Machine-Learning-Team: Caching strategies for scores in Lift Wing - https://phabricator.wikimedia.org/T344051 (elukey) Varnish can cache POST requests: https://docs.varnish-software.com/tutorials/caching-post-requests/ This is probably worth following up on, but it may be a bad/expensive idea for our CDN.
[10:45:22] (CR) AikoChou: [C: +2] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/947875 (https://phabricator.wikimedia.org/T340813) (owner: AikoChou)
[10:46:56] * elukey lunch!
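For context on the CDN discussion above: a typical Lift Wing call puts all of its inputs in a POST body, so there is no URL the edge cache could key on. A minimal sketch of such a request, assuming the usual discovery endpoint and Host-header routing; the model name and rev_id are hypothetical examples, not taken from this conversation.

  # Everything travels in the request body rather than the URL, which is why an
  # HTTP cache would have to inspect and normalize POST bodies to build a key.
  curl -s "https://inference.svc.eqiad.wmnet:30443/v1/models/enwiki-goodfaith:predict" \
    -H "Host: enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org" \
    -H "Content-Type: application/json" \
    -d '{"rev_id": 123456}'

A GET equivalent (rev_id in the query string) would be trivially cacheable, which is what the API GW / translator idea above is getting at.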
[10:50:30] Machine-Learning-Team: Discuss potential migration - https://phabricator.wikimedia.org/T344010 (Aklapper)
[10:50:42] Machine-Learning-Team: Discuss potential migration - https://phabricator.wikimedia.org/T344010 (Aklapper)
[10:50:44] Machine-Learning-Team, Wikipedia-Android-App-Backlog (Android Release - FY2023-24): Migrate Machine-generated Article Descriptions from toolforge to liftwing. - https://phabricator.wikimedia.org/T343123 (Aklapper)
[10:50:57] Machine-Learning-Team: Discuss potential migration - https://phabricator.wikimedia.org/T344010 (Aklapper) [Please semantically connect tasks - thanks!]
[10:51:48] (Merged) jenkins-bot: revert-risk: upgrade knowledge_integrity to v0.3.0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/947875 (https://phabricator.wikimedia.org/T340813) (owner: AikoChou)
[11:33:57] Machine-Learning-Team: Caching strategies for scores in Lift Wing - https://phabricator.wikimedia.org/T344051 (klausman) Upsides of local-ish to LW caching (e.g. Cassandra): we control what is cached and how: - maximum rev age that is cached - which models are cached - how much cache-side space...
[12:28:51] * klausman running an errand, bbiab
[14:07:50] Machine-Learning-Team: fiwiki RC filters classify all edits as 'very likely bad faith' - https://phabricator.wikimedia.org/T343308 (isarantopoulos) @Ladsgroup I agree. IIUC you mean to change the config from this ` 'arwiki' => [ 'damaging' => [ 'likelygood' => [ 'min' => 0, 'max' => 'maximum recall @ p...
[14:09:15] Machine-Learning-Team: fiwiki RC filters classify all edits as 'very likely bad faith' - https://phabricator.wikimedia.org/T343308 (Ladsgroup) Nope, that's exactly what I'm advocating ^_^
[14:10:17] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), Malayalam-Sites, User-notice: Deploy "add a link" to 12th round of wikis - https://phabricator.wikimedia.org/T308137 (Sgs) Status update, as of today all wikis have produced results except for `nawiki`. For more context: - `m...
[14:17:25] Machine-Learning-Team: Tune LiftWing autoscaling settings for Knative - https://phabricator.wikimedia.org/T344058 (elukey)
[14:20:45] folks anybody around for https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/948143 ?
[14:21:05] I have discovered a problem in our config, and I'd like to test this before going on holidays
[14:21:16] (available to explain what I think is happening)
[14:25:27] Machine-Learning-Team, Patch-For-Review: Tune LiftWing autoscaling settings for Knative - https://phabricator.wikimedia.org/T344058 (elukey) > If both a soft and a hard limit are specified, the smaller of the two values will be used. This prevents the Autoscaler from having a target value that is not per...
[14:26:14] (I'll test it now)
[14:30:55] looking
[14:31:20] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), Malayalam-Sites, and 2 others: Deploy "add a link" to 12th round of wikis - https://phabricator.wikimedia.org/T308137 (Sgs)
[14:31:47] LGTM'd
[14:32:04] elukey: are the effects as expected?
[14:33:42] it takes time to judge, IIUC autoscaling.knative.dev/target: X is the soft limit, and it tells the autoscaler when it is time to spin up a new pod
[14:34:09] but there is also https://knative.dev/docs/serving/autoscaling/concurrency/#target-utilization
[14:34:18] that is 70% by default afaics
[14:35:11] for drafttopic we had 3, so it meant that 2 concurrent clients were sufficient to ask for a new pod
[14:35:29] and the knative metrics suggest that we keep spinning up and terminating pods, due to changeprop
[14:35:33] ah, the "warmup" setting, gitcha
[14:35:39] gotcha*
[14:35:47] so raising it to 5 seems good enough, it should flatten out the creation of pods
[14:38:29] ack
[14:38:31] ideally we should have flat lines in https://grafana.wikimedia.org/d/c6GYmqdnz/knative-serving?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-knative_namespace=knative-serving&var-revisions_namespace=All&from=now-6h&to=now&viewPanel=24
[14:39:22] klausman: be aware of the containerConcurrency setting, it shouldn't be picked up now since we specify a lower autoscaling.knative.dev/target
[14:40:07] but IIUC, if it is picked up, it probably forces the activator to buffer requests (in proxy mode)
[14:40:20] what is the default for cC?
[14:40:29] it is not set
[14:48:09] ack
[15:05:08] so far no autoscaling actions, seems to be working
[15:08:07] elukey: iiuc this could be the reason for the issue that we were seeing (deployments scaling up and down all the time) (?)
[15:09:48] isaranto: for drafttopic? I think so yes.. I am watching https://grafana-rw.wikimedia.org/d/c6GYmqdnz/knative-serving?forceLogin&from=now-6h&orgId=1&to=now&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-knative_namespace=knative-serving&var-revisions_namespace=revscoring-drafttopic and it looks way better
[15:11:10] ack
[15:16:22] Machine-Learning-Team: Tune LiftWing autoscaling settings for Knative - https://phabricator.wikimedia.org/T344058 (elukey) Judging from [[ https://grafana-rw.wikimedia.org/d/c6GYmqdnz/knative-serving?forceLogin&from=now-6h&orgId=1&to=now&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-knative_namespace=knat...
[15:25:26] going afk folks!
[15:25:32] talk with you in two weeks!
[15:25:35] o/ o/ o/
[15:30:32] \o enjoy!
[15:31:46] Bye Luca!
[15:32:22] Ciao Luca!! o/
[15:39:16] ciao luca! \o
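To spell out the arithmetic behind the target bump discussed above: the autoscaler multiplies the soft limit (autoscaling.knative.dev/target) by the target utilization (70% by default), so the old target of 3 meant roughly 3 * 0.70 = 2.1 in-flight requests per pod before a scale-up, while the new target of 5 gives about 3.5. A quick read-only way to check which value actually landed on the revisions might look like this; the namespace name is taken from the Grafana link above, and the real change is made through deployment-charts rather than kubectl.

  # Show the soft concurrency target currently set on the drafttopic revisions.
  kubectl -n revscoring-drafttopic get revisions -o yaml | grep 'autoscaling.knative.dev/target'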