[06:58:03] 10Machine-Learning-Team, 10Add-Link, 10Growth-Team, 10User-notice: Deploy "add a link" to 7th round of wikis - https://phabricator.wikimedia.org/T304551 (10kevinbazira) @kostajh, we published datasets for all 12/15 models in this round that passed the evaluation.
[07:02:41] 10Machine-Learning-Team, 10Add-Link, 10Growth-Team, 10User-notice: Deploy "add a link" to 7th round of wikis - https://phabricator.wikimedia.org/T304551 (10kevinbazira)
[08:09:05] good morning :)
[08:36:49] 10Machine-Learning-Team, 10Patch-For-Review: Test revscoring model servers on Lift Wing - https://phabricator.wikimedia.org/T323624 (10elukey) @isarantopoulos nice test! Do we see CPU throttling when the test runs? For example, I don't see anything in https://grafana-rw.wikimedia.org/d/-D2KNUEGk/kubernetes-pod...
[09:20:09] 10Machine-Learning-Team, 10Patch-For-Review: Test revscoring model servers on Lift Wing - https://phabricator.wikimedia.org/T323624 (10isarantopoulos) @elukey you are right. I put it as boolean, but `true` in yaml is translated to `True` in python and the comparison is actually comparing strings so `True=="Tru...
[09:40:14] isaranto: new pods up!
[09:41:11] U rock, thanks!
[09:43:24] There should be explicit logging for the process pool
[09:43:35] (if it is enabled)
[09:45:44] elukey: I do see it `Create a process pool of 5 workers to support model scoring blocking code`. however it states
[09:45:44] ```
[09:45:44] [I 221129 09:40:04 model_server:125] Will fork 1 workers
[09:45:44] [I 221129 09:40:04 model_server:128] Setting max asyncio worker threads as 9
[09:45:44] ```
[09:46:40] isaranto: ah yes yes, lemme explain - the asyncio worker threads are created by kserve, and asyncio does the same if not instructed elsewhere (with a different number of workers).
[09:47:25] basically asyncio offers by default some threads that you can use to offload "blocking" IO calls (like an HTTP call made via requests)
[09:48:03] IIUC this helps avoid blocking the main loop thread, and in theory it should allow asyncio to poll the thread pool in a more async way
[09:48:18] https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.loop.run_in_executor has some info as well
[09:49:01] I am not 100% clear about the details, but the thread pool has the problem of not being able to cope with cpu-bound code (since it stalls everything until the cpu computation is finished)
[09:49:58] meanwhile the process pool is handled differently - new python processes are created, and the asyncio lib behind the scenes uses pickle to serialize/deserialize functions+parameters to execute on other processes
[09:55:34] thanks for clarifying. I will rerun the same test now
[10:17:47] klausman: o/
[10:17:57] \o
[10:18:11] I found something interesting today, while checking the alarms for k8s api latencies (they are for all clusters)
[10:18:24] at some point in the knative controller logs (and in others as well, all knative related) I see
[10:18:27] dial tcp: lookup kubernetes.default.svc.cluster.local: Temporary failure in name resolution
[10:18:45] and the timings align with the increase in 504s registered
[10:18:47] :fry squint: DNS. Again?
[10:19:01] I restarted the kube-api servers in eqiad and everything cleared out
[10:19:09] nah I think it is a knative bug
[10:19:45] Hmm. So this DNS resolution is happening against the k8s DNS resolvers, right?
[10:20:02] in theory yes
[10:20:10] or do you think it's a knative-internal resolver that falls over somehow?
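A minimal sketch of the boolean pitfall isarantopoulos describes at 09:20 above: a `true` in a YAML values file tends to reach the Python model server as the string "True" (for example via an environment variable), so naive comparisons and truthiness checks misfire. The variable name and parsing helper below are hypothetical, not the actual Lift Wing configuration keys.

```python
import os

# Hypothetical flag name; in a values.yaml it would be written as
#   use_process_pool: true
# but by the time it reaches the container it is usually the *string* "True".
raw = os.environ.get("ASYNCIO_USE_PROCESS_POOL", "false")

# Buggy: a string is never equal to the Python literal True.
buggy_comparison = raw == True  # always False for a string value

# Also buggy: any non-empty string (including "false") is truthy.
buggy_truthiness = bool(raw)

# Safer: parse the string explicitly.
use_process_pool = str(raw).strip().lower() in ("1", "true", "yes")

print(buggy_comparison, buggy_truthiness, use_process_pool)
```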
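And a self-contained sketch of the thread pool vs. process pool behaviour elukey walks through between 09:46 and 09:49: `loop.run_in_executor` with the default thread pool keeps blocking I/O off the event loop but cannot parallelise CPU-bound scoring, while a `ProcessPoolExecutor` runs the work in separate Python processes, with the callable and its arguments pickled across the process boundary. The function name, timings and pool size are illustrative only, not the kserve implementation.

```python
import asyncio
import time
from concurrent.futures import ProcessPoolExecutor


def score(rev_id: int) -> float:
    """Stand-in for CPU-bound model scoring (hypothetical)."""
    start = time.time()
    while time.time() - start < 0.5:  # burn CPU for ~0.5s
        pass
    return float(rev_id)


async def main() -> None:
    loop = asyncio.get_running_loop()

    # Default executor: a thread pool. Fine for blocking I/O (e.g. a requests
    # call), but CPU-bound work like the loop above still serialises on the GIL.
    await loop.run_in_executor(None, score, 12345)

    # Process pool: the callable and its arguments are pickled and executed in
    # separate Python processes, so CPU-bound scoring can run in parallel
    # without stalling the event loop.
    with ProcessPoolExecutor(max_workers=5) as pool:
        results = await asyncio.gather(
            *(loop.run_in_executor(pool, score, rid) for rid in range(5))
        )
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```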
[10:21:38] nono it shouldn't have an internal resolver
[10:21:42] https://grafana-rw.wikimedia.org/d/-sq5te5Wk/kubernetes-dns?orgId=1&var-dc=codfw%20prometheus%2Fk8s-mlserve&from=now-2d&to=now looks weird
[10:21:49] it should be when Ilias deployed
[10:22:45] yesterday evening?
[10:23:01] Or this morning?
[10:23:37]
[10:23:38] https://sal.toolforge.org/production?p=0&q=isaranto&d=
[10:23:56] weird it was way before
[10:24:08] is SAL in UTC?
[10:24:28] yes
[10:24:58] well, you linked the prod cluster, not staging.
[10:25:15] and yesterday's pushes in SAL are for staging
[10:25:31] klausman: yes in fact I mentioned "weird it was way before" -> namely on the 24th
[10:25:42] so it doesn't match
[10:25:49] aaaah
[10:26:06] well, the staging curves also don't quite line up with pushes, for that matter
[10:26:42] after restarting kube-api on ml-serve-codfw the latencies recovered as well
[10:27:01] only staging is left
[10:27:10] maybe we can use it to see if we can find anything useful
[10:28:16] I need to go out for an errand, definitely weird
[10:28:23] I didn't expect that drop in DNS queries
[10:28:24] mmmm
[10:28:29] In the 7d graphs of latencies for serve-codfw, do you see an increase over the day yesterday? Or am I imagining things?
[10:28:48] kube api latencies? Can you link the graph?
[10:29:00] https://grafana.wikimedia.org/goto/eyU7SMK4k?orgId=1
[10:29:59] Before midnight on the 29th, it's a lot more volatile, but with lower minimum latencies. Afterwards, it's more steady but higher overall
[10:30:24] yep
[10:30:27] no idea
[10:30:46] * elukey afk for a bit
[11:21:53] back!
[11:27:19] o/
[11:36:16] klausman: ok a 30d view looks less weird https://grafana.wikimedia.org/d/-sq5te5Wk/kubernetes-dns?from=now-30d&orgId=1&to=now&var-dc=codfw%20prometheus%2Fk8s-mlserve
[11:36:39] I have no idea why we have that variation of dns requests, but the drop is not something completely unexpected
[11:37:18] Latency is pretty consistent over that timeframe as well
[11:37:33] But yeah, the # of requests is weird
[11:37:50] we'll see, let's keep an eye on it
[11:38:00] my bet is that the knative controllers are full of bugs
[11:38:07] 0.18.1 was released 2y ago
[11:41:49] 10Machine-Learning-Team, 10Patch-For-Review: Test revscoring model servers on Lift Wing - https://phabricator.wikimedia.org/T323624 (10isarantopoulos) Re-ran the test and edited the previous message. Much better results, and it seems that latency doesn't increase over time as it happens in the non MP version....
[11:44:47] isaranto: \o/
[11:45:10] elukey: /o
[11:45:50] heading out for lunch!
[11:46:04] same here
[12:08:26] * klausman as well
[13:19:44] aiko: o/
[13:20:00] about the 1000 request test - was it done pyspark -> Lift Wing -> MW API?
[13:20:21] I can try to inspect logs if you re-run it, so we can see if anything weird pops up
[13:34:09] elukey: yeah it was pyspark -> Lift Wing -> MW API
[13:34:35] ok I'm gonna rerun it
[13:38:47] klausman: I was reading https://knative.dev/docs/serving/services/custom-domains, that seems interesting
[13:39:13] IIUC with new knative versions we could avoid the long isvc-name.namespace.wikimedia.org
[13:39:21] in theory, omitting the namespace etc..
[13:39:44] Hmm, interesting. That would make the API GW rewrites a little simpler, too
[13:39:52] knative ships one extra controller and a webhook for these things, seems a little overkill but
[13:40:38] In an ideal world, the API GW would auto-discover the services on LW and add relevant routing info.
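A rough sketch of what the shorter hostnames discussed at 13:38-13:40 could look like after a knative upgrade: the custom-domains page linked above describes a DomainMapping resource (served by the extra controller and webhook mentioned there) that maps an arbitrary hostname onto an existing Knative Service, so the namespace no longer has to appear in the URL. The apiVersion differs between knative releases and the hostname, namespace and service names below are purely illustrative, not the actual Lift Wing objects.

```yaml
# Hypothetical example only; check the apiVersion shipped with the target
# knative-serving release (v1alpha1 or v1beta1 depending on the version).
apiVersion: serving.knative.dev/v1beta1
kind: DomainMapping
metadata:
  # The short external hostname we would like to expose.
  name: enwiki-goodfaith.wikimedia.org
  namespace: revscoring-editquality-goodfaith
spec:
  ref:
    # The existing Knative Service backing the InferenceService.
    name: enwiki-goodfaith-predictor-default
    kind: Service
    apiVersion: serving.knative.dev/v1
```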
[13:40:48] An SRE can dream :)
[13:41:04] yeah sure :D
[13:41:25] let's keep the current config for api-gw but keep in mind that after the upgrade we could in theory simplify
[13:41:31] Ack.
[13:42:05] I'm also trying to figure out if we could make the config a bit more DRY-y, but it's not at the forefront of the effort
[13:43:15] elukey: I just re-ran it and it was hitting the inference-staging endpoint
[13:43:49] elukey: 44 missing responses this time
[13:50:51] aiko: what isvc?
[13:51:44] enwiki-goodfaith ok
[13:51:52] revert-risk-model
[13:52:21] experimental namespace
[13:52:29] ah snap I got sidetracked by a previous test attempt sorry (just opened the istio dashboard)
[13:53:40] Morning all!
[13:55:02] o/
[13:55:10] aiko: checked the istio-proxy logs, all 200s afaics
[13:57:50] elukey: yeah that's the weird part.. mwapi returns 200 but the response was something like {'batchcomplete': True, 'query': {'pages': [{'pageid': 1102369, 'missing': True}]}} if you check kserve logs
[13:59:20] ok so in the last batch I can see two queries
[13:59:21] /w/api.php?action=query&formatversion=2&prop=revisions&pageids=60790710&rvlimit=1&rvdir=newer&rvslots=main&format=json
[13:59:31] "/w/api.php?action=query&formatversion=2&prop=revisions&pageids=1102369&rvlimit=1&rvdir=newer&rvslots=main&format=json"
[13:59:36] are those right?
[13:59:40] as we expect them I mean
[14:01:11] aiko: --^
[14:02:05] yeah that's right
[14:03:14] two rev_ids (es) 142965340 -> pageid is 1102369, and (en) 1099209076 -> pageid is 60790710
[14:04:27] not only pageid, we would also query their parent rev_id
[14:05:27] aiko: what about rvlimit etc..?
[14:05:34] is mwapi setting different ones?
[14:06:23] just trying to find differences
[14:06:54] the query was written here https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/blob/main/knowledge_integrity/revision.py#L208
[14:07:18] not sure the purpose of rvlimit
[14:08:05] aiko: and when you tested pyspark -> mwapi did you add the same parameters?
[14:08:20] a bit (or a lot) off-topic: What is the retention on kafka topics? and has anyone figured out how we could do checkpointing with benthos? or even start from a specific offset
[14:09:12] isaranto: IIRC we have size based + time based retention (so if the size exceeds we drop automatically, otherwise we keep it for a week I think, lemme check)
[14:09:13] elukey: yeah I used the same parameters
[14:09:47] elukey: cool, thanks! I don't need anything specific, was just curious
[14:11:04] isaranto: 168 hours :)
[14:11:27] for benthos, I believe that if you start it as a consumer group it stops/starts from the same offset (the last committed)
[14:11:46] ack
[14:11:49] checkpointing is probably something more advanced, I heard it for Flink but not benthos
[14:16:46] aiko: (I keep asking trivial questions please stop me if I say something silly)
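For reference, the revisions query that knowledge_integrity builds (and that is reproduced with curl further down in the log) can be replayed outside pyspark with a few lines of requests; this is only a debugging aid under the parameters visible in the log, not the knowledge_integrity code itself. `rvdir=newer` asks for revisions oldest-first, so together with `rvlimit=1` the call returns only the page's first revision. The User-Agent string is a placeholder.

```python
import requests

# Placeholder UA; any client hitting the MW API should identify itself.
HEADERS = {"User-Agent": "liftwing-debug/0.0 (ml@example.org)"}


def first_revision(host: str, pageid: int) -> dict:
    """Replay the revisions query seen in the kserve/istio logs.

    rvdir=newer lists revisions oldest-first, so rvlimit=1 returns only the
    page's first revision; rvslots=main restricts the output to the main
    content slot.
    """
    params = {
        "action": "query",
        "formatversion": 2,
        "prop": "revisions",
        "pageids": pageid,
        "rvlimit": 1,
        "rvdir": "newer",
        "rvslots": "main",
        "format": "json",
    }
    resp = requests.get(f"{host}/w/api.php", params=params,
                        headers=HEADERS, timeout=10)
    resp.raise_for_status()
    return resp.json()


print(first_revision("https://es.wikipedia.org", 1102369))
```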
[14:16:56] I wonder if they have anything special
[14:17:14] but yeah I cannot repro it via curl either
[14:20:10] There are no trivial questions, only questions
[14:21:02] chrisalbon: I am struggling to keep up with Aiko's speed so I come up with excuses :D
[14:24:22] aiko: one weird thing - If I query the following I get only missing: true
[14:24:23] https://api-ro.discovery.wmnet/w/api.php?action=query&formatversion=2&prop=revisions&pageids=1102369&rvlimit=1&rvdir=newer&rvslots=main&format=json
[14:24:34] and this is one of the two pageid queries
[14:24:46] from stat1004
[14:25:27] even https://en.wikipedia.org/w/api.php?action=query&formatversion=2&prop=revisions&pageids=1102369&rvlimit=1&rvdir=newer&rvslots=main&format=json
[14:26:57] yeah https://en.wikipedia.org/w/index.php?curid=1102369
[14:32:39] elukey: no no they're not trivial questions
[14:33:27] elukey: ah pageid 1102369 is from eswiki, you were querying enwiki
[14:34:32] https://es.wikipedia.org/w/index.php?curid=1102369
[14:35:50] ahh
[14:36:03] 10Machine-Learning-Team, 10Patch-For-Review: Test revscoring model servers on Lift Wing - https://phabricator.wikimedia.org/T323624 (10isarantopoulos) And various tests with wrk: 1 minute - 1 connection - 1 thread ` isaranto@deploy1002:~/scripts$ wrk -c 1 -t 1 --timeout 2s -s inference.lua https://inference-sta...
[14:39:57] elukey: the weird thing is out of 1000 requests, 44 responses missing. so there were still 956 requests fetching data correctly for these two pages.
[14:40:49] aiko: the even weirder thing is that via curl I don't see any glitch even with tons of requests
[14:41:05] and the response is too mw-api specific, it is not raised by istio etc..
[14:43:08] aiko: and you also wrote a script in python using the knowledge integrity package, calling get_page say 500 times?
[14:43:08] elukey: yeah it seems to only happen with pyspark
[14:43:44] pyspark -> lift wing
[14:52:01] elukey: yep I've tried that and that's no problem
[16:19:13] isaranto: o/ we are working on Flink stuff as part of Event Platform : https://www.mediawiki.org/wiki/Platform_Engineering_Team/Event_Platform_Value_Stream if you are interested we'd love to talk sometime :)
[16:33:55] ottomata: /o nice to meet u :) the team has told me about the ongoing effort with Flink, sounds super cool. we could def have a chat (even to get to know each other)
[16:36:13] 10Machine-Learning-Team, 10serviceops, 10Patch-For-Review: Fix calico, cfssl-issuer and knative-serving Helm dependencies - https://phabricator.wikimedia.org/T303279 (10JMeybohm) When deploying the updated calico chart to my test cluster I realized that the fixed dependency to the CRD chart does not mean tha...
[16:44:01] isaranto: cool! I put a short meeting on our calendars for Monday, feel free to move it around :)
[16:44:47] ottomata: looks good, cu then!
[17:25:02] 10Machine-Learning-Team, 10serviceops, 10Patch-For-Review: Fix calico, cfssl-issuer and knative-serving Helm dependencies - https://phabricator.wikimedia.org/T303279 (10JMeybohm) CC @BTullis & @bking: This might be relevant for operators as well.
[17:33:19] * elukey afk!
[17:33:22] o/
[17:35:28] 10Machine-Learning-Team, 10serviceops, 10Patch-For-Review: Fix calico, cfssl-issuer and knative-serving Helm dependencies - https://phabricator.wikimedia.org/T303279 (10BTullis) Thanks @JMeybohm - I'll definitely bear that in mind. From my work so far with the spark-operator it seems that the operator //its...
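One takeaway from the exchange at 14:24-14:35 above: a pageid is only meaningful on the wiki it came from, which is why the curl reproduction against enwiki (and against api-ro without the eswiki host) returned `missing: True` for pageid 1102369 even though the page exists on eswiki. A tiny sketch of that mix-up follows; it explains the failed reproduction only, not the 44 missing responses, which remain unexplained in this log. The User-Agent string is again a placeholder.

```python
import requests

PARAMS = {
    "action": "query", "formatversion": 2, "prop": "revisions",
    "pageids": 1102369, "rvlimit": 1, "rvdir": "newer",
    "rvslots": "main", "format": "json",
}
UA = {"User-Agent": "liftwing-debug/0.0 (ml@example.org)"}  # placeholder

for host in ("https://en.wikipedia.org", "https://es.wikipedia.org"):
    page = requests.get(f"{host}/w/api.php", params=PARAMS,
                        headers=UA, timeout=10).json()["query"]["pages"][0]
    # On enwiki this prints missing=True; on eswiki it returns a revision,
    # because pageid 1102369 only identifies a page within eswiki.
    print(host, "missing =", page.get("missing", False))
```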