[06:58:03] 10Machine-Learning-Team, 10Add-Link, 10Growth-Team, 10User-notice: Deploy "add a link" to 7th round of wikis - https://phabricator.wikimedia.org/T304551 (10kevinbazira) @kostajh, we published datasets for all 12/15 models in this round that passed the evaluation.
[07:02:41] 10Machine-Learning-Team, 10Add-Link, 10Growth-Team, 10User-notice: Deploy "add a link" to 7th round of wikis - https://phabricator.wikimedia.org/T304551 (10kevinbazira)
[08:09:05] good morning :)
[08:36:49] 10Machine-Learning-Team, 10Patch-For-Review: Test revscoring model servers on Lift Wing - https://phabricator.wikimedia.org/T323624 (10elukey) @isarantopoulos nice test! Do we see CPU throttling when the test runs? For example, I don't see anything in https://grafana-rw.wikimedia.org/d/-D2KNUEGk/kubernetes-pod...
[09:20:09] 10Machine-Learning-Team, 10Patch-For-Review: Test revscoring model servers on Lift Wing - https://phabricator.wikimedia.org/T323624 (10isarantopoulos) @elukey you are right. I put it as boolean, but `true` in yaml is translated to `True` in python and the comparison is actually comparing strings so `True=="Tru...
[09:40:14] isaranto: new pods up!
[09:41:11] U rock, thanks!
[09:43:24] There should be explicit logging for the process pool
[09:43:35] (if it is enabled)
[09:45:44] elukey: I do see it `Create a process pool of 5 workers to support model scoring blocking code`. however it states
[09:45:44] ```
[09:45:44] [I 221129 09:40:04 model_server:125] Will fork 1 workers
[09:45:44] [I 221129 09:40:04 model_server:128] Setting max asyncio worker threads as 9
[09:45:44] ```
[09:46:40] isaranto: ah yes yes, lemme explain - the asyncio worker threads are created by kserve, and asyncio does the same if not instructed elsewhere (with a different number of workers).
[09:47:25] basically asyncio offers by default some threads that you can use to offload "blocking" IO calls (like an HTTP call made via requests)
[09:48:03] IIUC this helps avoid blocking the main loop thread, and in theory it should allow asyncio to poll the thread pool in a more async way
[09:48:18] https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.loop.run_in_executor has some info as well
[09:49:01] I am not 100% clear about the details, but the thread pool has the problem of not being able to cope with cpu-bound code (since it stalls everything until the cpu computation is finished)
[09:49:58] meanwhile the process pool is handled differently - new python processes are created, and the asyncio lib behind the scenes uses pickle to serialize/deserialize functions+parameters to execute on other processes
[09:55:34] thanks for clarifying. I will rerun the same test now
[10:17:47] klausman: o/
[10:17:57] \o
[10:18:11] I found something interesting today, while checking the alarms for k8s api latencies (they are for all clusters)
[10:18:24] at some point in the knative controller logs (and in others as well, all knative related) I see
[10:18:27] dial tcp: lookup kubernetes.default.svc.cluster.local: Temporary failure in name resolution
[10:18:45] and the timings align with the increase in 504s registered
[10:18:47] :fry squint: DNS. Again?
[10:19:01] I restarted the kube-api servers in eqiad and everything cleared out
[10:19:09] nah I think it is a knative bug
[10:19:45] Hmm. So this DNS resolution is happening against the k8s DNS resolvers, right?
[10:20:02] in theory yes
[10:20:10] or do you think it's a knative-internal resolver that falls over somehow?
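A minimal sketch of the boolean pitfall isarantopoulos describes at 09:20 above: a `true` in a YAML values file tends to reach the Python model server as the string "True" (for example via an environment variable), so naive comparisons and truthiness checks misfire. The variable name and parsing helper below are hypothetical, not the actual Lift Wing configuration keys.

```python
import os

# Hypothetical flag name; in a values.yaml it would be written as
#   use_process_pool: true
# but by the time it reaches the container it is usually the *string* "True".
raw = os.environ.get("ASYNCIO_USE_PROCESS_POOL", "false")

# Buggy: a string is never equal to the Python literal True.
buggy_comparison = raw == True  # always False for a string value

# Also buggy: any non-empty string (including "false") is truthy.
buggy_truthiness = bool(raw)

# Safer: parse the string explicitly.
use_process_pool = str(raw).strip().lower() in ("1", "true", "yes")

print(buggy_comparison, buggy_truthiness, use_process_pool)
```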
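And a self-contained sketch of the thread pool vs. process pool behaviour elukey walks through between 09:46 and 09:49: `loop.run_in_executor` with the default thread pool keeps blocking I/O off the event loop but cannot parallelise CPU-bound scoring, while a `ProcessPoolExecutor` runs the work in separate Python processes, with the callable and its arguments pickled across the process boundary. The function name, timings and pool size are illustrative only, not the kserve implementation.

```python
import asyncio
import time
from concurrent.futures import ProcessPoolExecutor


def score(rev_id: int) -> float:
    """Stand-in for CPU-bound model scoring (hypothetical)."""
    start = time.time()
    while time.time() - start < 0.5:  # burn CPU for ~0.5s
        pass
    return float(rev_id)


async def main() -> None:
    loop = asyncio.get_running_loop()

    # Default executor: a thread pool. Fine for blocking I/O (e.g. a requests
    # call), but CPU-bound work like the loop above still serialises on the GIL.
    await loop.run_in_executor(None, score, 12345)

    # Process pool: the callable and its arguments are pickled and executed in
    # separate Python processes, so CPU-bound scoring can run in parallel
    # without stalling the event loop.
    with ProcessPoolExecutor(max_workers=5) as pool:
        results = await asyncio.gather(
            *(loop.run_in_executor(pool, score, rid) for rid in range(5))
        )
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```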
[10:21:38] nono it shouldn't have an internal resolver
[10:21:42] https://grafana-rw.wikimedia.org/d/-sq5te5Wk/kubernetes-dns?orgId=1&var-dc=codfw%20prometheus%2Fk8s-mlserve&from=now-2d&to=now looks weird
[10:21:49] it should be when Ilias deployed
[10:22:45] yesterday evening?
[10:23:01] Or this morning?
[10:23:37]
[10:23:38] https://sal.toolforge.org/production?p=0&q=isaranto&d=
[10:23:56] weird it was way before
[10:24:08] is SAL in UTC?
[10:24:28] yes
[10:24:58] well, you linked the prod cluster, not staging.
[10:25:15] and yesterday's pushes in SAL are for staging
[10:25:31] klausman: yes in fact I mentioned "weird it was way before" -> namely on the 24th
[10:25:42] so it doesn't match
[10:25:49] aaaah
[10:26:06] well, the staging curves also don't quite line up with pushes, for that matter
[10:26:42] after restarting kube-api on ml-serve-codfw the latencies recovered as well
[10:27:01] only staging is left
[10:27:10] maybe we can use it to see if we can find anything useful
[10:28:16] I need to go out for an errand, definitely weird
[10:28:23] I didn't expect that drop in DNS queries
[10:28:24] mmmm
[10:28:29] In the 7d graphs of latencies for serve-codfw, do you see an increase over the day yesterday? Or am I imagining things?
[10:28:48] kube api latencies? Can you link the graph?
[10:29:00] https://grafana.wikimedia.org/goto/eyU7SMK4k?orgId=1
[10:29:59] Before midnight on the 29th, it's a lot more volatile, but with lower minimum latencies. Afterwards, it's more steady but higher overall
[10:30:24] yep
[10:30:27] no idea
[10:30:46] * elukey afk for a bit
[11:21:53] back!
[11:27:19] o/
[11:36:16] klausman: ok a 30d view looks less weird https://grafana.wikimedia.org/d/-sq5te5Wk/kubernetes-dns?from=now-30d&orgId=1&to=now&var-dc=codfw%20prometheus%2Fk8s-mlserve
[11:36:39] I have no idea why we have that variation of dns requests, but the drop is not something completely unexpected
[11:37:18] Latency is pretty consistent over that timeframe as well
[11:37:33] But yeah, the # of requests is weird
[11:37:50] we'll see, let's keep an eye on it
[11:38:00] my bet is that the knative controllers are full of bugs
[11:38:07] 0.18.1 was released 2y ago
[11:41:49] 10Machine-Learning-Team, 10Patch-For-Review: Test revscoring model servers on Lift Wing - https://phabricator.wikimedia.org/T323624 (10isarantopoulos) Re-ran the test and edited the previous message. Much better results, and it seems that latency doesn't increase over time as it happens in the non MP version....
[11:44:47] isaranto: \o/
[11:45:10] elukey: /o
[11:45:50] heading out for lunch!
[11:46:04] same here
[12:08:26] * klausman as well
[13:19:44] aiko: o/
[13:20:00] about the 1000 request test - was it done pyspark -> Lift Wing -> MW API?
[13:20:21] I can try to inspect logs if you re-run it, so we can see if anything weird pops up
[13:34:09] elukey: yeah it was pyspark -> Lift Wing -> MW API
[13:34:35] ok I'm gonna rerun it
[13:38:47] klausman: I was reading https://knative.dev/docs/serving/services/custom-domains, that seems interesting
[13:39:13] IIUC with new knative versions we could avoid the long isvc-name.namespace.wikimedia.org
[13:39:21] in theory, omitting the namespace etc..
[13:39:44] Hmm, interesting. That would make the API GW rewrites a little simpler, too
[13:39:52] knative ships one extra controller and a webhook for these things, seems a little overkill but
[13:40:38] In an ideal world, the API GW would auto-discover the services on LW and add relevant routing info.
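A rough sketch of what the shorter hostnames discussed at 13:38-13:40 could look like after a knative upgrade: the custom-domains page linked above describes a DomainMapping resource (served by the extra controller and webhook mentioned there) that maps an arbitrary hostname onto an existing Knative Service, so the namespace no longer has to appear in the URL. The apiVersion differs between knative releases and the hostname, namespace and service names below are purely illustrative, not the actual Lift Wing objects.

```yaml
# Hypothetical example only; check the apiVersion shipped with the target
# knative-serving release (v1alpha1 or v1beta1 depending on the version).
apiVersion: serving.knative.dev/v1beta1
kind: DomainMapping
metadata:
  # The short external hostname we would like to expose.
  name: enwiki-goodfaith.wikimedia.org
  namespace: revscoring-editquality-goodfaith
spec:
  ref:
    # The existing Knative Service backing the InferenceService.
    name: enwiki-goodfaith-predictor-default
    kind: Service
    apiVersion: serving.knative.dev/v1
```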
[13:40:48] An SRE can dream :)
[13:41:04] yeah sure :D
[13:41:25] let's keep the current config for api-gw but keep in mind that after the upgrade we could in theory simplify
[13:41:31] Ack.
[13:42:05] I'm also trying to figure out if we could make the config a bit more DRY-y, but it's not at the forefront of the effort
[13:43:15] elukey: I just re-ran it and it was hitting the inference-staging endpoint
[13:43:49] elukey: 44 missing responses this time
[13:50:51] aiko: what isvc?
[13:51:44] enwiki-goodfaith ok
[13:51:52] revert-risk-model
[13:52:21] experimental namespace
[13:52:29] ah snap I got sidetracked by a previous test attempt sorry (just opened the istio dashboard)
[13:53:40] Morning all!
[13:55:02] o/
[13:55:10] aiko: checked the istio-proxy logs, all 200s afaics
[13:57:50] elukey: yeah that's the weird part.. mwapi returns 200 but the response was something like {'batchcomplete': True, 'query': {'pages': [{'pageid': 1102369, 'missing': True}]}} if you check kserve logs
[13:59:20] ok so in the last batch I can see two queries
[13:59:21] /w/api.php?action=query&formatversion=2&prop=revisions&pageids=60790710&rvlimit=1&rvdir=newer&rvslots=main&format=json
[13:59:31] "/w/api.php?action=query&formatversion=2&prop=revisions&pageids=1102369&rvlimit=1&rvdir=newer&rvslots=main&format=json"
[13:59:36] are those right?
[13:59:40] as we expect them I mean
[14:01:11] aiko: --^
[14:02:05] yeah that's right
[14:03:14] two rev_ids (es) 142965340 -> pageid is 1102369, and (en) 1099209076 -> pageid is 60790710
[14:04:27] not only pageid, we would also query their parent rev_id
[14:05:27] aiko: what about rvlimit etc..?
[14:05:34] is mwapi setting different ones?
[14:06:23] just trying to find differences
[14:06:54] the query was written here https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/blob/main/knowledge_integrity/revision.py#L208
[14:07:18] not sure the purpose of rvlimit
[14:08:05] aiko: and when you tested pyspark -> mwapi did you add the same parameters?
[14:08:20] a bit (or a lot) off-topic: What is the retention on kafka topics? and has anyone figured out how we could do checkpointing with benthos? or even start from a specific offset
[14:09:12] isaranto: IIRC we have size based + time based retention (so if the size exceeds we drop automatically, otherwise we keep it for a week I think, lemme check)
[14:09:13] elukey: yeah I used the same parameters
[14:09:47] elukey: cool, thanks! I don't need anything specific, was just curious
[14:11:04] isaranto: 168 hours :)
[14:11:27] for benthos, I believe that if you start it as a consumer group it stops/starts from the same offset (the last committed)
[14:11:46] ack
[14:11:49] checkpointing is probably something more advanced, I heard it for Flink but not benthos
[14:16:46] aiko: (I keep asking trivial questions please stop me if I say something silly)
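For reference, the revisions query that knowledge_integrity builds (and that is reproduced with curl further down in the log) can be replayed outside pyspark with a few lines of requests; this is only a debugging aid under the parameters visible in the log, not the knowledge_integrity code itself. `rvdir=newer` asks for revisions oldest-first, so together with `rvlimit=1` the call returns only the page's first revision. The User-Agent string is a placeholder.

```python
import requests

# Placeholder UA; any client hitting the MW API should identify itself.
HEADERS = {"User-Agent": "liftwing-debug/0.0 (ml@example.org)"}


def first_revision(host: str, pageid: int) -> dict:
    """Replay the revisions query seen in the kserve/istio logs.

    rvdir=newer lists revisions oldest-first, so rvlimit=1 returns only the
    page's first revision; rvslots=main restricts the output to the main
    content slot.
    """
    params = {
        "action": "query",
        "formatversion": 2,
        "prop": "revisions",
        "pageids": pageid,
        "rvlimit": 1,
        "rvdir": "newer",
        "rvslots": "main",
        "format": "json",
    }
    resp = requests.get(f"{host}/w/api.php", params=params,
                        headers=HEADERS, timeout=10)
    resp.raise_for_status()
    return resp.json()


print(first_revision("https://es.wikipedia.org", 1102369))
```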
[14:16:56] I wonder if they have anything special
[14:17:14] but yeah I cannot repro it via curl either
[14:20:10] There are no trivial questions, only questions
[14:21:02] chrisalbon: I am struggling to keep up with Aiko's speed so I come up with excuses :D
[14:24:22] aiko: one weird thing - If I query the following I get only missing: true
[14:24:23] https://api-ro.discovery.wmnet/w/api.php?action=query&formatversion=2&prop=revisions&pageids=1102369&rvlimit=1&rvdir=newer&rvslots=main&format=json
[14:24:34] and this is one of the two pageid queries
[14:24:46] from stat1004
[14:25:27] even https://en.wikipedia.org/w/api.php?action=query&formatversion=2&prop=revisions&pageids=1102369&rvlimit=1&rvdir=newer&rvslots=main&format=json
[14:26:57] yeah https://en.wikipedia.org/w/index.php?curid=1102369
[14:32:39] elukey: no no they're not trivial questions
[14:33:27] elukey: ah pageid 1102369 is from eswiki, you were querying enwiki
[14:34:32] https://es.wikipedia.org/w/index.php?curid=1102369
[14:35:50] ahh
[14:36:03] 10Machine-Learning-Team, 10Patch-For-Review: Test revscoring model servers on Lift Wing - https://phabricator.wikimedia.org/T323624 (10isarantopoulos) And various tests with wrk: 1 minute - 1 connection - 1 thread ` isaranto@deploy1002:~/scripts$ wrk -c 1 -t 1 --timeout 2s -s inference.lua https://inference-sta...
[14:39:57] elukey: the weird thing is out of 1000 requests, 44 responses missing. so there were still 956 requests fetching data correctly for these two pages.
[14:40:49] aiko: the even weirder thing is that via curl I don't see any glitch even with tons of requests
[14:41:05] and the response is too mw-api specific, it is not raised by istio etc..
[14:43:08] aiko: and you also wrote a script in python using the knowledge integrity package, calling get_page say 500 times?
[14:43:08] elukey: yeah it seems to only happen with pyspark
[14:43:44] pyspark -> lift wing
[14:52:01] elukey: yep I've tried that and that's no problem
[16:19:13] isaranto: o/ we are working on Flink stuff as part of Event Platform : https://www.mediawiki.org/wiki/Platform_Engineering_Team/Event_Platform_Value_Stream if you are interested we'd love to talk sometime :)
[16:33:55] ottomata: /o nice to meet u :) the team has told me about the ongoing effort with Flink, sounds super cool. we could def have a chat (even to get to know each other)
[16:36:13] 10Machine-Learning-Team, 10serviceops, 10Patch-For-Review: Fix calico, cfssl-issuer and knative-serving Helm dependencies - https://phabricator.wikimedia.org/T303279 (10JMeybohm) When deploying the updated calico chart to my test cluster I realized that the fixed dependency to the CRD chart does not mean tha...
[16:44:01] isaranto: cool! I put a short meeting on our calendars for Monday, feel free to move it around :)
[16:44:47] ottomata: looks good, cu then!
[17:25:02] 10Machine-Learning-Team, 10serviceops, 10Patch-For-Review: Fix calico, cfssl-issuer and knative-serving Helm dependencies - https://phabricator.wikimedia.org/T303279 (10JMeybohm) CC @BTullis & @bking: This might be relevant for operators as well.
[17:33:19] * elukey afk!
[17:33:22] o/
[17:35:28] 10Machine-Learning-Team, 10serviceops, 10Patch-For-Review: Fix calico, cfssl-issuer and knative-serving Helm dependencies - https://phabricator.wikimedia.org/T303279 (10BTullis) Thanks @JMeybohm - I'll definitely bear that in mind. From my work so far with the spark-operator it seems that the operator //its...
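One takeaway from the exchange at 14:24-14:35 above: a pageid is only meaningful on the wiki it came from, which is why the curl reproduction against enwiki (and against api-ro without the eswiki host) returned `missing: True` for pageid 1102369 even though the page exists on eswiki. A tiny sketch of that mix-up follows; it explains the failed reproduction only, not the 44 missing responses, which remain unexplained in this log. The User-Agent string is again a placeholder.

```python
import requests

PARAMS = {
    "action": "query", "formatversion": 2, "prop": "revisions",
    "pageids": 1102369, "rvlimit": 1, "rvdir": "newer",
    "rvslots": "main", "format": "json",
}
UA = {"User-Agent": "liftwing-debug/0.0 (ml@example.org)"}  # placeholder

for host in ("https://en.wikipedia.org", "https://es.wikipedia.org"):
    page = requests.get(f"{host}/w/api.php", params=PARAMS,
                        headers=UA, timeout=10).json()["query"]["pages"][0]
    # On enwiki this prints missing=True; on eswiki it returns a revision,
    # because pageid 1102369 only identifies a page within eswiki.
    print(host, "missing =", page.get("missing", False))
```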