[07:29:06] 10Machine-Learning-Team, 10Add-Link, 10Growth-Team, 10User-notice: Deploy "add a link" to 7th round of wikis - https://phabricator.wikimedia.org/T304551 (10kevinbazira) Model evaluation has been completed and below are the backtesting results: | | Precision@0.5 | Recall@0.5 |arywiki | 0.79 | 0.44 |dawiki...
[09:23:54] elukey: the request latency increase was very sharp and started two days ago (21st, 00:53) https://grafana.wikimedia.org/goto/bPSiLBO4k?orgId=1
[09:24:20] At a time like that, I very much doubt it was any of us messing with something
[09:24:40] this time the issue was different
[09:24:41] "List" url:/apis/admissionregistration.k8s.io/v1/validatingwebhookconfigurations
[09:24:46] so not knative-related
[09:24:54] Also, at the same time API Errors jumped up as well
[09:25:07] https://grafana.wikimedia.org/goto/u7TMYfOVk?orgId=1
[09:25:22] yep yep
[09:26:55] And on eqiad, the jump up was the evening before, at 21:57. Weird.
[09:27:48] And now the error rates and latency are low again?
[09:27:51] so far the mitigations that I have been using are: 1) restart the api-servers 2) kill/respawn the knative pods
[09:28:04] ah, that explains that :)
[09:28:15] klausman: yeah I have restarted the kube api (but not killed the pods)
[09:28:19] I wonder if this is a state they "just" get into after running for a while
[10:40:47] at the moment we can only deploy the models one by one right? which one should I deploy first? (on staging I mean)
[10:41:10] elukey: which model is the one we want to test with MP?
[10:48:14] isaranto: o/
[10:48:30] yes, one at a time, all revscoring-based ones
[10:48:33] one nit
[10:48:33] /o I deployed all of them on staging for now
[10:48:53] we currently have some issues while deploying a lot of pods (basically only in prod):
[10:49:23] 1) a ton of log spam is pushed to the Kafka logging clusters (where syslog is sent etc..) and an alert fires, but people should be aware (in case they ping you)
[10:49:34] u mean when we deploy a lot of pods at the same time?
[10:50:22] 2) k8s latencies might suffer if we roll out different model-servers / namespaces in close sequence (like: revscoring-editquality-goodfaith && 10 sec later another one && another one)
[10:50:32] so we prefer to take some mins between each deploy in prod
[10:50:42] it is a longer procedure I know, but better to be careful
[10:51:14] isaranto: sorry I meant namespaces where we have a lot of inference services deployed (like revscoring-editquality-goodfaith, check the number of pods in prod for those)
[10:51:27] staging is usually fine, you can go quicker
[10:52:17] isaranto: also, if you can, join #wikimedia-operations on IRC
[10:53:42] (we have bots posting alerts etc..)
[11:34:01] * elukey lunch!
[11:44:53] ditto
[13:32:37] where are our alerts redirected? IRC, Slack, somewhere else?
[13:34:40] https://alerts.wikimedia.org/ ?
[13:38:02] what I mean is: do all of them go to the #wikimedia-operations IRC channel or are they grouped/labeled per team/namespace and sent elsewhere?
[13:38:02] As u understand I'll be posting random questions here, I hope I won't annoy any1 :D. Whoever knows may answer; if I don't find the answer or I don't understand sth I'll ask again
[14:02:56] also, what is the URI for querying the ml-staging models in codfw? Should I be able to figure it out myself from somewhere?
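(A rough sketch of the paced, one-namespace-at-a-time prod rollout suggested earlier in this log around 10:48-10:52; the helmfile.d path, environment name and namespace list are illustrative assumptions, not the team's actual commands.)

    # Sketch only: roll out one revscoring namespace at a time and pause between deploys,
    # as advised above. Path, environment and namespace names are assumptions.
    cd /srv/deployment-charts/helmfile.d/ml-services     # hypothetical deploy-host layout
    for ns in revscoring-editquality-goodfaith revscoring-editquality-damaging; do
        (cd "$ns" && helmfile -e ml-serve-codfw apply)   # deploy a single namespace
        kubectl get pods -n "$ns"                        # sanity-check the new pods
        kubectl get isvc -n "$ns"                        # needs list access on inferenceservices
        sleep 300                                        # leave a few minutes before the next one
    done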
Seems like I don't have list access for inferenceservices (isvc) either
[14:03:47] export isaranto_question_spamming_mode=OFF :D
[14:07:07] AIUI, most alerts go to the ops channel, all are visible on alerts.wikimedia.org, and some go to team-specific channels
[14:11:15] yep --^
[14:11:31] alerts.wikimedia.org is the place to check, it also groups for teams etc..
[14:11:34] very handy
[14:11:46] and they are posted to #operations as well
[14:12:13] it is nice to be in the chan so if an alert fires you can write and say that it is related to a deployment or XYZ
[14:12:20] so other SREs will not take action
[14:13:35] the URL for staging is: inference-staging.discovery.wmnet:30443, and for prod you just remove the -staging bit. You could get it from puppet by yourself but that repo is a collection of sadness and horrors that I'd skip for you at the moment :)
[14:13:39] isaranto: --^
[14:16:58] thanks!
[14:18:23] and please keep posting any questions in here, so people can read and answer at any time
[14:50:08] 10Machine-Learning-Team, 10Add-Link, 10Growth-Team, 10User-notice: Deploy "add a link" to 7th round of wikis - https://phabricator.wikimedia.org/T304551 (10kevinbazira) The conclusion on the backtesting results is that most of the languages look fine but there are some **redflags**: - //diqwiki// has very...
[15:00:24] elukey: inference-staging.svc.codfw.wmnet seems to work but could not resolve host for inference-staging.discovery.wmnet
[15:00:45] Can someone invite me to the team meeting plz? :)
[15:01:56] done, thanks!
[15:02:39] 10Lift-Wing, 10Machine-Learning-Team: Decide external URL scheme (on API GW) for models on Lift Wing - https://phabricator.wikimedia.org/T319178 (10klausman) After some discussion, we have decided that the API-GW side URL scheme for LW should look like: `/lw/inference/v1/models/[model name]:predict` so for e...
[15:02:51] isaranto: ah yes sorry my bad!
[15:02:57] discovery is only for prod
[15:03:15] (basically it balances between eqiad and codfw's svc endpoints based on the client's location)
[16:19:23] 10Machine-Learning-Team, 10Add-Link, 10Growth-Team, 10CommRel-Specialists-Support (Oct-Dec-2022): Inspect "add a link" models to improve their performance - https://phabricator.wikimedia.org/T309263 (10Trizek-WMF)
[16:20:53] 10Machine-Learning-Team, 10Add-Link, 10Growth-Team, 10CommRel-Specialists-Support (Oct-Dec-2022): Inspect "add a link" models to improve their performance - https://phabricator.wikimedia.org/T309263 (10Trizek-WMF)
[16:21:26] 10Machine-Learning-Team, 10Add-Link, 10Growth-Team, 10User-notice: Deploy "add a link" to 7th round of wikis - https://phabricator.wikimedia.org/T304551 (10Trizek-WMF) I added the three wikis we skip to the to-be-checked list at T309263.
[16:49:45] have a nice rest of the day folks!
[16:49:47] * elukey afk
[16:51:14] \o
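(For reference, a hedged sketch of querying a model server on the ml-staging endpoint confirmed above at 15:00; the model name, namespace, Host header and payload are illustrative assumptions, not values taken from this log.)

    # Sketch only: POST to a revscoring model server on ml-staging via the node port above.
    # Model name, namespace (and hence the Host header) and the payload are assumptions.
    curl -s -X POST \
        "https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict" \
        -H "Host: enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org" \
        -H "Content-Type: application/json" \
        -d '{"rev_id": 12345}'
    # Per T319178, the same model would be exposed externally behind the API gateway as
    # /lw/inference/v1/models/enwiki-goodfaith:predict (gateway host and auth are out of
    # scope for this log).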