[07:29:06] 10Machine-Learning-Team, 10Add-Link, 10Growth-Team, 10User-notice: Deploy "add a link" to 7th round of wikis - https://phabricator.wikimedia.org/T304551 (10kevinbazira) Model evaluation has been completed and below are the backtesting results: | | Precision@0.5 | Recall@0.5 |arywiki | 0.79 | 0.44 |dawiki...
[09:23:54] elukey: the request latency increase was very sharp and started two days ago (21st, 00:53) https://grafana.wikimedia.org/goto/bPSiLBO4k?orgId=1
[09:24:20] At a time like that, I very much doubt it was any of us messing with something
[09:24:40] this time the issue was different
[09:24:41] "List" url:/apis/admissionregistration.k8s.io/v1/validatingwebhookconfigurations
[09:24:46] so not knative-related
[09:24:54] Also, at the same time API Errors jumped up as well
[09:25:07] https://grafana.wikimedia.org/goto/u7TMYfOVk?orgId=1
[09:25:22] yep yep
[09:26:55] And on eqiad, the jump up was the evening before, at 21:57. Weird.
[09:27:48] And now the error rates and latency are low again?
[09:27:51] so far the mitigations that I have been using are: 1) restart the api-servers 2) kill/respawn the knative pods
[09:28:04] ah, that explains that :)
[09:28:15] klausman: yeah I have restarted the kube api (but not killed the pods)
[09:28:19] I wonder if this is a state they "just" get into after running for a while
[10:40:47] at the moment we can only deploy the models one by one right? which one should I deploy first? (on staging I mean)
[10:41:10] elukey: which model is the one we want to test with MP?
[10:48:14] isaranto: o/
[10:48:30] yes, one at a time, all revscoring-based ones
[10:48:33] one nit
[10:48:33] /o I deployed all of them on staging for now
[10:48:53] we currently have some issues while deploying a lot of pods (basically only in prod):
[10:49:23] 1) a ton of log spam is pushed to the Kafka logging clusters (where syslog is sent etc..) and an alert fires, but people should be aware (in case they ping you)
[10:49:34] u mean when we deploy a lot of pods at the same time?
[10:50:22] 2) k8s latencies might suffer if we roll out different model-servers / namespaces in close sequence (like: revscoring-editquality-goodfaith && 10 sec later another one && another one)
[10:50:32] so we prefer to take some mins between each deploy in prod
[10:50:42] it is a longer procedure I know, but better to be careful
[10:51:14] isaranto: sorry I meant namespaces where we have a lot of inference services deployed (like revscoring-editquality-goodfaith, check the number of pods in prod for those)
[10:51:27] staging is usually fine, you can go quicker
[10:52:17] isaranto: also, if you can, join #wikimedia-operations on IRC
[10:53:42] (we have bots posting alerts etc..)
[11:34:01] * elukey lunch!
[11:44:53] ditto
[13:32:37] where are our alerts redirected? IRC, Slack, somewhere else?
[13:34:40] https://alerts.wikimedia.org/ ?
[13:38:02] what I mean is: do all of them go to the #wikimedia-operations IRC channel or are they grouped/labeled per team/namespace and sent elsewhere?
[13:38:02] As u understand I'll be posting random questions here, I hope I won't annoy any1 :D. Whoever knows may answer; if I don't find the answer or I don't understand sth I'll ask again
[14:02:56] also, what is the URI for querying the ml-staging models in codfw? Should I be able to figure it out myself from somewhere?
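(A rough sketch of the paced, one-namespace-at-a-time prod rollout suggested earlier in this log around 10:48-10:52; the helmfile.d path, environment name and namespace list are illustrative assumptions, not the team's actual commands.)

    # Sketch only: roll out one revscoring namespace at a time and pause between deploys,
    # as advised above. Path, environment and namespace names are assumptions.
    cd /srv/deployment-charts/helmfile.d/ml-services     # hypothetical deploy-host layout
    for ns in revscoring-editquality-goodfaith revscoring-editquality-damaging; do
        (cd "$ns" && helmfile -e ml-serve-codfw apply)   # deploy a single namespace
        kubectl get pods -n "$ns"                        # sanity-check the new pods
        kubectl get isvc -n "$ns"                        # needs list access on inferenceservices
        sleep 300                                        # leave a few minutes before the next one
    done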
Seems like I don't have list access for inferenceservices (isvc) either
[14:03:47] export isaranto_question_spamming_mode=OFF :D
[14:07:07] AIUI, most alerts go to the ops channel, all are visible on alerts.wikimedia.org, and some go to team-specific channels
[14:11:15] yep --^
[14:11:31] alerts.wikimedia.org is the place to check, it also groups for teams etc..
[14:11:34] very handy
[14:11:46] and they are posted to #operations as well
[14:12:13] it is nice to be in the chan so if an alert fires you can write and say that it is related to a deployment or XYZ
[14:12:20] so other SREs will not take action
[14:13:35] the URL for staging is: inference-staging.discovery.wmnet:30443, and for prod you just remove the -staging bit. You could get it from puppet by yourself but that repo is a collection of sadness and horrors that I'd skip for you at the moment :)
[14:13:39] isaranto: --^
[14:16:58] thanks!
[14:18:23] and please keep posting any questions in here, so people can read and answer at any time
[14:50:08] 10Machine-Learning-Team, 10Add-Link, 10Growth-Team, 10User-notice: Deploy "add a link" to 7th round of wikis - https://phabricator.wikimedia.org/T304551 (10kevinbazira) The conclusion on the backtesting results is that most of the languages look fine but there are some **redflags**: - //diqwiki// has very...
[15:00:24] elukey: inference-staging.svc.codfw.wmnet seems to work but could not resolve host for inference-staging.discovery.wmnet
[15:00:45] Can someone invite me to the team meeting plz? :)
[15:01:56] done, thanks!
[15:02:39] 10Lift-Wing, 10Machine-Learning-Team: Decide external URL scheme (on API GW) for models on Lift Wing - https://phabricator.wikimedia.org/T319178 (10klausman) After some discussion, we have decided that the API-GW side URL scheme for LW should look like: `/lw/inference/v1/models/[model name]:predict` so for e...
[15:02:51] isaranto: ah yes sorry my bad!
[15:02:57] discovery is only for prod
[15:03:15] (basically it balances between eqiad and codfw's svc endpoints based on the client's location)
[16:19:23] 10Machine-Learning-Team, 10Add-Link, 10Growth-Team, 10CommRel-Specialists-Support (Oct-Dec-2022): Inspect "add a link" models to improve their performance - https://phabricator.wikimedia.org/T309263 (10Trizek-WMF)
[16:20:53] 10Machine-Learning-Team, 10Add-Link, 10Growth-Team, 10CommRel-Specialists-Support (Oct-Dec-2022): Inspect "add a link" models to improve their performance - https://phabricator.wikimedia.org/T309263 (10Trizek-WMF)
[16:21:26] 10Machine-Learning-Team, 10Add-Link, 10Growth-Team, 10User-notice: Deploy "add a link" to 7th round of wikis - https://phabricator.wikimedia.org/T304551 (10Trizek-WMF) I added the three wikis we skip to the to-be-checked list at T309263.
[16:49:45] have a nice rest of the day folks!
[16:49:47] * elukey afk
[16:51:14] \o
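(For reference, a hedged sketch of querying a model server on the ml-staging endpoint confirmed above at 15:00; the model name, namespace, Host header and payload are illustrative assumptions, not values taken from this log.)

    # Sketch only: POST to a revscoring model server on ml-staging via the node port above.
    # Model name, namespace (and hence the Host header) and the payload are assumptions.
    curl -s -X POST \
        "https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict" \
        -H "Host: enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org" \
        -H "Content-Type: application/json" \
        -d '{"rev_id": 12345}'
    # Per T319178, the same model would be exposed externally behind the API gateway as
    # /lw/inference/v1/models/enwiki-goodfaith:predict (gateway host and auth are out of
    # scope for this log).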