[02:54:29] Machine-Learning-Team, artificial-intelligence, editquality-modeling, Chinese-Sites: Include pinyin for zhwiki damaging model - https://phabricator.wikimedia.org/T223750 (Shizhao)
[08:48:39] hello folks
[08:53:24] klausman: o/
[08:53:34] so https://phabricator.wikimedia.org/T327767#8630392 lists the two remaining issues for knative-serving
[08:53:42] two pods fail to bootstrap for some reason
[08:53:45] - the webhook
[08:54:04] - the domain-controller-webhook (new one, still not 100% sure if we need it or not but it is part of the standard deployment IIUC)
[08:55:05] the latter seems to be complaining about a logging config issue (still haven't figured out where), the former is more subtle - the self-signed CA/certificate is not autogenerated in time before the webhook starts getting traffic, so it fails
[09:04:13] ah maybe I found something for the domain controller
[09:04:18] I didn't add the network policies
[09:08:23] basically https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/890775
[09:19:20] elukey: \O
[09:20:56] LGTM'd
[09:21:03] thanks :)
[09:21:22] the remaining one is the webhook, I am going to re-deploy to see if anything changed
[09:21:27] Ack.
[09:22:01] Unrelatedly I got a mail from WMCS about wikilabels machines still being on stretch (I couldn't find any), I have replied thus and will follow up on finding out what's going on there
[09:23:18] super
[09:24:23] ah interesting https://github.com/knative/serving/pull/9661
[09:24:39] from 2020 but apparently it is not in 1.7.0's config afaics
[09:25:02] Looks like an easy enough fix.
[09:25:41] yes but I am a little puzzled, knative 1.7.0 should be more recent than 2020
[09:26:54] https://github.com/knative/serving/releases/tag/knative-v1.7.2
[09:26:56] yeah 2022
[09:29:01] Hmmm.
[09:29:21] ok I checked https://github.com/knative/serving/releases/download/knative-v1.9.0/serving-core.yaml and the issue is still there
[09:31:04] So I asked git what tags contain commit c7f29cdbba7302c84aade261f49431d2f7d73a8f (https://github.com/knative/serving/commit/c7f29cdbba7302c84aade261f49431d2f7d73a8f)
[09:31:17] and knative-1.7.0 is there
[09:31:35] But so is 1.0.0, so my git command is BS :D
[09:32:39] Hm, even GH agrees that it should be in 1.7.0
[09:33:03] the serving-core.yaml stuff is probably handled elsewhere
[09:33:14] yeah, true
[09:34:39] well in theory the relevant bits are there
[09:34:39] failureThreshold: 6
[09:34:40] initialDelaySeconds: 20
[09:40:03] Machine-Learning-Team, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (JMeybohm) Global depool of a/a services from codfw is done.
[09:46:57] so I managed to make it bootstrap by extending the readiness probe value a lot
[09:47:04] err sorry, the liveness probe
[09:51:24] yeah with a long delay it works
[10:01:04] How long?
[10:01:41] Lift-Wing, Machine-Learning-Team: Investigate AlibiExplainer for Revert-Risk model - https://phabricator.wikimedia.org/T330131 (achou)
[10:01:58] I added 120s, but afaics it wants something more than 60s
[10:04:38] klausman: filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/890780
[10:10:23] ack
[10:11:36] lgtm
[10:23:02] thanks
[10:23:49] let's see if this time is better
[10:31:58] ok there is a new problem, but the others are fixed :D
[10:33:08] ah it seems to be the liveness probe again, but for another pod (domainmapping-webhook)
[10:33:58] yeah same thing
[10:33:59] uff
[10:34:28] sending a change, but a manual tweak did the trick
[10:34:30] all pods up!
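For reference, the liveness probe values quoted above (failureThreshold: 6, initialDelaySeconds: 20) can be turned into a rough startup budget. This is only a sketch under assumptions that are not in the log: it assumes the Kubernetes default periodSeconds of 10, ignores timeoutSeconds and kubelet jitter, and the helper name `earliest_restart_seconds` is made up for illustration. The log also does not say whether 120s was the new initial delay or an increment; the second call treats it as the new value.

```python
def earliest_restart_seconds(initial_delay: int, failure_threshold: int, period: int = 10) -> int:
    """Rough time (in seconds after container start) at which a never-healthy
    container would be restarted by its liveness probe.

    Assumption: the first probe fires at `initial_delay` and the following ones
    every `period` seconds, so the Nth consecutive failure lands at
    initial_delay + (N - 1) * period. period=10 is the Kubernetes default;
    the log does not quote periodSeconds, so this is a guess.
    """
    if failure_threshold < 1:
        raise ValueError("failure_threshold must be >= 1")
    return initial_delay + (failure_threshold - 1) * period


# Values quoted in the log: initialDelaySeconds=20, failureThreshold=6
print(earliest_restart_seconds(20, 6))   # 70
# Hypothetical: initial delay raised to 120 seconds
print(earliest_restart_seconds(120, 6))  # 170
```

Under these assumptions a pod that needs more than about a minute to generate its certificate has little headroom with the original values, which would fit the "with a long delay it works" observation, but the real periodSeconds in the chart may differ.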
[10:34:37] (knative ones)
[10:34:57] Lift-Wing, Machine-Learning-Team: Investigate AlibiExplainer for Revert-Risk model - https://phabricator.wikimedia.org/T330131 (achou) Before proceeding with attaching an explainer to revert-risk isvc, we should test the explanation algorithm of interest on statbox to see if the explanation the model ret...
[10:35:46] elukey: nice work!
[10:37:18] trying kserve now
[10:37:34] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/890783/
[10:38:39] worked :)
[10:39:17] let's try with some isvcs
[10:42:24] goodfaith pods up, but of course a request doesn't work (404)
[10:44:52] was the "of course" sarcasm or actually expected?
[10:45:19] the former :D
[10:47:24] something may have changed in the routing bits between istio/knative
[10:47:35] I see the proxy config via istioctl deployed correctly
[10:51:01] ah yes!
[10:51:03] enwiki-goodfaith-predictor-default.revscoring-editquality-goodfaith.wikimedia.org
[10:51:07] there is the -default now
[10:52:08] yep works :)
[10:52:35] isaranto: I am still working on ml-staging :)
[10:52:55] 🤦‍♂️
[10:53:10] but if it works please go ahead :)
[10:53:15] of course, I wasn't concentrating
[10:53:22] I mean I was about to roll out other namespaces
[10:53:30] you picked the right moment to deploy
[10:53:34] :D
[10:53:44] perfect timing (we just fixed the last bugs and deployed kserve)
[10:53:56] so far goodfaith works, but I had to modify the host header
[10:54:03] (see above)
[10:54:45] it is not my day today (mentally)
[10:55:39] you were synced with ml-staging-codfw though, it felt ready to get new pods :D
[10:57:33] elukey: btw, the row outage today in codfw will be mostly a no-op, like last time, right?
[10:57:54] klausman: in theory yes
[10:58:04] Alright. I'll keep an eye on it
[10:58:31] but it happens later on right?
[10:59:14] Calendar says 15:00-17:00 CET
[10:59:37] ack
[11:00:47] tested sending an event to eventgate, I get a 500 in return
[11:02:56] ahh there you go
[11:02:57] EVENTGATE_STREAM: mediawiki.revision-score-goodfaith
[11:03:02] this should have been "test"
[11:03:05] my bad, fixing
[11:04:17] but so far most of the things work
[11:10:33] Machine-Learning-Team, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (jbond)
[11:16:40] Machine-Learning-Team, DBA, Data-Persistence, Infrastructure-Foundations, and 10 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (jbond)
[11:18:11] Machine-Learning-Team, DBA, Data-Persistence, Infrastructure-Foundations, and 10 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (fgiunchedi)
[11:18:42] created https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/890795
[11:18:51] to fix/update the eventgate's stream values
[11:33:11] I have deployed all model servers and didn't see any alert firing
[11:33:22] good improvement :)
[11:37:17] aaaand events are ok now, I see them in kafka
[11:37:23] so in theory... upgrade done
[11:37:26] \o
[11:37:29] \o/
[11:37:39] now we have to check for errors, alarms, etc. over the next few days
[11:37:56] lemme also run Ilias' httpbb script
[11:40:23] ah it doesn't work because of the -default change in the Host header
[11:40:27] will check it after lunch :)
[11:40:33] * elukey lunch
[11:41:24] * isaranto lunch and a run
[12:41:16] Machine-Learning-Team: Upgrade the ml-staging-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T327767 (elukey) Cluster up and running with k8s 1.23! All pods are working, and I already noticed some improvements: no alert has been fired (latency-wise) when deploying all the model-server pods. I...
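The Host header change discussed above (the new `-default` part) can be sketched as a small helper. This is a minimal illustration inferred only from the hostname pasted in the chat; the function name `isvc_host` and its parameters are hypothetical, and the real naming rules live in knative/kserve, not here.

```python
def isvc_host(name: str, component: str, namespace: str, domain: str = "wikimedia.org") -> str:
    """Build the Host header value seen after the upgrade.

    Pattern inferred from the hostname pasted in the log:
    <isvc name>-<component>-default.<namespace>.<domain>
    This is an assumption for illustration, not an authoritative rule.
    """
    for part in (name, component, namespace, domain):
        if not part:
            raise ValueError("all hostname parts must be non-empty")
    return f"{name}-{component}-default.{namespace}.{domain}"


# Reproduces the goodfaith hostname from the log
print(isvc_host("enwiki-goodfaith", "predictor", "revscoring-editquality-goodfaith"))
```

A test script with hard-coded Host headers would need a similar adjustment, which is presumably why the httpbb run failed before the hostnames were updated.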
[12:46:36] * klausman lunch
[12:54:55] Machine-Learning-Team, Edit-Review-Improvements-Integrated-Filters, Growth-Team, Research: Integration of Revert Risk Scores to Recent Changes as a filter - https://phabricator.wikimedia.org/T329071 (achou)
[13:07:06] Lift-Wing, Machine-Learning-Team: Support the Revert-Review API/tool on Toolforge - https://phabricator.wikimedia.org/T330148 (achou)
[13:09:18] Lift-Wing, Machine-Learning-Team: Support the Revert-Review API/tool on Toolforge - https://phabricator.wikimedia.org/T330148 (achou) If we decide to call Lift Wing API (public) to get the revert-risk scores, the ML team needs to add the revert-risk model to the API-Gateway. Question: * What is the exp...
[13:13:04] Lift-Wing, Machine-Learning-Team: Support the Revert-Review API/tool on Toolforge - https://phabricator.wikimedia.org/T330148 (achou)
[13:18:03] Lift-Wing, Machine-Learning-Team: Support the Revert-Review API/tool on Toolforge - https://phabricator.wikimedia.org/T330148 (achou)
[13:40:03] Machine-Learning-Team: Investigate procuring and installing two GPUs on Lift Wing - https://phabricator.wikimedia.org/T327923 (elukey) Super interesting article found by Ilias: https://journal.arrikto.com/gpu-virtualization-in-k8s-challenges-and-state-of-the-art-a1cafbcdd12b Not sure if https://docs.nvidia...
[13:41:29] Machine-Learning-Team, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=aec8ddda-9ad5-4b7f-8bca-c273e036a282) set by ayounsi@cumin1001 for 2:0...
[13:49:39] great work team! 🎉
[13:50:34] I am getting errors from article-topic outlink
[13:50:50] ah great :)
[13:50:55] what do you get?
[13:51:26] I see a `ReconcileIngressFailed` in the 2 routes that exist (predictor, transformer), and if I describe them I get `TLSNotEnabled`
[13:52:40] wait a min, how are you checking it?
[13:54:21] I ran a kubectl get all
[13:55:22] Machine-Learning-Team, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Vgutierrez)
[13:55:22] ah didn't know it
[13:55:28] and also when I ran ```curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/outlink-topic-model:predict" -X POST -d @input_outlink.json -i -H "Host: outlink-topic-model-transformer-default.articletopic-outlink.wikimedia.org" --http1.1```
[13:55:44] I got ```{"error": "[Errno -2] Name or service not known"```
[13:55:57] interesting, I see the same error for the other pods
[13:55:59] but they work
[13:56:21] for models that use a transformer we hit the transformer which then communicates with the predictor. maybe it has to do with that?
[13:57:16] nono even for revscoring models I see the reconcile failures
[13:57:17] autoTLS is not enabled
[13:58:26] ah ok then. I just pointed it out because I saw it was ok in prod.
[14:00:31] nono it is definitely an issue
[14:00:35] I was quoting the error msg
[14:00:41] it must be a new thing in knative
[14:02:56] isaranto: the weird thing is that everything works
[14:14:21] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 10th round of wikis - https://phabricator.wikimedia.org/T308135 (kevinbazira) The conclusion on the backtesting results is that most of the languages look fine besides: - klwiki (0.74), and kmwiki (0.70) have a prec...
[14:37:14] Machine-Learning-Team, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (ayounsi) Upgrade went smoothly, less than 15min hard downtime here too.
[14:37:52] Machine-Learning-Team, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (fgiunchedi)
[14:45:14] Machine-Learning-Team, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (jbond)
[14:51:22] Machine-Learning-Team, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (fgiunchedi)
[15:00:38] folks I am helping service ops right now, I'll be 10 mins late
[15:05:56] Machine-Learning-Team, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (jcrespo) I restarted es5 codfw backup job, the only backup-related thingy affected by the downtime.
[15:06:38] Machine-Learning-Team, Data-Engineering, Data-Persistence, Discovery-Search, and 7 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ayounsi) p:Triage→Medium
[15:07:17] Machine-Learning-Team, Data-Engineering, Data-Persistence, Discovery-Search, and 7 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ayounsi)
[15:11:37] Machine-Learning-Team, Data-Engineering, Data-Persistence, Discovery-Search, and 7 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (jcrespo)
[15:13:16] Machine-Learning-Team, Data-Engineering, Data-Persistence, Discovery-Search, and 7 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (MoritzMuehlenhoff)
[15:22:26] Machine-Learning-Team, MediaWiki-extensions-ORES, ORES, MediaWiki-Special-pages: Highlight edits that are likely to have problems different colors on User contributions page - https://phabricator.wikimedia.org/T328728 (elukey)
[15:27:26] Machine-Learning-Team, Patch-For-Review: Implement new mediawiki.revision-score streams with Lift Wing - https://phabricator.wikimedia.org/T328576 (elukey) a:elukey
[15:28:59] Lift-Wing, Machine-Learning-Team, Documentation: Create technical documentation for Lift Wing Infrastructure - https://phabricator.wikimedia.org/T276601 (elukey) Hi @Chtnnh! Is it something that you are planning to work on? Or should I unassign?
[15:33:23] Machine-Learning-Team: Automate the procedure to bootstrap minikube on the ML-Sandbox and to share it by multiple users - https://phabricator.wikimedia.org/T305447 (elukey) a:klausman
[15:51:08] Machine-Learning-Team, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Jelto)
[15:57:57] Lift-Wing, Machine-Learning-Team: Investigate AlibiExplainer for Revert-Risk model - https://phabricator.wikimedia.org/T330131 (achou) @isarantopoulos mentioned that we can also explore the [[ https://ai-explainability-360.org/ | AI Explainability 360 (AIX360) ]], another explainability open-source libra...
[15:58:15] Lift-Wing, Machine-Learning-Team: Investigate Explainer for Revert-Risk model - https://phabricator.wikimedia.org/T330131 (achou)
[16:03:27] Machine-Learning-Team, CirrusSearch, Discovery-Search: Add outlink topic model predictions to CirrusSearch indices - https://phabricator.wikimedia.org/T328276 (EBernhardson) once articletopic transfers to the link-based topic modeling drafttopic should be the only one we still need, afaik.
[16:05:27] * elukey out for a walk
[16:35:23] Lift-Wing, Machine-Learning-Team: Investigate Explainer for Revert-Risk model - https://phabricator.wikimedia.org/T330131 (Trokhymovych) Previously, we tested the TreeSHAP algorithm for Multilingual model explainability (from here: https://shap.readthedocs.io/en/latest/). It is supported by the tools pro...
[16:40:32] Logging off folks. Cu tomorrow
[17:11:43] Machine-Learning-Team: Upgrade the ml-staging-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T327767 (elukey) The net-istio controller complains about the absence of cluster-local-gateway: ` {"level":"error","ts":"2023-02-21T16:58:19.037Z","logger":"istiocontroller.istio-ingress-controller.kn...
[17:12:48] docker-registry.discovery.wmnet/knative-net-istio-controller:0.18.1-4
[17:12:53] * elukey cries in a corner
[17:12:55] and it works
[17:16:26] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/890883 should fix it
[17:45:35] ok the bug that Ilias found is fixed!
[17:47:53] going afk for today o/
[20:04:18] Machine-Learning-Team, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (colewhite)