[00:05:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[00:05:49] Deployment reference-need-predictor-00010-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00010-deployment - ...
[00:05:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[04:24:43] FIRING: LiftWingServiceErrorRate: ...
[04:24:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-reverted&var-backend=viwiki-reverted-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[04:34:43] RESOLVED: LiftWingServiceErrorRate: ...
[04:34:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-reverted&var-backend=viwiki-reverted-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[06:47:18] Lift-Wing, Machine-Learning-Team: [onboarding] Improving language agnostic articlequality model + service - https://phabricator.wikimedia.org/T391679#10746401 (OKarakaya-WMF)
[06:47:39] Lift-Wing, Machine-Learning-Team: [onboarding] Improving language agnostic articlequality model + service - https://phabricator.wikimedia.org/T391679#10746402 (OKarakaya-WMF)
[07:06:10] hey folks, as FYI ml-serve2007 was down, OEM error registered (DIMM issue afaics)
[07:06:21] after a powercycle it worked fine, let's keep it monitored
[07:14:17] hello!
[07:14:29] ack Luca!
[07:19:54] Good morning.
[07:23:17] Good morning folks
[07:30:22] hi George! welcome back!
[08:05:38] Good morning, everyone!
[08:40:48] Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team, Wikipedia-Android-App-Backlog: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298#10746593 (matej_suchanek) In my opinion, "unlikely to be reverted" is unlikely to be of any...
[09:11:02] klausman: o/ ok if I do https://gerrit.wikimedia.org/r/c/operations/puppet/+/1136728 ?
[09:11:33] \o Yes and thank you!
[09:13:22] np! Very easy since it will not involve pybal restarts, will do 1002 too
[09:13:51] Yeah, I considered starting with the workers this week, but given the five day(!) weekend, I'll defer until after Easter
[09:19:38] klausman: my2c - let's take this time to plan it properly, if I am needed or not etc.. the execution time can happen anytime
[09:20:13] ack! I would also coordinate with traffic on how we can minimize the number of pybal restarts
[09:21:06] there is also the possibility of depooling eqiad from the liftwing's lvs service, do some reimages, fix pybal, repool
[10:04:21] ml-serve-ctrl1001 done
[10:31:11] * isaranto afk lunch
[11:12:45] (PS1) Gkyziridis: inference-services: edit-check. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1136981 (https://phabricator.wikimedia.org/T386100)
[11:48:21] * klausman lunch
[12:42:26] Machine-Learning-Team, EditCheck, Editing-team (Tracking): Try SHAP values to highlight peacock words - https://phabricator.wikimedia.org/T387984#10747334 (achou)
[12:42:42] Machine-Learning-Team, EditCheck, Editing-team (Tracking): Use SHAP values to highlight peacock words - https://phabricator.wikimedia.org/T387984#10747336 (achou)
[12:59:19] Lift-Wing, Machine-Learning-Team: [onboarding] Improving language agnostic articlequality model + service - https://phabricator.wikimedia.org/T391679#10747446 (OKarakaya-WMF) I've experimented with some new features: - len_images: number of images in the article. - title: title of the article as text. -...
[13:20:51] Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team, Wikipedia-Android-App-Backlog: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298#10747518 (Kgraessle) >>! In T348298#10746593, @matej_suchanek wrote: > In my opinion, "unlik...
[14:25:01] Machine-Learning-Team, EditCheck, Editing-team (Tracking): Use SHAP values to highlight peacock words - https://phabricator.wikimedia.org/T387984#10748029 (achou) Peacock check service with SHAP value support has been deployed to Lift Wing. You can enable it by adding the parameter `return_shap_value...
[15:56:21] * isaranto afk!
[16:29:12] (PS2) Gkyziridis: inference-services: edit-check. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1136981 (https://phabricator.wikimedia.org/T386100)
[16:55:34] Machine-Learning-Team, EditCheck, Editing-team (Tracking): Use SHAP values to highlight peacock words - https://phabricator.wikimedia.org/T387984#10748905 (achou) ### Load Testing We want to compare the latency between requests with and without returning SHAP values. ##### Test #1 * return_shap_va...
[17:01:23] ---^ shap value does impact the latency greatly