[06:35:09] hello folks!
[07:00:16] Preview of the SLO dashboards:
[07:00:16] https://grafana.wikimedia.org/dashboard/snapshot/T4ptCPjLXvIQLDXBXnIrqvauBLB5hYiz?orgId=1
[07:00:19] https://grafana.wikimedia.org/dashboard/snapshot/UNxvvxML5luo6MdCfiVdHcRbPZsVuaXa?orgId=1
[07:06:33] hello! it looks nice!
[07:09:17] Machine-Learning-Team, Patch-For-Review: Define SLI/SLO for Lift Wing - https://phabricator.wikimedia.org/T327620 (elukey) Used https://wikitech.wikimedia.org/wiki/Grafana#Previewing_the_Change_('grr_preview') to get a preview of the dashboards in https://gerrit.wikimedia.org/r/c/operations/grafana-grizz...
[07:11:26] the idea is to have an SLO for each k8s namespace, basically
[07:12:05] it is a tradeoff, it is a good granularity but for example we don't distinguish between model servers in the revert risk namespace
[07:12:08] maybe we should
[07:14:13] you mean lang-agnostic vs multilingual? in this case yeah we should since they are completely different services (due to model size). For the others, SLOs broken down per namespace sound great to me
[07:17:18] yep exactly, and soon-ish also wikidata
[07:17:43] ack
[07:54:34] FYI, ml-etcd1003 will briefly go down for a Ganeti node reboot
[08:05:26] it's back
[08:20:29] isaranto: this is the new version - https://grafana.wikimedia.org/dashboard/snapshot/r1qBIayOUWtsE3RNBrdUHkj75ZPF6Oiy?orgId=1
[08:20:45] we have the "revertrisk-blabla-predictor-default" as name
[08:21:00] but it is the best we can do with all the prometheus labels
[08:21:09] does it look ok?
[08:24:32] Looks nice! I'll take a look at the docs as well and let you know if I have any comments
[08:30:25] ores2008 doesn't seem to be coming up after the reboot
[08:30:32] ah no it was just slow
[08:31:13] codfw reboots for ores are almost done (new kernels)
[08:31:21] will also start the eqiad ones
[08:49:40] I still can't connect to ores2008, though?
[08:50:06] did you connect via the serial console, might be some issue with the NIC?
[08:52:34] moritzm: sorry I was afk, yeah I can't ping, but I see the interface up
[08:52:38] will check again
[08:52:45] (tried to do a second reboot)
[08:54:06] it might be that the NIC connector or cable died (or was dead and it only showed upon reboot), I've run into that a few times before
[08:54:38] we could also just depool 2008 and have DC ops have a look
[08:57:01] moritzm: never happened to see the cable/connector failing upon reboot, but the host is veeeery old sigh
[08:57:57] moritzm: it is weird though that ethtool returns link detected
[08:59:27] anyway, depooling and opening a task
[08:59:56] ack, sounds good
[09:01:51] Machine-Learning-Team, ops-codfw: Check ores2008's cable - https://phabricator.wikimedia.org/T345233 (elukey)
[09:02:27] ores is sensing that we are trying to get rid of it
[09:03:01] isaranto: (just for my own mental planning) - do we have an idea about when the ores extension will migrate to Lift Wing?
[09:03:37] Lol
[09:03:56] (laughing about the ores sensing not the mental planning :))
[09:05:02] If everything is alright we can deploy the threshold changes to the rest of the wikis this week. And starting next week, deploy liftwing gradually again
[09:05:26] Unless something wonderful pops up
[09:05:56] Machine-Learning-Team: Future support for ores scores in RC API - https://phabricator.wikimedia.org/T343813 (elukey) Open→Resolved More info - the ML team is not going to touch the RC API for the moment, but we are not able to add more model (like revert risk etc..) scores to it. This is something th...
[09:06:03] super
[09:06:14] At the moment we don't seem to have a blocker but I'm trying to figure out/think of a way to have some more consistent checking
[09:06:22] by the end of next week I hope that WME will migrate to Lift Wing too
[09:06:37] after that we would be probably ready to turn off revision-score
[09:06:44] and see what's left
[09:10:52] Machine-Learning-Team, Patch-For-Review: Define SLI/SLO for Lift Wing - https://phabricator.wikimedia.org/T327620 (elukey) After some brainbounce with Ilias we decided to have separate SLOs for the revertrisk namespace (so differentiate between agnostic/multilingual/etc..). I used the `destination_canon...
[09:45:49] * isaranto lunch!
[09:52:09] manually testing container-concurrency-target-percentage: "85" on eqiad
[09:54:21] ack
[10:03:08] I am also wondering if we want to move to rps instead of concurrency as metric
[10:03:21] will read some stuff
[10:04:15] (I recall that Ilias pointed out that rps is more "friendly", especially checking our istio graphs)
[10:13:58] going out for lunch!
[11:16:53] Machine-Learning-Team, Item Quality Evaluator, Wikidata, Wikidata Dev Team, and 2 others: Update API calls from ORES to Lift Wing - https://phabricator.wikimedia.org/T343731 (noarave) a:noarave
[11:37:39] isaranto: I logged off yesterday, is everything okay now?
[11:40:26] Machine-Learning-Team, Item Quality Evaluator, Wikidata, Wikidata Dev Team, and 2 others: Update API calls from ORES to Lift Wing - https://phabricator.wikimedia.org/T343731 (isarantopoulos) I have opened a [[ https://github.com/wmde/wikidata-constraints-violation-checker/pull/37 | Pull Request ]...
[11:40:45] Amir1: I have figured out that some of the thresholds are wrong as they need to be subtracted from 1 (1 - $threshold). give me 5'-10' and I'll open a patch with a description and we can discuss it over there
[11:41:01] awesome
[11:43:24] every time I dig a bit deeper I find something else :)
[11:44:33] Machine-Learning-Team, Item Quality Evaluator, Wikidata, Wikidata Dev Team, and 2 others: Update API calls from ORES to Lift Wing - https://phabricator.wikimedia.org/T343731 (ItamarWMDE) @isarantopoulos Thank you! We were actually just looking at the codebases, we will have a look at your PR as w...
[12:05:33] Amir1: I filed a fix! You can find the explanation in the commit message https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/953590
[12:06:46] thanks
[12:07:00] I checked the results of running `var_dump(ORES\Services\ORESServices::getThresholdLookup()->getThresholds( models='goodfaith', cache=false));` by turning cache on and off I could cross-check how the threshold responses were transformed
[12:07:11] while the ones we deployed yesterday are wrong...
[12:07:13] When should we deploy this
[12:07:32] whenever you have time, starting from now :)
[12:08:15] I'll also change the big patch with the correct thresholds + look into the Special:OresModels page
[12:10:49] artificial-intelligence, Growth-Team, Technical-Tool-Request: Reputation system for Wikipedia editors - https://phabricator.wikimedia.org/T223581 (kostajh)
[12:19:09] Good morning all
[12:19:16] Hey Amir1
[12:19:25] good morning!
[12:20:57] morning Chris!
[12:22:35] o/
[12:26:13] Machine-Learning-Team: Model monitoring - https://phabricator.wikimedia.org/T344819 (elukey) +1 I like the idea! I'd avoid the push gateway if possible, we could try to use a simple prometheus exporter for this job (maybe there is a way to expose metrics via Kserve/fastapi).
[12:29:10] if anybody is up for a chat about https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/953578 lemme know!
[12:36:20] I'm interested but at the moment I am focused on checking the extension deployments
[12:38:45] ack!
[12:45:22] isaranto: deployed now
[12:46:27] super duper thanks! I checked and the values have been updated. it should be fine now
[12:46:57] awesome. Let me know once you're ready to move to the rest of wikis
[12:48:37] I am updating the patch, will double check the values against the ones I can find in mwmaint and ping you!
[13:18:38] Machine-Learning-Team, Add-Link, Growth-Team: Investigate why the add-a-link training pipeline concludes with missing datasets - https://phabricator.wikimedia.org/T344832 (kevinbazira) Open→Resolved The model training pipeline has been fixed and it now generates all the expected datasets that...
[13:18:47] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), Patch-For-Review: Automate unpublishing of add-a-link datasets - https://phabricator.wikimedia.org/T344799 (kevinbazira)
[13:38:06] Machine-Learning-Team, Patch-For-Review, Research (FY2023-24-Research-July-September): Deploy multilingual readability model to LiftWing - https://phabricator.wikimedia.org/T334182 (elukey) @MGerlach I added a step in https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Hosting_stages_for_a_...
[14:03:41] Machine-Learning-Team, SRE, ops-codfw: Check ores2008's cable - https://phabricator.wikimedia.org/T345233 (Jhancock.wm) @elukey We reseated the server and switch side of the patch. Looks like it might be the SFP. I've swapped it and the server's pinging. I'm gonna close this for now but please reopen...
[14:04:24] Machine-Learning-Team, SRE, ops-codfw: Check ores2008's cable - https://phabricator.wikimedia.org/T345233 (Jhancock.wm) Open→Resolved a:Jhancock.wm
[14:59:27] Machine-Learning-Team, SRE, ops-codfw: Check ores2008's cable - https://phabricator.wikimedia.org/T345233 (elukey) @Jhancock.wm thanks a lot!
[14:59:36] ores2008 is back in service
[14:59:44] it was the SFP cable :(
[15:13:48] Amir1: the patch is ready for all the wikis -> https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/948542
[15:13:48] I have manually checked some of them and they are correct
[15:14:22] I leave it up2u when you want to deploy it, just let me know so I can monitor
[15:15:35] one thing is that this is not fixed yet https://it.wikipedia.org/wiki/Speciale:ORESModels, I am working on it
[15:28:00] hm when I run it locally it runs. Will do some more digging on it tomorrow. I am going afk. ping me if you are going to deploy!
[15:28:01] o/
[15:46:47] going afk as well!
[15:46:48] o/
[15:58:42] Machine-Learning-Team, SRE, ops-codfw: Check ores2008's cable - https://phabricator.wikimedia.org/T345233 (calbon) @Jhancock.wm Thanks!
[16:22:19] isaranto: hi, sorry I was afk, I can deploy it now if you're okay with it
[16:23:29] Yes I am ok!
[16:24:13] cool
[16:37:13] isaranto: deployed
[16:38:48] \o/
[16:38:54] 🤞
[16:50:04] I am doing some checks, starting with enwiki all numerical values are exactly the same. There is however one difference in goodfaith thresholds. It has to do with the order:
[16:50:04] https://phabricator.wikimedia.org/P52115
[16:52:46] the order doesn't matter there
[16:54:30] I checked other wikis, you are right
[16:54:45] all good as far as I can tell! Thank you Amir1!
[16:55:05] awesome. Tomorrow turn them back on again on some wikis?
[16:55:45] yes! I'll prepare the patch and ping you again
[22:40:49] Machine-Learning-Team, MediaWiki-extensions-ORES, Wikimedia-production-error: MWException: Default '"soft"' is invalid for preference oresDamagingPref of user نعمان حمداوي - https://phabricator.wikimedia.org/T345305 (Krinkle)
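
Note on the threshold fix discussed at 11:40:45 and 12:05:33: the log only says that some thresholds "need to be subtracted from 1". A minimal Python sketch of that kind of transformation is below. It is not the ORES extension's code or the mediawiki-config patch. The function name, the dictionary layout, and the sample filter names and values are all hypothetical. One plausible reason for such a flip, stated here as an assumption, is that for a two-class model P(false) = 1 - P(true), so a cutoff expressed against one class becomes 1 - cutoff against the other; the log does not confirm that this is the cause.

```python
from typing import Dict, Set


def complement_thresholds(
    thresholds: Dict[str, float],
    needs_complement: Set[str],
) -> Dict[str, float]:
    """Return a copy of `thresholds` where selected entries become 1 - value.

    `thresholds` maps a (hypothetical) filter name to a probability cutoff.
    `needs_complement` names the entries that were expressed against the
    opposite class and therefore have to be flipped.
    """
    result: Dict[str, float] = {}
    for name, value in thresholds.items():
        # A probability cutoff outside [0, 1] indicates bad input data.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"threshold {name!r} out of range: {value}")
        if name in needs_complement:
            # Round to avoid float noise such as 0.30000000000000004.
            result[name] = round(1.0 - value, 6)
        else:
            result[name] = value
    return result


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    sample = {"likelygood": 0.25, "likelybad": 0.75}
    print(complement_thresholds(sample, needs_complement={"likelygood"}))
```

Keeping the set of names to flip as an explicit argument makes it easy to compare the untouched and transformed values side by side, similar in spirit to the cache on/off cross-check described at 12:07:00.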