[06:36:50] Good morning.
[06:39:49] good morning
[06:46:24] morning!
[06:47:49] isaranto: good morning I saw that you have scheduled a backport deployment for today
[06:48:52] moorning!
[06:50:20] yes!!
[06:58:28] isaranto: Let's do it together in a call
[06:59:10] ack!
[06:59:24] I'm going to ping you when it is time
[07:00:43] ty
[07:52:03] morning folks
[07:57:21] (PS4) AikoChou: edit-check: add metadata to model input base on env var [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1161574 (https://phabricator.wikimedia.org/T397013)
[08:00:38] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, and 4 others: [batch #3] Enable revertrisk filters in recent changes in multiple wikis - https://phabricator.wikimedia.org/T395824#10945980 (isarantopoulos)
[08:01:26] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, and 4 others: [batch #3] Enable revertrisk filters in recent changes in multiple wikis - https://phabricator.wikimedia.org/T395824#10945984 (isarantopoulos) The filters have been successfully deploye...
[08:08:21] hey folks!
[08:08:30] I'd need to test one thing in the staging cluster
[08:08:35] a knative option, registries-skipping-tag-resolving
[08:08:57] we are doing a hackathon this week to extend debmonitor for k8s
[08:09:15] but due to a peculiarity of knative, we end up in some trouble (more info in https://phabricator.wikimedia.org/T397696#10943466)
[08:09:54] I think that we don't need that functionality, we don't really use mutable tags in the docker registry (so we don't ever change a tag already pushed to point to another image)
[08:12:21] testing it now
[08:19:40] klausman: what should we do with signing key error, ref: https://gerrit.wikimedia.org/r/c/mediawiki/services/machinetranslation/+/1163291
[08:24:38] kart_: o/ I think that the issue is using an https:// prefix, it should be s3://
[08:24:41] see https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Deploy#How_to_deploy
[08:24:51] (what we use on stat100x)
[08:25:54] lemme test it
[08:26:06] cool. Thanks.
[08:29:24] ah no my bad, we do have https in our configs for ml
[08:29:26] very weird
[08:30:38] Should we remove https from URLs since `use_https = True` is set?
[08:33:21] mmm no in theory no
[08:35:31] Okay.
[08:35:51] Any more info we can get from -d debug flag?
[08:35:52] I am trying https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Deploy#How_to_deploy on stat1011 and it hangs
[08:36:04] that is strange (completely different creds)
[08:36:29] kart_: yes! A useful one
[08:36:30] DEBUG: ConnMan.get(): creating new connection: https://s3.amazonaws.com
[08:36:31] lol
[08:36:38] :D
[08:38:45] ah right permissions, okok
[08:38:52] I am not in ml-admins anymore
[08:38:59] very intuitive
[08:42:14] b'nHeaderMalformedThe authorization header is malforme'
[08:42:17] b"d; the region 'US' is wrong; expecting 'us-east-1' mmm have we ever tried these credentials before?
[08:51:12] No idea :/ klausman ?
[08:53:22] AFK for some time, will follow up once back.
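(Editor's sketch, not part of the channel log.) The debug line above shows s3cmd opening a connection to https://s3.amazonaws.com, and the "region 'US'" error is what one would expect if the client fell back to its built-in endpoint instead of the intended object store. A minimal way to check a config for that, assuming an s3cmd-style INI file with a [default] section and the keys host_base, host_bucket and access_key; the function name, the example hostname and the example account are hypothetical:

```python
import configparser
from typing import List

# Assumption: this is the endpoint s3cmd falls back to when host_base is unset,
# as suggested by the debug output in the log.
FALLBACK_HOST = "s3.amazonaws.com"


def check_s3cmd_config(text: str) -> List[str]:
    """Return human-readable warnings for an s3cmd-style INI config (hypothetical helper)."""
    # interpolation=None so values such as "%(bucket)s.example.org" parse as-is.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)

    if not parser.has_section("default"):
        return ["no [default] section: the client would use built-in defaults"]

    section = parser["default"]
    warnings = []
    for key in ("host_base", "host_bucket"):
        value = section.get(key, "").strip()
        if not value:
            warnings.append(f"{key} is not set: requests would go to {FALLBACK_HOST}")
        elif FALLBACK_HOST in value:
            warnings.append(f"{key} points at {FALLBACK_HOST}")
    if not section.get("access_key", "").strip():
        warnings.append("access_key is empty")
    return warnings


if __name__ == "__main__":
    # Hypothetical config: credentials present, but no endpoint configured.
    sample = "[default]\naccess_key = someaccount:prod\nuse_https = True\n"
    for line in check_s3cmd_config(sample):
        print("WARNING:", line)
```

This only inspects the file contents; it does not say which config file the client actually loaded, which was part of the confusion in the discussion.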
[08:56:55] mmm https://phabricator.wikimedia.org/T311628#8050691 came to mind
[08:57:01] wondering if data persistence did it
[08:57:03] * elukey asks
[09:09:22] o/ I'd like a review https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1161574 for tone check's model input whenever anyone has time :) thanks!
[09:15:12] (CR) Gkyziridis: [C:+1] "Thank you for working on this!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1161574 (https://phabricator.wikimedia.org/T397013) (owner: AikoChou)
[09:26:49] (sorry, was running an errand) No, I have not tried the credentials in any way except for what I showed in the comments on the change
[09:26:53] Let me dig a bit
[09:27:28] klausman: DP never ran https://phabricator.wikimedia.org/T311628#8050691, I fell in the same trap before
[09:27:37] lemme try to run the command to see if it works afterwards
[09:27:59] ah, so that's a required step to make an S3 account work?
[09:28:19] yeah
[09:35:33] aah!
[09:35:47] elukey: at least *I* have been missing the :prod suffix on the credential
[09:36:04] That at least changes the error to:
[09:36:10] ERROR: Bucket 'wmf-ml-models' does not exist
[09:37:04] I took a peek at /etc/s3cmd/cfg.d/ml-team.cfg and that has the :prod suffix
[09:38:15] yeah but IIRC Matthew created it without :prod
[09:38:19] but let's double check
[09:38:45] (I also tried a suffix of :ro, that has the same outcome as no suffix)
[09:41:03] The private repo does not have the :prod suffix (the ML credentials do). But I don't know if it conveys any meaning for Swift, or if it's just a convention
[09:44:06] nono it is our convention afaik
[09:50:09] tried to run the command but nothing changed mmm
[09:50:44] with or without :prod?
[09:51:02] (or put another way: what username does Swift think we should be using?)
[09:52:06] because hieradata/common/profile/thanos/swift.yaml has the :prod suffix
[09:52:13] (ofr MT)
[09:52:15] for8
[10:10:37] back sorry
[10:11:16] oh ok so in private we have the wrong one
[10:11:22] sigh
[10:33:44] o/ I have adapted Bartosz's work from T393865 and used it in the ml-pipelines repo. here is the MR:
[10:33:44] https://gitlab.wikimedia.org/repos/machine-learning/ml-pipelines/-/merge_requests/3
[10:33:44] bartosz: I wanted to add you as a reviewer for --^ but I am not sure whether you have an account on gitlab.wikimedia.org. if you do, please share your username. thanks!
[10:41:00] thanks for the review georgekyz. Bartosz nvm, we've merged the change.
[10:49:27] elukey: I can fix the private repo. But even if I use the :prod suffix on stat1011, I still don't see any buckets
[10:53:31] Both private repos have been fixed (:prod added)
[11:01:44] FIRING: LiftWingServiceErrorRate: ...
[11:01:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[11:02:11] Machine-Learning-Team, Patch-For-Review: Build model training pipeline using WMF ML Airflow instance - https://phabricator.wikimedia.org/T396495#10946678 (kevinbazira) I have adapted the pre-commit and ruff setup from T393865 (used in our LiftWing isvcs repo) to help us maintain consistent code style and...
[11:26:44] RESOLVED: LiftWingServiceErrorRate: ...
[11:26:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[11:34:48] klausman: should this be merged to test s3cmd? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1162900
[11:35:48] yes
[11:36:56] Let me await CI tests and I'll merge it.
[11:49:43] merged and applied to the clusters
[11:55:39] cool
[11:57:44] FIRING: LiftWingServiceErrorRate: ...
[11:57:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[11:57:55] klausman: should we try testing again?
[11:58:35] I doubt it would work, since manual testing from stat1011 still doesn't show any buckets.
[11:59:31] elukey: Do we need to rerun https://phabricator.wikimedia.org/T311628#8050691 this stuff but for the :prod account?
[12:27:44] RESOLVED: LiftWingServiceErrorRate: ...
[12:27:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[12:39:06] klausman: before changing stuff on other systems make sure that you sync with the service owner.. I pinged Matthew on DP earlier on, when you modify an account it will likely need a roll restart of the swift frontends on thanos
[12:40:19] Yes, I found that info on WT, and asked in o11y if roll-restarting is ok
[12:43:12] sure but the owner of the platform is data persistence :)
[12:43:24] have you already restarted the daemons?
[13:00:42] klausman: --^
[13:00:51] nope
[13:01:00] I am about to go into a meeting, so I haven't proceeded
[13:01:20] okok
[13:48:44] FIRING: LiftWingServiceErrorRate: ...
[13:48:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[14:16:34] (CR) AikoChou: [C:+2] "Thanks for the review! :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1161574 (https://phabricator.wikimedia.org/T397013) (owner: AikoChou)
[14:25:04] (Merged) jenkins-bot: edit-check: add metadata to model input base on env var [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1161574 (https://phabricator.wikimedia.org/T397013) (owner: AikoChou)
[14:59:19] I have silenced the above alert for a couple of days and added an action item to investigate this
[15:36:41] Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team, Wikipedia-Android-App-Backlog, WE4.2 Anti-abuse: Ensure all ORES i18n messages are available for wikis to add revert risk language agnostic filters to - https://phabricator.wikimedia.org/T395481#10947770 (Trizek-WMF) >>!...
[15:56:15] hi, i am from wikinews, i would like to request assistance with adopting the ml infrastructure from WMF for two purposes. one is to search the web and determine which events are current. another is for a given event, find appropriate URLs as sources, and identify 5Ws (who what where why when how) and write a paragraph summarizing this information. i previously tried doing this with chatgpt but it makes stuff up based on sources which it does not always list.
[15:56:21] and i think it would be better if the utility was not hosted externally. i'm happy to connect with audio or irc at a time that is convenient to you. please advise.
[15:56:52] unlike wikipedia, at wikinews the highly local events are acceptable, as are photo essays. the notability is 'if it is fresh and is not a personal promotion then it passes'.
[16:22:50] o/ gry thanks for reaching out! The above use cases sound interesting. I don't think we have something to support your use cases at the moment but I'd be happy to chat more about it. Let's find a time to connect, IRC or audio works for me. Sending you a DM about it!
[16:26:58] hi isaranto
[17:04:46] * isaranto afk!