[06:36:50] <ozge_>	 Good morning.
[06:39:49] <georgekyz>	 good morning 
[06:46:24] <bartosz>	 morning! 
[06:47:49] <georgekyz>	 isaranto: good morning I saw that you have scheduled a backport deployment for today
[06:48:52] <isaranto>	 morning!
[06:50:20] <isaranto>	 yes!!
[06:58:28] <georgekyz>	 isaranto: Let's do it together in a call
[06:59:10] <isaranto>	 ack!
[06:59:24] <isaranto>	 I'm going to ping you when it is time
[07:00:43] <georgekyz>	 ty
[07:52:03] <aiko>	 morning folks
[07:57:21] <wikibugs>	 (PS4) AikoChou: edit-check: add metadata to model input base on env var [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1161574 (https://phabricator.wikimedia.org/T397013)
[08:00:38] <wikibugs>	 Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, and 4 others: [batch #3] Enable revertrisk filters in recent changes in multiple wikis - https://phabricator.wikimedia.org/T395824#10945980 (isarantopoulos)
[08:01:26] <wikibugs>	 Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, and 4 others: [batch #3] Enable revertrisk filters in recent changes in multiple wikis - https://phabricator.wikimedia.org/T395824#10945984 (isarantopoulos) The filters have been successfully deploye...
[08:08:21] <elukey>	 hey folks!
[08:08:30] <elukey>	 I'd need to test one thing in the staging cluster
[08:08:35] <elukey>	 a knative option, registries-skipping-tag-resolving
[08:08:57] <elukey>	 we are doing a hackathon this week to extend debmonitor for k8s
[08:09:15] <elukey>	 but due to a peculiarity of knative, we run into some trouble (more info in https://phabricator.wikimedia.org/T397696#10943466)
[08:09:54] <elukey>	 I think that we don't need that functionality, we don't really use mutable tags in the docker registry (so we don't ever change a tag already pushed to point to another image)
[08:12:21] <elukey>	 testing it now
[08:19:40] <kart_>	 klausman: what should we do with signing key error, ref: https://gerrit.wikimedia.org/r/c/mediawiki/services/machinetranslation/+/1163291
[08:24:38] <elukey>	 kart_: o/ I think that the issue is using an https:// prefix, it should be s3://
[08:24:41] <elukey>	 see https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Deploy#How_to_deploy
[08:24:51] <elukey>	 (what we use on stat100x)
[08:25:54] <elukey>	 lemme test it
[08:26:06] <kart_>	 cool. Thanks.
[08:29:24] <elukey>	 ah no my bad, we do have https in our configs for ml
[08:29:26] <elukey>	 very weird
[08:30:38] <kart_>	 Should we remove https from the URLs, since `use_https = True` is set?
[08:33:21] <elukey>	 mmm no in theory no
[08:35:31] <kart_>	 Okay. 
[08:35:51] <kart_>	 Any more info we can get from -d debug flag?
[08:35:52] <elukey>	 I am trying https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Deploy#How_to_deploy on stat1011 and it hangs
[08:36:04] <elukey>	 that is strange (completely different creds)
[08:36:29] <elukey>	 kart_: yes! A useful one
[08:36:30] <elukey>	 DEBUG: ConnMan.get(): creating new connection: https://s3.amazonaws.com
[08:36:31] <elukey>	 lol
[08:36:38] <kart_>	 :D
[08:38:45] <elukey>	 ah right permissions, okok
[08:38:52] <elukey>	 I am not in ml-admins anymore
[08:38:59] <elukey>	 very intuitive
[08:42:14] <elukey>	          b'nHeaderMalformed</Code><Message>The authorization header is malforme'
[08:42:17] <elukey>	          b"d; the region 'US' is wrong; expecting 'us-east-1'</Message><Request"
[08:48:07] <elukey>	 mmm have we ever tried these credentials before?
[08:51:12] <kart_>	 No idea :/ klausman ?
[08:53:22] <kart_>	 AFK for some time, will follow up once back.
[08:56:55] <elukey>	 mmm https://phabricator.wikimedia.org/T311628#8050691 came to mind
[08:57:01] <elukey>	 wondering if data persistence did it
[08:57:03] * elukey asks
[09:09:22] <aiko>	 o/ I'd like a review https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1161574 for tone check's model input whenever anyone has time :) thanks!
[09:15:12] <wikibugs>	 (CR) Gkyziridis: [C:+1] "Thank you for working on this!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1161574 (https://phabricator.wikimedia.org/T397013) (owner: AikoChou)
[09:26:49] <klausman>	 (sorry, was running an errand) No, I have not tried the credentials in any way except for what I showed in the comments on the change
[09:26:53] <klausman>	 Let me dig a bit
[09:27:28] <elukey>	 klausman: DP never ran https://phabricator.wikimedia.org/T311628#8050691, I fell in the same trap before
[09:27:37] <elukey>	 lemme try to run the command to see if it works afterwards
[09:27:59] <klausman>	 ah, so that's a required step to make an S3 account work?
[09:28:19] <elukey>	 yeah
[09:35:33] <klausman>	 aah!
[09:35:47] <klausman>	 elukey: at least *I* have been missing the :prod suffix on the credential
[09:36:04] <klausman>	 That at least changes the error to:
[09:36:10] <klausman>	 ERROR: Bucket 'wmf-ml-models' does not exist
[09:37:04] <klausman>	 I took a peek at /etc/s3cmd/cfg.d/ml-team.cfg and that has the :prod suffix
[09:38:15] <elukey>	 yeah but IIRC Matthew created it without :prod
[09:38:19] <elukey>	 but let's double check
[09:38:45] <klausman>	 (I also tried a suffix of :ro, that has the same outcome as no suffix)
[09:41:03] <klausman>	 The private repo does not have the :prod suffix (the ML credentials do). But I don't know if it conveys any meaning for Swift, or if it's just a convention
[09:44:06] <elukey>	 nono it is our convention afaik
[09:50:09] <elukey>	 tried to run the command but nothing changed mmm
[09:50:44] <klausman>	 with or without :prod?
[09:51:02] <klausman>	 (or put another way: what username does Swift think we should be using?)
[09:52:06] <klausman>	 because hieradata/common/profile/thanos/swift.yaml has the :prod suffix
[09:52:13] <klausman>	 (for MT)
[10:10:37] <elukey>	 back sorry
[10:11:16] <elukey>	 oh ok so in private we have the wrong one
[10:11:22] <elukey>	 sigh
[10:33:44] <kevinbazira>	 o/ I have adapted Bartosz's work from T393865 and used it in the ml-pipelines repo. here is the MR:
[10:33:44] <kevinbazira>	 https://gitlab.wikimedia.org/repos/machine-learning/ml-pipelines/-/merge_requests/3
[10:33:44] <kevinbazira>	 bartosz: I wanted to add you as a reviewer for --^ but I am not sure whether you have an account on gitlab.wikimedia.org. if you do, please share your username. thanks!
[10:41:00] <kevinbazira>	 thanks for the review georgekyz. Bartosz nvm, we've merged the change.
[10:49:27] <klausman>	 elukey: I can fix the private repo. But even if I use the :prod suffix on stat1011, I still don't see any buckets
[10:53:31] <klausman>	 Both private repos have been fixed (:prod added)
[11:01:44] <jinxer-wm>	 FIRING: LiftWingServiceErrorRate: ...
[11:01:44] <jinxer-wm>	 LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[11:02:11] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Build model training pipeline using WMF ML Airflow instance - https://phabricator.wikimedia.org/T396495#10946678 (kevinbazira) I have adapted the pre-commit and ruff setup from T393865 (used in our LiftWing isvcs repo) to help us maintain consistent code style and...
[11:26:44] <jinxer-wm>	 RESOLVED: LiftWingServiceErrorRate: ...
[11:26:44] <jinxer-wm>	 LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[11:34:48] <kart_>	 klausman: should this be merged to test s3cmd? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1162900
[11:35:48] <klausman>	 yes
[11:36:56] <klausman>	 Let me await CI tests and I'll merge it.
[11:49:43] <klausman>	 merged and applied to the clusters
[11:55:39] <kart_>	 cool
[11:57:44] <jinxer-wm>	 FIRING: LiftWingServiceErrorRate: ...
[11:57:44] <jinxer-wm>	 LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[11:57:55] <kart_>	 klausman: should we try testing again?
[11:58:35] <klausman>	 I doubt it would work, since manual testing from stat1011 still doesn't show any buckets.
[11:59:31] <klausman>	 elukey: Do we need to rerun the steps from https://phabricator.wikimedia.org/T311628#8050691, but for the :prod account?
[12:27:44] <jinxer-wm>	 RESOLVED: LiftWingServiceErrorRate: ...
[12:27:44] <jinxer-wm>	 LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[12:39:06] <elukey>	 klausman: before changing stuff on other systems make sure that you sync with the service owner. I pinged Matthew on DP earlier on; when you modify an account it will likely need a roll restart of the swift frontends on thanos
[12:40:19] <klausman>	 Yes, I found that info on WT, and asked in o11y if roll-restarting is ok
[12:43:12] <elukey>	 sure but the owner of the platform is data persistence :)
[12:43:24] <elukey>	 have you already restarted the daemons?
[13:00:42] <elukey>	 klausman: --^
[13:00:51] <klausman>	 nope
[13:01:00] <klausman>	 I am about to go into a meeting, so I haven't proceeded
[13:01:20] <elukey>	 okok 
[13:48:44] <jinxer-wm>	 FIRING: LiftWingServiceErrorRate: ...
[13:48:44] <jinxer-wm>	 LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[14:16:34] <wikibugs>	 (CR) AikoChou: [C:+2] "Thanks for the review! :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1161574 (https://phabricator.wikimedia.org/T397013) (owner: AikoChou)
[14:25:04] <wikibugs>	 (Merged) jenkins-bot: edit-check: add metadata to model input base on env var [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1161574 (https://phabricator.wikimedia.org/T397013) (owner: AikoChou)
[14:59:19] <isaranto>	 I have silenced the above alert for a couple of days and added an action item to investigate this
[15:36:41] <wikibugs>	 Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team, Wikipedia-Android-App-Backlog, WE4.2 Anti-abuse: Ensure all ORES i18n messages are available for wikis to add revert risk language agnostic filters to - https://phabricator.wikimedia.org/T395481#10947770 (Trizek-WMF) >>!...
[15:56:15] <gry>	 hi, i am from wikinews, i would like to request assistance with adopting the ml infrastructure from WMF for two purposes. one is to search the web and determine which events are current. another is, for a given event, to find appropriate URLs as sources, identify the 5Ws (who what where why when how) and write a paragraph summarizing this information. i previously tried doing this with chatgpt but it makes stuff up based on sources which it does not always list. 
[15:56:21] <gry>	 and i think it would be better if the utility was not hosted externally. i'm happy to connect with audio or irc at a time that is convenient to you. please advise.
[15:56:52] <gry>	 unlike wikipedia, at wikinews the highly local events are acceptable, as are photo essays. the notability is 'if it is fresh and is not a personal promotion then it passes'.
[16:22:50] <isaranto>	 o/ gry thanks for reaching out! The above use cases sound interesting. I don't think we have something to support your use cases at the moment but I’d be happy to chat more about it. Let’s find a time to connect; IRC or audio works for me. Sending you a DM about it!
[16:26:58] <gry>	 hi isaranto
[17:04:46] * isaranto afk!