[04:34:06] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 18th round of wikis - https://phabricator.wikimedia.org/T308144 (kevinbazira) Model evaluation has been completed and below are the backtesting results: | | Precision@0.5 | Recall@0.5 |dewiki | 0.79 | 0.48 |enwiki |...
[04:34:45] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 18th round of wikis - https://phabricator.wikimedia.org/T308144 (kevinbazira)
[06:23:38] Good morning! ☀️ (although it is not sunny here :P )
[06:43:02] o/
[06:43:06] same thing in bologna :)
[06:48:06] Machine-Learning-Team: Create k8s ingress config and VIP for ores-legacy - https://phabricator.wikimedia.org/T336726 (elukey)
[06:57:53] I am deploying ores-legacy to production, to kick off the work for the k8s ingress etc..
[06:59:27] ack
[07:17:53] Machine-Learning-Team: Create k8s ingress config and VIP for ores-legacy - https://phabricator.wikimedia.org/T336726 (elukey)
[07:18:27] going to a doc apt, back in a bit!
[08:14:45] Machine-Learning-Team, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MoritzMuehlenhoff)
[08:16:39] Machine-Learning-Team, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MoritzMuehlenhoff)
[08:17:27] klausman: o/ but I don't see the revertrisk subdirectory in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/ml-services/ ?
[08:17:34] elukey: \o
[08:18:06] oops!
[08:18:33] making a patch :)
[08:19:15] Machine-Learning-Team, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MoritzMuehlenhoff)
[08:24:11] Machine-Learning-Team, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MoritzMuehlenhoff)
[08:31:21] aiko: So about the S3 buckets, on the experiment side we have rr-ml rr-wd and plain rr (in s3://wmf-ml-models/experimental/). But on the new path you mentioned in T333124, there is only rr-la and rr-ml. Which one is the one to use for the base rr model in prod?
[08:32:07] (I presume the rr-ml service uses the ml-rr s3 bucket, but what about the rr model? Or do we only one the rr-ml service in prod?
[08:32:10] )
[08:36:07] klausman: rr-la is the plain rr (version 1)
[08:37:03] alright, thanks
[08:43:06] aiko: should I also bump the docker image versions? The experimental config still has them at 2023-03-08-093615-publish (la) and 2023-03-20-162345-publish (ml)
[08:43:55] But there are newer versions listed on https://docker-registry.wikimedia.org/wikimedia/machinelearning-liftwing-inference-services-revertrisk,-multilingual/tags/ (and the base image)
[08:54:31] Machine-Learning-Team, ORES, Advanced-Search, All-and-every-Wikisource, and 65 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (TheresNoTime)
[09:04:13] klausman: no need to bump the docker image versions coz there is no change to rr-la and rr-ml in the newer versions
[09:04:20] ack
[09:04:29] Now I just need to figure out a YAML error :-/
[09:10:38] There we go.
[09:14:16] elukey: I'll also handle the codfw switch maintenance stuff (ORES depool/pool)
[09:26:06] Machine-Learning-Team, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (hnowlan)
[09:35:51] klausman: nice thanks!
[09:37:21] (more than an hour of delay for the doc appt, very nice)
[09:37:32] The Waiting Room Blues
[09:38:18] I am _so_ glad these days one can at least read on the phone. I wouldn't touch waiting room magazines with a ten-foot pole.
[09:38:58] Once, during Covid(!), I saw a woman page through a magazine and licking her finger for page turning. Gave me the chills.
[09:49:48] isn't it amazing that most magazines are old (issues from years ago) - at least that's what I see here oh and a new one here and there
[09:53:11] klausman: I merged Aiko's change and now https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/920208 needs a rebase, sorry
[09:53:14] should be a quick one
[09:53:29] aiko: o/ you can deploy the revertrisk-wikidata changes to staging if you want
[09:53:36] It's fine
[09:53:42] I was well aware of that :)
[10:05:20] I just created the TLS certs for ores-legacy.discovery.wmnet
[10:05:45] going to deploy ores-legacy to prod and see if it now works fine
[10:05:57] after that we should be able to create the ingress VIP as well
[10:07:56] klausman: the certs is deployed by the mesh module configuration
[10:08:00] so we have
[10:08:12] 1) envoy running a tls proxy in front of uvicorn
[10:08:34] 2) envoy running as proxy to call lift wing (uvicorn will see it as localhost:etc.. call)
[10:08:47] basically the same as we have in wikikube, no istio injection of any sort
[10:08:59] Very nice.
[10:09:15] I presume we still have the forwarded-for info in the headers?
[10:10:39] in theory yes but I didn't check, we'll discover it when setting up the access logging
[10:12:07] ack
[10:12:49] Do we have anything that consumes ORES (legacy) access logs beyond just logstash? I mean, anything that actually does anything programmatic/analytic with it?
[10:13:59] not really, what we can do is to create a kibana dashboard like the ORES one
[10:14:06] with breakdowns etc..
[10:14:08] should be sufficient
[10:14:37] yeah, sounds good. I was just wondering if we need to consider such use cases when we start looking at the non-ORES access logs
[10:15:51] if we send access logs in json format logstash should be able to parse them natively, and then building a dashboard should be quick
[10:16:08] same thing for kserve (but we'd need to wait for kserve 0.11)
[10:16:13] sgtm
[10:26:37] * klausman lunch and groceries
[10:30:03] Machine-Learning-Team, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in codfw: cod...
[10:46:41] Machine-Learning-Team, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in codfw: cod...
[10:54:22] ok I think I have all code reviews ready for the prod endpoint
[11:04:43] Machine-Learning-Team, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (BTullis)
[11:21:41] wow, great job!
[11:40:11] my reviews won't add much value so I'm just observing 👀
[11:40:16] * isaranto afk lunch
[11:59:36] * elukey lunch
[12:17:41] Machine-Learning-Team, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ssingh)
[12:19:01] Machine-Learning-Team, Data-Engineering, Event-Platform Value Stream: Create new mediawiki.page_links_change stream based on fragment/mediawiki/state/change/page - https://phabricator.wikimedia.org/T331399 (Ottomata)
[12:49:09] Machine-Learning-Team, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3a841f97-aecd-4c7a-8eb4-8acd1caa15b3) set by ayounsi@cumin1001 for 2:...
[12:53:15] Machine-Learning-Team, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MatthewVernon)
[12:54:07] Machine-Learning-Team, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MoritzMuehlenhoff)
[13:28:01] Machine-Learning-Team, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MoritzMuehlenhoff)
[13:39:17] elukey: switch upgrade went fine with no issues
[13:54:34] Machine-Learning-Team, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: codfw...
[14:10:23] Machine-Learning-Team, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: codfw...
[14:24:47] Machine-Learning-Team, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (herron)
[14:26:36] Machine-Learning-Team, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ayounsi) Open→Resolved a:ayounsi Upgrade went very well. Thanks everybody! That was the last one!
[14:32:31] Machine-Learning-Team, Epic: Experiment with GPUs in the Machine Learning infrastructure - https://phabricator.wikimedia.org/T333462 (elukey)
[16:20:58] * elukey afk!
[16:31:03] Machine-Learning-Team, Epic: WikiGPT Experiment - https://phabricator.wikimedia.org/T328494 (isarantopoulos) If we take on this project in the future I think that the two main building blocks are the following: - Semantic search using embeddings: retrieving the most relevant articles to the search qu...
[16:34:27] since I worked on this topic this week, I wrote a summary on the closed WikiGPT task on some takeaways of things that can be done in the future (things regardless of wikigpt) https://phabricator.wikimedia.org/T328494#8856293
[16:50:57] revertrisk-wikidata has been deployed to staging
[16:51:06] aikochou@deploy1002:~/rrr$ kubectl get pods
[16:51:06] NAME READY STATUS RESTARTS AGE
[16:51:06] revertrisk-wikidata-predictor-default-00001-deployment-6c5hn98g 3/3 Running 0 13m
[16:54:14] but when I tested it via curl, I got error "curl: (7) Failed to connect to inference-staging.svc.codfw.wmnet port 30443: No route to host"
[16:55:59] very weird.. also tested revertrisk-multilingual in staging, got the same error
[16:56:05] it was working before
[16:58:24] gonna look into this issue tomorrow :)