[07:36:12] Machine-Learning-Team: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin2002 for host ml-serve1006.eqiad.wmnet with OS bullseye
[07:53:24] Machine-Learning-Team: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin2002 for host ml-serve1007.eqiad.wmnet with OS bullseye
[07:58:51] Machine-Learning-Team, Data-Engineering, Data-Persistence, Infrastructure-Foundations, and 8 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Marostegui)
[08:18:58] Machine-Learning-Team: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin2002 for host ml-serve1007.eqiad.wmnet with OS bullseye
[08:19:08] Machine-Learning-Team: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin2002 for host ml-serve1007.eqiad.wmnet with OS bullseye executed with errors: - ml-serve1007 (**FAIL**) - Remove...
[08:38:51] Machine-Learning-Team: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin2002 for host ml-serve1006.eqiad.wmnet with OS bullseye completed: - ml-serve1006 (**PASS**) - Removed from Pupp...
[08:39:47] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 11th round of wikis - https://phabricator.wikimedia.org/T308136 (kevinbazira) The conclusion on the backtesting results is that most of the languages look fine besides: - hywiki with both a precision (0.74) and rec...
[08:42:44] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 11th round of wikis - https://phabricator.wikimedia.org/T308136 (kevinbazira)
[08:52:41] morning :)
[09:09:11] o/
[09:10:40] Machine-Learning-Team: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin2002 for host ml-serve1007.eqiad.wmnet with OS bullseye
[09:13:40] o/
[09:20:25] Machine-Learning-Team: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin2002 for host ml-serve1008.eqiad.wmnet with OS bullseye
[09:21:12] isaranto: kalimera :D
[09:26:55] buongiorno!
[09:31:13] Guten Tag :)
[09:42:32] 早安 :D
[09:43:02] lín-hó
[09:43:54] kevinbazira: Wasuze otya nno
[09:45:00] isaranto: lol ... bulungi ssebo :D
[09:48:09] Machine-Learning-Team: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (elukey) Open→Resolved
[09:48:12] Machine-Learning-Team: Upgrade ML clusters to Kubernetes 1.23 - https://phabricator.wikimedia.org/T324542 (elukey)
[09:49:01] Machine-Learning-Team: Fix feature to view old search results - https://phabricator.wikimedia.org/T329345 (elukey) Open→Resolved
[09:49:03] Machine-Learning-Team, Epic: WikiGPT Experiment - https://phabricator.wikimedia.org/T328494 (elukey)
[09:49:11] Machine-Learning-Team, Add-Link, Growth-Team: Kyrgyz Wikipedia model training pipeline failed - https://phabricator.wikimedia.org/T329817 (elukey) Open→Resolved
[09:49:13] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 10th round of wikis - https://phabricator.wikimedia.org/T308135 (elukey)
[09:49:19] Machine-Learning-Team, Add-Link, Growth-Team: Fix Armenian sentence tokenization bug in the link recommendation algorithm - https://phabricator.wikimedia.org/T327371 (elukey) Open→Resolved
[09:49:30] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), Patch-For-Review, User-notice: Deploy "add a link" to 9th round of wikis - https://phabricator.wikimedia.org/T308134 (elukey)
[09:49:34] Machine-Learning-Team: Move secret keys to constants in WikiGPT - https://phabricator.wikimedia.org/T329135 (elukey) Open→Resolved
[09:49:36] Machine-Learning-Team, Epic: WikiGPT Experiment - https://phabricator.wikimedia.org/T328494 (elukey)
[09:49:38] Machine-Learning-Team: Add unique URL to each answer in WikiGPT - https://phabricator.wikimedia.org/T329003 (elukey) Open→Resolved
[09:49:40] Machine-Learning-Team, Epic: WikiGPT Experiment - https://phabricator.wikimedia.org/T328494 (elukey)
[09:49:41] Wow, the TTS voice for Chinese is _very_ enthusiastic about saying good Morning :D
[09:49:44] Lift-Wing, Machine-Learning-Team, Epic: Lift Wing proof of concept - https://phabricator.wikimedia.org/T272917 (elukey) Open→Resolved
[09:49:51] Machine-Learning-Team: Upgrade ML clusters to Kubernetes 1.23 - https://phabricator.wikimedia.org/T324542 (elukey)
[09:49:54] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-staging-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T327767 (elukey) Open→Resolved
[09:50:00] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (elukey) Open→Resolved
[09:51:09] On Google Translate, I mean
[09:52:36] there is a lot of talk about whisper API (speech to text) introduced yesterday by OpenAI
[09:53:38] they created an api for a model released in september https://openai.com/research/whisper . people say it works really good.
[09:55:39] The examples are impressive
[10:00:49] isaranto: in theory https://gerrit.wikimedia.org/r/c/operations/puppet/+/893672 should fix the ORES logstash dashboard, so we'll see more UAs during the next days
[10:00:55] closer to what turnilo shows
[10:03:14] nice elukey: !
[10:04:59] missed it the first time :(
[10:07:13] klausman: I am running the provision cookbook for ml-serve100[678] (as suggested by Riccardo) to fix settings for boot
[10:07:27] (and maybe others)
[10:07:47] but after manual fixes in the bios they are all completing the reimage process
[10:08:19] Machine-Learning-Team: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin2002 for host ml-serve1007.eqiad.wmnet with OS bullseye completed: - ml-serve1007 (**PASS**) - Removed from Pupp...
[10:08:31] Neat. Was it any kind of expected behavior? Or how did things get broken?
[10:10:19] we have 10g NICs on the hosts and the first of them was set after disk for boot, meanwhile the correct one was the first 1g NIC
[10:10:50] after setting it DHCP + PXE etc.. worked
[10:10:52] Ah, so "drift" from hardware that wasn't common before?
[10:11:16] I think it was a misconfiguration made when the host was bootstrapped the first time
[10:11:30] but I don't have a ton of context on that side of the infra
[10:11:58] Gotta love all these "hidden" config bits that never rear their head in everyday operation
[10:21:29] Machine-Learning-Team: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin2002 for host ml-serve1008.eqiad.wmnet with OS bullseye completed: - ml-serve1008 (**PASS**) - Removed from Pupp...
[10:30:58] all ml-serve-eqiad hosts up with k8s 1.23, I'll deploy the model servers pod later on since I may need to run the provision cookbook more times (will wait for Riccardo's input)
[10:31:04] but basically we should be good :)
[10:42:00] klausman: I managed to get a bearer token and get a score from the api gateway, all good :)
[10:42:13] Nice :)
[10:43:17] I've been wondering whether it would be useful for debugging and the like to get some info in the response about what DC/Cluster and what pod answered a query, but I dunno if that would already expose too much info
[10:43:54] In an ideal world, you'd be able to get a full RPC trace (like Dapper) but we're not there (YET)
[10:45:11] definitely yes, the infra foundations team are planning to use the aux cluster for Jaeger
[10:45:16] at least IIUC
[10:45:52] Future Work™
[11:01:56] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Marostegui) @ayounsi @akosiaris @Joe to confirm, we are going to depool eqiad before this maintenance like we've done in codfw right?
[11:02:10] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Marostegui)
[11:05:17] klausman: I don't recall - what did we decide for the bearer token? Was it needed to enforce any kind of rate limit or only as initial measure for the launch?
[11:07:26] So the APIGW doesn't support using POST without a token
[11:08:37] Hugh mentioned that removing that requirement was planned, but not a high priority
[11:10:48] ah ok perfect
[11:10:53] I'll add it to the task
[11:11:40] As for the rate limits: some are in place, tied to tokens/identities (so people won't affect each other) as well as global limits per-service.
[11:29:18] * elukey lunch!
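(Editor's sketch for the rate-limit arrangement described at 11:11:40: one limit per token/identity so clients do not affect each other, plus a global limit per service. This is only an illustration using token buckets, which the log does not say the API Gateway uses; the class names, rates and burst sizes are all hypothetical.)

```python
import time


class TokenBucket:
    """Hypothetical token bucket: refills at `rate` tokens/s up to `capacity`."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full
        self.updated = now

    def refill(self, now):
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now


class ServiceLimiter:
    """One bucket per identity plus one shared bucket for the whole service."""

    def __init__(self, per_identity_rate, per_identity_burst,
                 global_rate, global_burst):
        self.per_identity_rate = per_identity_rate
        self.per_identity_burst = per_identity_burst
        self.global_bucket = TokenBucket(global_rate, global_burst)
        self.identity_buckets = {}

    def allow(self, identity, now=None):
        now = time.monotonic() if now is None else now
        bucket = self.identity_buckets.get(identity)
        if bucket is None:
            bucket = TokenBucket(self.per_identity_rate,
                                 self.per_identity_burst, now)
            self.identity_buckets[identity] = bucket
        bucket.refill(now)
        self.global_bucket.refill(now)
        # Consume only when both the identity and the service have budget.
        if bucket.tokens < 1 or self.global_bucket.tokens < 1:
            return False
        bucket.tokens -= 1
        self.global_bucket.tokens -= 1
        return True


limiter = ServiceLimiter(per_identity_rate=1, per_identity_burst=2,
                         global_rate=1, global_burst=3)
```

(With these made-up numbers, a single identity is cut off after its own burst of two requests, while the shared bucket caps the service at three requests in a burst no matter how many identities are calling.)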
[11:45:32] * klausman lunch as well
[12:12:34] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (akosiaris) >>! In T330165#8660042, @Marostegui wrote: > @ayounsi @akosiaris @Joe to confirm, we are going to depool eqiad before this m...
[12:55:12] Machine-Learning-Team, ORES, Wikimedia-production-error: PHP Notice: Trying to access array offset on value of type null (in SpecialORESModels) - https://phabricator.wikimedia.org/T329304 (jnuche) Still happening as of wmf.25 One such req from today: https://logstash.wikimedia.org/goto/137fa6559d46c...
[13:02:27] * isaranto lunch
[13:15:03] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (BTullis)
[13:18:30] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Marostegui) >>! In T330165#8660202, @akosiaris wrote: >>>! In T330165#8660042, @Marostegui wrote: >> @ayounsi @akosiaris @Joe to confir...
[14:02:30] Machine-Learning-Team: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (elukey) ` elukey@deploy1002:~$ httpbb --host inference.svc.eqiad.wmnet --https_port 30443 /srv/deployment/httpbb-tests/liftwing/test_liftwing_production.yaml Sending to inference.svc.eqiad.wmne...
[14:03:08] Machine-Learning-Team: Upgrade ML clusters to Kubernetes 1.23 - https://phabricator.wikimedia.org/T324542 (elukey) a:elukey
[14:03:23] Machine-Learning-Team: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (elukey) a:elukey
[14:06:25] Machine-Learning-Team, Data-Engineering-Planning, Event-Platform Value Stream: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (achou) @Isaac @Ottomata I dig a bit more into the event schema (https://schema.wikimedia.org/#!/) today and have some thoughts...
[14:40:58] I'll probably submit a patch for the endpoint I am working on either today or tomorrow
[14:41:14] otherwise it will be a huge patch for a review (it already is)
[14:41:36] but I have some basic functionality and unit tests in place so I will just note what parts need to be expanded
[14:55:53] nice yes makes sense!
[15:49:56] isaranto: https://github.com/dennistobar/serobot/blob/master/main.py#L70 is interesting, the code should be the SeroBot UA that we see hitting ORES
[15:50:14] it could be a good use case to be migrated to Lift Wing
[15:50:34] the only caveat is that at the moment the API Gateway requires a bearer token
[15:51:32] (so some basic auth) - once we figure that bit out, moving it to Lift Wing should be easy
[15:51:37] the repo seems also very active
[15:52:17] * elukey afk for a bit
[15:52:35] note taken
[15:52:53] I have an idea about tackling the bearer token issue
[15:54:05] nevermind I had something different in mind :P
[15:54:37] need to take a break I guess... but thanks Luca , will add this on the ticket
[16:03:09] Machine-Learning-Team, Data-Engineering-Planning, Event-Platform Value Stream: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (Ottomata) > Option 2 is more in line with the nature of the outlink topic model (link-based) since links change is the only ty...
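(Editor's sketch for the 15:50 discussion: what a bot like SeroBot would roughly need to change to send a scoring request through a gateway that requires a bearer token on POST. The base URL, path shape, model name, payload field and token below are placeholders, not the real Lift Wing or API Gateway values; the request is only built here, never sent.)

```python
import json
import urllib.request


def build_score_request(base_url, model, rev_id, token):
    """Build (but do not send) a POST scoring request with a bearer token.

    All arguments are illustrative; the real endpoint layout may differ.
    """
    url = f"{base_url}/models/{model}:predict"
    body = json.dumps({"rev_id": rev_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values only.
req = build_score_request(
    "https://gateway.example.org/lw/v1", "enwiki-goodfaith", 12345, "EXAMPLE_TOKEN"
)
```

(Sending it would be a matter of passing `req` to `urllib.request.urlopen`; per the 11:07 messages, the token requirement on POST was expected to be relaxed eventually, at which point the Authorization header could presumably be dropped.)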
[16:11:06] Machine-Learning-Team, Data-Engineering, Edit-Review-Improvements-Integrated-Filters, Event-Platform Value Stream, and 2 others: Integration of Revert Risk Scores to Recent Changes as a filter - https://phabricator.wikimedia.org/T329071 (Ottomata)
[16:20:03] Machine-Learning-Team, Data-Engineering, Event-Platform Value Stream, Research: Proposal: Create a stream end point for Revision Risk Model - https://phabricator.wikimedia.org/T326179 (Ottomata) Relevant: T328899#8661226 We should all sync up and work on some big standardized modeling design dec...
[16:20:20] Machine-Learning-Team, Data-Engineering, Event-Platform Value Stream, Research: Proposal: Create a stream end point for Revision Risk Model - https://phabricator.wikimedia.org/T326179 (Ottomata)
[16:26:15] Machine-Learning-Team, Data-Engineering, Event-Platform Value Stream, Research: Proposal: Create a stream end point for Revision Risk Model - https://phabricator.wikimedia.org/T326179 (diego) > We should all sync up and work on some big standardized modeling design decisions and ideas. It would b...
[16:29:04] Machine-Learning-Team, Data-Engineering, Event-Platform Value Stream, Research: Proposal: Create a stream end point for Revision Risk Model - https://phabricator.wikimedia.org/T326179 (Ottomata) +1
[16:42:14] Machine-Learning-Team: EnWiki Recent Changes Page no longer displays damaging filters - https://phabricator.wikimedia.org/T331045 (calbon)
[17:38:27] * elukey afk! o/