[03:31:00] Machine-Learning-Team: Add unique URL to each answer in WikiGPT - https://phabricator.wikimedia.org/T329003 (kevinbazira)
[07:18:50] elukey: o/ I'd like to help in testing for streams as well! :)
[07:59:44] Machine-Learning-Team: Add more articles to he corpus (instead of 1) - https://phabricator.wikimedia.org/T329016 (isarantopoulos)
[08:00:00] Machine-Learning-Team: Add more articles to the corpus (instead of 1) - https://phabricator.wikimedia.org/T329016 (isarantopoulos)
[08:00:47] Machine-Learning-Team: [WikiGPT] Add more articles to the corpus (instead of 1) - https://phabricator.wikimedia.org/T329016 (isarantopoulos)
[08:29:45] hi folks!
[08:29:53] aiko, isaranto thanks!
[08:34:42] Machine-Learning-Team, DBA, Data-Engineering-Planning, Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (ayounsi) Staging the new version on the switches: `asw-a-codfw> request system software add force-host set [ /var/tmp/jinstall-ex-4300-...
[08:35:10] Machine-Learning-Team, Patch-For-Review: Upgrade ml clusters to kserve 0.9 - https://phabricator.wikimedia.org/T325528 (elukey) All models deployed except revert-risk, but it is a special set up that needs a bit more time. For the purpose of the upgrade everything is done!
[08:38:23] elukey: ping us anytime
[08:44:40] isaranto: o/ my idea is that I'll start with the first stream, and other folks will add more (so they'll be able to see the whole pipeline)
[08:45:57] ack
[09:00:46] I am currently writing some docs on Wikitech with the steps for any ML team member to test/add a new stream in Change-prop
[09:31:57] super helpful, thanks!
[09:39:24] https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Streams_(Admins_only,_Machine_Learning_team)
[09:39:27] there you go
[09:42:56] Machine-Learning-Team, Data-Engineering, Event-Platform Value Stream: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (elukey) Created some docs to implement and test the new stream in https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Streams_...
[10:32:47] Lift-Wing, Machine-Learning-Team: Explore ingress filtering for Lift Wing - https://phabricator.wikimedia.org/T300259 (elukey) Added the rate limits panel to [[ https://grafana-rw.wikimedia.org/d/zsdYRV7Vk/istio-sidecar?forceLogin&from=now-6h&orgId=1&to=now&var-backend=All&var-cluster=codfw%20prometheus%...
[10:35:31] elukey: o/ had a look at the docs. really nice, thanks! One q - the liftwing.test-events can be reused for testing for any new streams that listen to mediawiki.revision-create, right?
[10:36:29] aiko: yes yes it can, the only caveat (now that I think about it) is that multiple changeprop rules will likely be triggered, so you may end up seeing multiple http requests
[10:37:05] the alternative is that we keep a few staging configs, and we remove the old ones if not needed
[10:38:48] yeah that sounds reasonable
[10:39:20] removing old ones that we've tested
[10:42:17] elukey: I tried to use kafkacat -C to see what's in liftwing.test-events and found out I need to add "staging." before the topic name so it can work
[10:42:24] Machine-Learning-Team: Create repository for WikiGPT - https://phabricator.wikimedia.org/T329028 (isarantopoulos)
[10:42:31] -t staging.liftwing.test-events
[10:42:49] ah yes good point! I forgot about it, can you add it to the notes?
[10:43:10] ok :)
[10:43:15] basically changeprop uses either "(eqiad|codfw)." for prod, and "staging." for staging
[10:43:18] just came to mind
[10:45:47] Machine-Learning-Team: Create repository for WikiGPT - https://phabricator.wikimedia.org/T329028 (Aklapper) Related, are there also plans to create a dedicated Phabricator project tag for this codebase?
[10:47:38] done :)
[10:50:51] <3
[11:08:13] Machine-Learning-Team: Upgrade the inference-services repo codebase to kserve 0.10 (fastapi) - https://phabricator.wikimedia.org/T329032 (elukey)
[11:08:40] created also the task to upgrade inference-services to kserve 0.10
[11:11:26] Machine-Learning-Team: Create repository for WikiGPT - https://phabricator.wikimedia.org/T329028 (isarantopoulos) Created the repository https://gitlab.wikimedia.org/toolforge-repos/wiki-gpt through toolforge and added the code that is checked out (everything except the secrets and the database). For now if...
[11:18:17] I created the repo and added the team's group as maintainers (for some reason couldn't add it as owners) lemme know if u have access
[11:21:07] Machine-Learning-Team: Create repository for WikiGPT - https://phabricator.wikimedia.org/T329028 (kevinbazira) Thank you for working on this, @isarantopoulos!
[11:22:04] isaranto: I was able to clone it
[11:23:12] sry, I mostly meant whether you have access to repo settings etc (as maintainer). otherwise the repo is public
[11:31:37] Machine-Learning-Team, DBA, Data-Engineering-Planning, Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (jbond)
[11:35:27] isaranto: yep yep I have access to them, just registered my 2fa etc.. (didn't do it before)
[11:38:44] Machine-Learning-Team: Upgrade the inference-services repo codebase to kserve 0.10 (fastapi) - https://phabricator.wikimedia.org/T329032 (achou) @elukey Do these model servers also need to be upgraded to bullseye and python 3.9?
[12:00:14] * elukey lunch!
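A minimal sketch of the topic-prefix rule from the 10:42-10:43 exchange above, where changeprop is described as prefixing Kafka topics with "eqiad." or "codfw." in production and with "staging." in staging. The helper names (prefixed_topic, kafkacat_consume_args) are hypothetical and not part of changeprop or kafkacat. Only the -C and -t flags come from the chat; the broker option is left out because the log does not name a broker.

```python
# Hypothetical helpers illustrating the prefix rule described in the chat.
# Assumption: the prefix is simply prepended to the bare topic name.
PREFIXES = {
    "eqiad": "eqiad.",      # prod, per the "(eqiad|codfw)." remark
    "codfw": "codfw.",      # prod, per the "(eqiad|codfw)." remark
    "staging": "staging.",  # staging
}


def prefixed_topic(topic: str, environment: str) -> str:
    """Return the topic name with the prefix for the given environment."""
    if environment not in PREFIXES:
        raise ValueError(f"unknown environment: {environment!r}")
    return PREFIXES[environment] + topic


def kafkacat_consume_args(topic: str, environment: str) -> list[str]:
    """Build a consumer-mode kafkacat argument list (broker option omitted)."""
    return ["kafkacat", "-C", "-t", prefixed_topic(topic, environment)]


if __name__ == "__main__":
    print(" ".join(kafkacat_consume_args("liftwing.test-events", "staging")))
```

Keeping the prefixes in one mapping means a test topic reused across environments only needs the environment name changed, which matches the "-t staging.liftwing.test-events" correction in the chat.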
[12:01:32] Machine-Learning-Team: Upgrade the inference-services repo codebase to kserve 0.10 (fastapi) - https://phabricator.wikimedia.org/T329032 (elukey) @achou yep yep, there should be some tasks related to that move in the backlog. We can probably couple the migrations in one, I'll let people decide :)
[12:03:42] Machine-Learning-Team: Get a GPU on Lift Wing - https://phabricator.wikimedia.org/T327923 (elukey) I had a chat with SRE today, and they pointed me to https://cloud.google.com/kubernetes-engine/docs/how-to/gpus-multi, so the MIG technology from Nvidia is definitely able to share GPUs across containers.
[12:04:13] https://cloud.google.com/kubernetes-engine/docs/how-to/gpus-multi - very interesting!
[12:09:18] * isaranto lunch o clock!
[12:17:52] Machine-Learning-Team, DBA, Data-Engineering-Planning, Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (herron)
[12:26:30] Machine-Learning-Team, DBA, Data-Engineering-Planning, Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (ssingh)
[12:39:14] Machine-Learning-Team, DBA, Data-Engineering-Planning, Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Vgutierrez)
[12:41:00] Machine-Learning-Team, DBA, Data-Engineering-Planning, Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Joe) To depool all services in codfw we will just need to run: ` sudo cookbook sre.discovery.datacenter-route --reason 'T327925' depoo...
[12:41:41] Machine-Learning-Team, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Marostegui) @Joe @akosiaris I assume we'll depool codfw for this one too?
[12:46:07] Machine-Learning-Team, DBA, Data-Engineering-Planning, Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Joe) Please note: this won't depool `docker-registry`, which will still be active in codfw for the duration of the maintenance.
[13:24:21] (CR) Ilias Sarantopoulos: [V: +2 C: +2] test: liftwing manual testing on deployment server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/884292 (https://phabricator.wikimedia.org/T327787) (owner: Ilias Sarantopoulos)
[13:26:46] Machine-Learning-Team, DBA, Data-Engineering-Planning, Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (jcrespo)
[13:33:49] Machine-Learning-Team, DBA, Data-Engineering-Planning, Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (ayounsi) For the record, full row hosts downtime done with: `sudo cookbook sre.hosts.downtime --hours 2 -r "codfw row A upgrade" -t T32...
[13:34:16] Machine-Learning-Team, DBA, Data-Engineering-Planning, Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=295bf4d5-8856-488b-9ca9-06a0ff06db18) set by ayounsi@cumin1001 for 2:0...
[13:34:57] elukey: \o Is there anything we need to do for the switch upgrade thing or will the depool/repool happen automatically?
[13:57:50] Machine-Learning-Team, Data-Engineering, Event-Platform Value Stream: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (Ottomata) @elukey @achou as noted in https://phabricator.wikimedia.org/T301878#8008932, it would be better if new streams like this were...
[14:06:08] 👀 no one is in the Codfw ROW A switch maintenance meeting ...👀
[14:07:41] kevinbazira: I think this meeting was set as a reminder - elukey: correct me if I'm wrong here
[14:11:23] Correct
[14:11:46] the actual comms for the maintenance is done over in #wikimedia-sre
[14:12:02] klausman: here sorry, in theory I added the commands to depool the two ores nodes, not really needed probably
[14:12:14] I did it anyway, just in case.
[14:12:15] kevinbazira: sorry for the confusion, it was just a reminder!
[14:12:18] super thanks
[14:12:39] no problem. thank you for the clarification.
[14:15:14] Machine-Learning-Team, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (akosiaris) >>! In T327991#8593396, @Marostegui wrote: > @Joe @akosiaris I assume we'll depool codfw for this one too? Yeah, as a team...
[14:21:13] Machine-Learning-Team: [WikiGPT] Add more articles to the corpus (instead of 1) - https://phabricator.wikimedia.org/T329016 (isarantopoulos) I fed into our prompt the first 3 articles returned from a search on wikipedia. However this seems to bring some confusion as some articles may be irrelevant {F36782979...
[14:21:39] elukey: I'll make sure to repool the ores nodes after maint. is done
[14:21:45] super thanks
[14:23:53] Machine-Learning-Team, Data-Engineering, Event-Platform Value Stream: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (elukey) @Ottomata sure it shouldn't be a big problem, is there an ETA for the page_change stream to be live? (just to figure out how muc...
[14:25:52] Machine-Learning-Team: Migrate ORES clients to LiftWing - https://phabricator.wikimedia.org/T312518 (elukey)
[14:25:56] Machine-Learning-Team, Patch-For-Review: Implement new mediawiki.revision-score streams with Lift Wing - https://phabricator.wikimedia.org/T328576 (elukey)
[14:38:38] Machine-Learning-Team: Add unique URL to each answer in WikiGPT - https://phabricator.wikimedia.org/T329003 (kevinbazira) A unique URL has been added for each WikiGPT search result. This URL is stored in a database and shown on the frontend in an element that enables users to copy it: {F36783019} When the...
[14:43:42] Services on ores200[12] are pooled again
[14:44:23] super
[14:50:51] Machine-Learning-Team: [WikiGPT] Use moderation API from OpenAI - https://phabricator.wikimedia.org/T329058 (isarantopoulos)
[14:56:45] TIL there is already a gitlab integration with phabricator -> https://wikitech.wikimedia.org/wiki/GitLab/Phabricator_integration
[14:56:50] nice!
[15:27:24] Machine-Learning-Team: Upgrade the inference-services repo codebase to kserve 0.10 (fastapi) - https://phabricator.wikimedia.org/T329032 (isarantopoulos) In this context, after we upgrade we can check if there is a swagger UI available for the model servers (which comes bundled with fastAPI https://fastapi.t...
[15:28:50] Machine-Learning-Team, DBA, Data-Engineering-Planning, Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Clement_Goubert)
[15:29:49] Machine-Learning-Team, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Clement_Goubert)
[15:30:14] Machine-Learning-Team, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Clement_Goubert)
[15:30:28] Machine-Learning-Team: Investigate procuring and installing two GPUs on Lift Wing - https://phabricator.wikimedia.org/T327923 (calbon)
[15:31:20] Machine-Learning-Team, Epic: WikiGPT Experiment - https://phabricator.wikimedia.org/T328494 (calbon)
[15:34:32] Machine-Learning-Team: Investigate procuring and installing two GPUs on Lift Wing - https://phabricator.wikimedia.org/T327923 (calbon) Status: waiting for GPUs to be moved from Hadoop Cluster to DSE Cluster and seeing if we can experiment on that.
[15:39:25] Machine-Learning-Team, DBA, Data-Engineering-Planning, Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (ayounsi) Open→Resolved a:ayounsi The upgrade was smooth, ~15min hard downtime. No user impact, all the depools did their jo...
[15:41:37] Machine-Learning-Team: Add basic explainability to WikiGPT - https://phabricator.wikimedia.org/T328638 (calbon) Open→Resolved
[15:41:40] Machine-Learning-Team, Epic: WikiGPT Experiment - https://phabricator.wikimedia.org/T328494 (calbon)
[15:49:27] Machine-Learning-Team: [Liftwing testing] - Post deployment testing - https://phabricator.wikimedia.org/T327787 (isarantopoulos) Added the documentation for running the tests in the README file. https://gerrit.wikimedia.org/r/plugins/gitiles/machinelearning/liftwing/inference-services/+/refs/heads/main/test/...
[15:49:57] Machine-Learning-Team, DBA, Data-Engineering-Planning, Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (colewhite)
[15:52:09] Machine-Learning-Team, Patch-For-Review: Upgrade ml clusters to kserve 0.9 - https://phabricator.wikimedia.org/T325528 (elukey) Open→Resolved
[15:52:15] Lift-Wing, Machine-Learning-Team: Explore ingress filtering for Lift Wing - https://phabricator.wikimedia.org/T300259 (elukey) Open→Resolved
[15:52:17] Lift-Wing, Machine-Learning-Team, Epic: API Gateway Integration - https://phabricator.wikimedia.org/T288789 (elukey)
[15:53:15] Machine-Learning-Team: Investigate if the mediawiki.revision-score stream can be broken down into multiple ones with ChangeProp - https://phabricator.wikimedia.org/T327302 (elukey) Open→Resolved
[15:54:42] Lift-Wing, Machine-Learning-Team: Deploy MultilingualRevertRiskModel to production - https://phabricator.wikimedia.org/T325218 (elukey) Open→Resolved
[15:54:50] Machine-Learning-Team, Patch-For-Review: [revscoring] Upgrade python from 3.7 to 3.9 in docker images - https://phabricator.wikimedia.org/T325657 (calbon) Open→Resolved
[15:54:56] Machine-Learning-Team, Epic: WikiGPT Experiment - https://phabricator.wikimedia.org/T328494 (elukey)
[15:55:01] Machine-Learning-Team, Patch-For-Review: Add access restriction to WikiGPT - https://phabricator.wikimedia.org/T328526 (elukey) Open→Resolved
[15:55:05] Machine-Learning-Team: Remove additional drafttopic deployments - https://phabricator.wikimedia.org/T328916 (elukey) Open→Resolved
[15:55:51] Lift-Wing, Machine-Learning-Team: Test MultilingualRevertRiskModel inference service on ml-sandbox - https://phabricator.wikimedia.org/T323613 (elukey) Open→Resolved
[15:59:41] Lift-Wing, Machine-Learning-Team: Fix mwapi host header issue for outlink model server - https://phabricator.wikimedia.org/T325199 (elukey) Open→Resolved
[15:59:51] Machine-Learning-Team: Fix translatewiki-reverted and frwikisource-articlequality isvcs - https://phabricator.wikimedia.org/T324567 (elukey) Open→Resolved
[16:00:06] Lift-Wing, Machine-Learning-Team: Deploy revert-risk-model to production - https://phabricator.wikimedia.org/T321594 (elukey) Open→Resolved
[16:00:18] Lift-Wing, Machine-Learning-Team, Research: Upload new outlinks topic model to LiftWing - https://phabricator.wikimedia.org/T322881 (elukey) Open→Resolved
[16:00:30] Machine-Learning-Team: Remove hack from ML's blubber files - https://phabricator.wikimedia.org/T324658 (elukey) Open→Resolved
[16:02:21] isaranto: so https://phabricator.wikimedia.org/T312518 is my understanding of the whole migration
[16:02:32] I tried to dump all the things that I can think of, subtasks etc..
[16:02:56] it is probably missing something, plus of course the work to track down external bots and ask them to migrate to the api-gateway endpoint
[16:02:59] etc..
[16:03:05] Machine-Learning-Team: [Liftwing testing] - Post deployment testing - https://phabricator.wikimedia.org/T327787 (isarantopoulos) Open→Resolved
[16:04:11] Machine-Learning-Team: Migrate ORES clients to LiftWing - https://phabricator.wikimedia.org/T312518 (elukey)
[16:04:39] taking a break, bbiab
[16:04:57] Machine-Learning-Team: Investigate procuring and installing two GPUs on Lift Wing - https://phabricator.wikimedia.org/T327923 (Isaac) > I had a chat with SRE today, and they pointed me to https://cloud.google.com/kubernetes-engine/docs/how-to/gpus-multi, so the MIG technology from Nvidia is definitely able t...
[16:05:26] elukey: thanks!
[16:09:26] * isaranto afk/taking a break
[16:19:49] Machine-Learning-Team, Data-Engineering, Data-Persistence, Discovery-Search, and 7 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (ayounsi) p:Triage→Medium
[16:21:54] Machine-Learning-Team, Data-Engineering, Data-Persistence, Discovery-Search, and 7 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (ayounsi)
[16:25:21] Machine-Learning-Team, Data-Engineering, Data-Persistence, Discovery-Search, and 7 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (ayounsi)
[16:30:01] Machine-Learning-Team, Data-Engineering, Data-Persistence, Discovery-Search, and 8 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (colewhite)
[16:56:35] Machine-Learning-Team, CirrusSearch, Discovery-Search: Add outlink topic model predictions to CirrusSearch indices - https://phabricator.wikimedia.org/T328276 (Isaac) thanks for the extra details @achou and @dcausse! the clear delineation of outlinks for namespace 0 and drafttopic for draft namespace...
[17:28:46] going afk! Have a nice rest of the day folks
[17:48:32] o/
[17:48:45] same here, cu tomorrow folks!
[21:55:03] Machine-Learning-Team, Data-Engineering, Event-Platform Value Stream: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (Ottomata) live on all wikis: end of quarter if all goes well. live with any reliability promises: TBD