[00:20:32] Machine-Learning-Team, Growth-Team, PageTriage: Detection and flagging of articles that are AI/LLM-generated - https://phabricator.wikimedia.org/T330346 (Tgr) >>! In T330346#8642831, @Novem_Linguae wrote: > https://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_Articles_for_creation#ChatGPT_and_oth...
[03:59:58] Machine-Learning-Team, Add-Link, Growth-Team: Northern Luri Wikipedia model training pipeline failed - https://phabricator.wikimedia.org/T330616 (kevinbazira)
[04:24:47] Machine-Learning-Team, Add-Link, Growth-Team: Northern Luri Wikipedia model training pipeline failed - https://phabricator.wikimedia.org/T330616 (Bugreporter) This Wikipedia is already closed. See {T272041}.
[07:24:20] good morning :)
[07:44:52] Machine-Learning-Team: Upgrade the ml-staging-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T327767 (elukey) Finally after a lot of digging I found the issue: the `ingressgateway` pods are missing a network policy to allow traffic to flow to port 8081. The `knative-local-gateway` replaces the...
[07:45:10] klausman: o/ --^ is my best explanation for the 10s request mystery
[07:53:56] and https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/892353 should fix it
[09:09:06] Lift-Wing, Machine-Learning-Team, Epic: Lift Wing proof of concept - https://phabricator.wikimedia.org/T272917 (elukey)
[09:09:08] Lift-Wing, Machine-Learning-Team: Implement an online feature store - https://phabricator.wikimedia.org/T294434 (elukey)
[09:19:32] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[09:32:35] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye executed with errors: - dse-k8s-w...
[09:35:27] elukey: \o Having a look-see
[09:39:23] Nice sleuthing!
[09:40:21] \o/
[09:42:11] the PREROUTING chain is definitely something to keep in mind
[09:42:16] tcpdump will be affected etc..
[09:42:36] Yeah, overlay networks are great when they work fine, but they turn into mud when they don't
[10:01:19] Machine-Learning-Team: Automate the procedure to bootstrap minikube on the ML-Sandbox and to share it by multiple users - https://phabricator.wikimedia.org/T305447 (elukey) Ilias created https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/KServe/DeployLocal, we should probably use it as baseline to...
[10:02:19] Machine-Learning-Team: Add documentation about LiftWing to the API Portal - https://phabricator.wikimedia.org/T325759 (elukey) Opened T330634 to the API platform folks to grant us access to the API gateway portal.
[10:26:41] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[10:43:39] aiko: o/
[10:50:07] the httpbb script for staging returns
[10:50:07] https://revert-risk-model-predictor-default.experimental.wikimedia.org/v1/models/revert-risk-model:predict (test_liftwing_staging.yaml:100) Status code: expected 200, got 404.
[10:50:20] I guess that this is due to the new split between RR model servers..
[10:50:25] can you check it when you have time?
[11:06:03] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye executed with errors: - dse-k8s-w...
[11:06:12] elukey: o/ the model name changed to "revertrisk" from "revert-risk-model"
[11:07:23] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[11:07:33] elukey: I'll send a patch to the httpbb script
[11:07:37] <3
[11:20:26] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye executed with errors: - dse-k8s-w...
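Editor's sketch for the 404 reported at 10:50: in the KServe v1 protocol the model name is part of the request path (`/v1/models/<name>:predict`), so a renamed model presumably moves the endpoint that the httpbb test has to hit. `predict_path` is a hypothetical helper written for illustration; only the two model names come from the log, and any host change that may have accompanied the rename is not modelled.

```python
def predict_path(model_name: str) -> str:
    """Build the KServe v1 prediction path for a given model name."""
    return f"/v1/models/{model_name}:predict"


OLD_NAME = "revert-risk-model"  # name in the failing httpbb check
NEW_NAME = "revertrisk"         # name reported in the chat at 11:06

# The two names yield different paths, so a test pinned to the
# old name no longer addresses the deployed model.
print(predict_path(OLD_NAME))
print(predict_path(NEW_NAME))
```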
[11:36:44] * elukey lunch
[11:52:05] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[12:04:59] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye executed with errors: - dse-k8s-...
[12:21:39] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[12:38:54] Machine-Learning-Team, Data-Engineering, Observability-Logging: centrallog1002: failed to start kafkatee - https://phabricator.wikimedia.org/T330654 (jbond)
[13:28:33] Machine-Learning-Team, Language-Team, serviceops-radar: Hosting machine request for machine translation - https://phabricator.wikimedia.org/T329971 (JMeybohm)
[13:29:06] * klausman lunch
[13:55:06] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 11th round of wikis - https://phabricator.wikimedia.org/T308136 (kevinbazira) 21/22 models were trained successfully in the 11th round of wikis. The Northern Luri Wikipedia (lrcwiki) pipeline did not complete succe...
[13:57:02] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 11th round of wikis - https://phabricator.wikimedia.org/T308136 (kevinbazira)
[14:14:50] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[14:23:53] Goood morning all!
[14:23:58] I’m back!
[14:24:02] I missed you all
[14:26:11] o/
[14:26:13] good morning :)
[14:33:29] Heyo Chris, great that you're back
[14:39:17] Machine-Learning-Team: Upgrade the ml-etcd clusters to bullseye and PKI - https://phabricator.wikimedia.org/T330662 (elukey)
[14:41:51] PASS: 17 requests sent to inference-staging.svc.codfw.wmnet. All assertions passed.
[14:41:54] \o/
[14:42:27] klausman: if you are ok I'd start reimaging etcd nodes for mlserve, one node at a time
[14:42:33] with the remove/add member procedure
[14:45:20] Machine-Learning-Team: Upgrade the ml-etcd clusters to bullseye and PKI - https://phabricator.wikimedia.org/T330662 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by elukey@cumin1001 for host ml-etcd2001.codfw.wmnet with OS bullseye
[14:45:24] (started with ml-etcd2001)
[14:49:36] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye completed: - dse-k8s-worker1006...
[14:50:26] elukey: SGTM. Are you going to tmux it?
[14:51:31] klausman: I started with one under my user, so you can probably attach sudoing as me, sorry didn't think about it
[14:51:53] np
[14:52:12] Just shoulder-surfing :)
[14:52:47] ok also to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/892462 ?
[14:53:08] we don't really need it
[14:53:18] the flag should move from "new" to "existing"
[14:54:00] LGTM
[14:54:15] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye
[15:00:35] So to recap, the procedure is: remove cluster member by hand, reimage machine, re-admit member?
[15:01:26] I usually kick off the reimage (so the node is listed as down in etcdctl), remove and add
[15:01:34] ah, ack
[15:01:42] in this way the node in theory should be allowed to have a fresh raft log
[15:03:28] puppet run failed
[15:03:48] yes my fault
[15:04:33] What happened?
[15:05:01] (that RE in the logs is utterly impenetrable...)
[15:05:24] you can always use install_console from cumin1001 to get a tty and inspect
[15:05:38] so after https://gerrit.wikimedia.org/r/c/operations/dns/+/889661 I didn't change the related puppet setting
[15:05:53] ah, so cert for wrong name?
[15:06:12] yeah
[15:06:16] I was convinced that I did it
[15:06:27] anyway, sending a patch
[15:08:24] this should do it https://gerrit.wikimedia.org/r/c/operations/puppet/+/892466
[15:08:52] lgtm
[15:09:59] Machine-Learning-Team, Data-Engineering, Observability-Logging: centrallog1002: failed to start kafkatee - https://phabricator.wikimedia.org/T330654 (fgiunchedi) Thank you for the heads up @jbond, cc @andrea.denisse
[15:10:07] thanks :)
[15:10:22] klausman: if you want to do eqiad we can split
[15:10:45] Let's wait if this first one works outon't want to take down both sites
[15:11:01] s/outon/out. Don/
[15:11:22] If it does, sure, will do
[15:11:22] sure sure
[15:17:28] puppet succeeded on 2001
[15:18:04] let's see if etcdctl is fine after the last reboot
[15:19:04] Was 2001 the master before you removed it?
[15:19:42] didn't check
[15:31:01] ah snap now I remember why the change wasn't merged
[15:31:19] so ml-etcd200[2,3] are on buster, so they don't have the new SAN
[15:31:27] and 2001 now complains about it
[15:31:36] could not get cluster response from https://ml-etcd2003.codfw.wmnet:2380: Get "https://ml-etcd2003.codfw.wmnet:2380/members": x509: certificate is valid for ml-etcd2003.codfw.wmnet, _etcd-server-ssl._tcp.ml-etcd.codfw.wmnet, not ml-etcd.codfw.wmnet
[15:31:53] aah!
[15:32:14] Hmm. I don't see an easy way around it except adding both old and new SANs for a time
[15:35:33] weird one indeed
[15:35:54] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye completed: - dse-k8s-worker1007 (...
[15:35:58] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye executed with errors: - dse-k8s-w...
[15:36:03] well we just need one more SAN
[15:36:17] ml-etcd.codfw.wmnet, but we'd need to add it to the cergen cert
[15:38:00] otherwise I can downtime and just reimage all the nodes at once for the k8s upgrade
[15:38:05] probably easier
[15:38:28] Yeah, agreed.
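Editor's sketch of the SAN mismatch behind the 15:31:36 error: a peer certificate is accepted only if the name being verified appears among its subject alternative names. `san_matches` is a hypothetical helper that does exact, case-insensitive comparison only; real TLS stacks also handle wildcards and IP SANs. The host names are taken from the error message in the log.

```python
def san_matches(hostname: str, sans: list[str]) -> bool:
    """Return True if hostname equals one of the certificate's DNS SANs.

    Simplifying assumption: exact, case-insensitive comparison only,
    ignoring a trailing dot. No wildcard or IP SAN handling.
    """
    wanted = hostname.rstrip(".").lower()
    return any(wanted == san.rstrip(".").lower() for san in sans)


# SANs listed in the x509 error for ml-etcd2003 (still on buster).
OLD_SANS = [
    "ml-etcd2003.codfw.wmnet",
    "_etcd-server-ssl._tcp.ml-etcd.codfw.wmnet",
]

# Per the error message, the name being verified is the cluster name,
# which the old certificate does not carry.
print(san_matches("ml-etcd.codfw.wmnet", OLD_SANS))

# Adding the one extra SAN mentioned in the chat satisfies the check.
print(san_matches("ml-etcd.codfw.wmnet", OLD_SANS + ["ml-etcd.codfw.wmnet"]))
```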
[15:43:55] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1005.eqiad.wmnet with OS bullseye
[15:44:25] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye
[15:46:06] elukey: so are you wiping 2002 and 2003 today? or waiting 'til tomorrow?
[15:47:50] Machine-Learning-Team: Upgrade ML clusters to Kubernetes 1.23 - https://phabricator.wikimedia.org/T324542 (elukey)
[15:47:56] Machine-Learning-Team: Upgrade the ml-etcd clusters to bullseye and PKI - https://phabricator.wikimedia.org/T330662 (elukey) Open→Declined Tried with 2001 but failed to make it work. The new etcd version, on bullseye, requires a new TLS SAN in every etcd daemon's certificate to be able to run leader...
[15:48:24] klausman: either tomorrow or later on in the week, I am going to prep the ml-serve-codfw's upgrade plan with code reviews
[15:48:34] the cluster is not really used so we can take a slower pce
[15:48:36] *pace
[15:48:48] and it works with two etcd nodes
[15:48:51] sounds good.
[15:49:14] not ideal but at this stage we can experiment without too many regrets :)
[15:56:29] Machine-Learning-Team: Upgrade the ml-etcd clusters to bullseye and PKI - https://phabricator.wikimedia.org/T330662 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by elukey@cumin1001 for host ml-etcd2001.codfw.wmnet with OS bullseye executed with errors: - ml-etcd2001 (**FAIL**) - Downt...
[15:56:58] Machine-Learning-Team: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (elukey)
[15:57:15] Machine-Learning-Team: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (elukey)
[16:04:40] all code reviews out!
[16:04:45] it was easier than expected after staging
[16:05:05] chrisalbon: ml-staging-codfw is on k8s 1.23 and it didn't explode so far
[16:06:42] I want to congratulate you now, but I fear I might wake Mr.Murphy. So I'll do it tomorrow :D
[16:06:52] Yessssss
[16:07:17] I see amazing progress has happened while I've been training to be a better manager
[16:07:39] klausman: s/you/us :)
[16:08:09] chrisalbon: we are not saying in any way that your absence boosted the team productivity by 200%
[16:08:12] :D :D :D
[16:08:14] * elukey is joking
[16:08:38] elukey: ok, fine, _us_ :D
[16:08:44] trying to collect the last things to do before declaring Lift Wing MV
[16:08:47] *MVP
[16:08:50] I think we are close
[16:09:16] oh wow you all have been busy
[16:10:17] finally a lot of things are coming up together
[16:26:19] Machine-Learning-Team, Data-Engineering, Data-Persistence, Infrastructure-Foundations, and 8 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (MPhamWMF)
[16:47:32] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1005.eqiad.wmnet with OS bullseye completed: - dse-k8s-worker1005 (...
[16:54:14] heading out now \o seeya tomorrow
[17:08:12] o/
[17:38:30] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye completed: - dse-k8s-worker1008 (...
[17:38:33] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye executed with errors: - dse-k8s-w...
[17:59:46] Machine-Learning-Team, ORES, Advanced-Search, All-and-every-Wikisource, and 74 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (Jdlrobson)
[18:05:46] Machine-Learning-Team, Shared-Data-Infrastructure, Epic, Patch-For-Review: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (elukey) The DSE cluster is on k8s 1.23! I deployed everything up to istio/cfssl, we'll do more as soon as we need. There seems to be an issue with hosts...
[18:06:53] DSE cluster on k8s 1.23 as well :)
[18:06:55] going afk folks!
[18:06:59] have a nice rest of the day
[18:07:00] Machine-Learning-Team, Shared-Data-Infrastructure, Epic, Patch-For-Review: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (elukey) Stalled→Open p:Triage→Medium
[18:10:14] bye luca! :)
[19:55:02] Machine-Learning-Team, ORES, Advanced-Search, All-and-every-Wikisource, and 74 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (Krinkle)
[23:21:24] Machine-Learning-Team, Data-Engineering, Observability-Logging: centrallog1002: failed to start kafkatee - https://phabricator.wikimedia.org/T330654 (andrea.denisse) Open→In progress a:andrea.denisse