[00:20:32] <wikibugs>	 Machine-Learning-Team, Growth-Team, PageTriage: Detection and flagging of articles that are AI/LLM-generated - https://phabricator.wikimedia.org/T330346 (Tgr) >>! In T330346#8642831, @Novem_Linguae wrote: > https://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_Articles_for_creation#ChatGPT_and_oth...
[03:59:58] <wikibugs>	 Machine-Learning-Team, Add-Link, Growth-Team: Northern Luri Wikipedia model training pipeline failed - https://phabricator.wikimedia.org/T330616 (kevinbazira)
[04:24:47] <wikibugs>	 Machine-Learning-Team, Add-Link, Growth-Team: Northern Luri Wikipedia model training pipeline failed - https://phabricator.wikimedia.org/T330616 (Bugreporter) This Wikipedia is already closed. See {T272041}.
[07:24:20] <elukey>	 good morning :)
[07:44:52] <wikibugs>	 Machine-Learning-Team: Upgrade the ml-staging-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T327767 (elukey) Finally after a lot of digging I found the issue: the `ingressgateway` pods are missing a network policy to allow traffic to flow to port 8081. The `knative-local-gateway` replaces the...
[07:45:10] <elukey>	 klausman: o/ --^ is my best explanation for the 10s request mystery
[07:53:56] <elukey>	 and https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/892353 should fix it
[09:09:06] <wikibugs>	 Lift-Wing, Machine-Learning-Team, Epic: Lift Wing proof of concept - https://phabricator.wikimedia.org/T272917 (elukey)
[09:09:08] <wikibugs>	 Lift-Wing, Machine-Learning-Team: Implement an online feature store - https://phabricator.wikimedia.org/T294434 (elukey)
[09:19:32] <wikibugs>	 Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[09:32:35] <wikibugs>	 Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye executed with errors: - dse-k8s-w...
[09:35:27] <klausman>	 elukey: \o Having a look-see
[09:39:23] <klausman>	 Nice sleuthing!
[09:40:21] <elukey>	 \o/
[09:42:11] <elukey>	 the PREROUTING chain is definitely something to keep in mind
[09:42:16] <elukey>	 tcpdump will be affected etc..
[09:42:36] <klausman>	 Yeah, overlay networks are great when they work fine, but they turn into mud when they don't
[10:01:19] <wikibugs>	 Machine-Learning-Team: Automate the procedure to bootstrap minikube on the ML-Sandbox and to share it by multiple users - https://phabricator.wikimedia.org/T305447 (elukey) Ilias created https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/KServe/DeployLocal, we should probably use it as baseline to...
[10:02:19] <wikibugs>	 Machine-Learning-Team: Add documentation about LiftWing to the API Portal - https://phabricator.wikimedia.org/T325759 (elukey) Opened T330634 to the API platform folks to grant us access to the API gateway portal.
[10:26:41] <wikibugs>	 Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[10:43:39] <elukey>	 aiko: o/
[10:50:07] <elukey>	 the httpbb script for staging returns
[10:50:07] <elukey>	 https://revert-risk-model-predictor-default.experimental.wikimedia.org/v1/models/revert-risk-model:predict (test_liftwing_staging.yaml:100) Status code: expected 200, got 404.
[10:50:20] <elukey>	 I guess that this is due to the new split between RR model servers..
[10:50:25] <elukey>	 can you check it when you have time?
[11:06:03] <wikibugs>	 Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye executed with errors: - dse-k8s-w...
[11:06:12] <aiko>	 elukey: o/ the model name changed to "revertrisk" from "revert-risk-model"
[11:07:23] <wikibugs>	 Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[11:07:33] <aiko>	 elukey: I'll send a patch to the httpbb script
[11:07:37] <elukey>	 <3
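The 404 reported by httpbb follows from the rename, because the predict path embeds the model name. A minimal sketch of how the path changes, assuming the `/v1/models/<name>:predict` pattern visible in the httpbb output; `predict_path` is a hypothetical helper for illustration and is not part of httpbb or the actual patch.

```python
def predict_path(model_name: str) -> str:
    """Build a v1-style predict path for a model name (hypothetical helper)."""
    if not model_name:
        raise ValueError("model_name must be non-empty")
    return f"/v1/models/{model_name}:predict"


# Old and new model names, as quoted in the chat.
old_path = predict_path("revert-risk-model")
new_path = predict_path("revertrisk")

print(old_path)
print(new_path)
```

Only the path is derived here; the hostname to use after the model-server split is not stated in the log, so it is left out.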
[11:20:26] <wikibugs>	 Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye executed with errors: - dse-k8s-w...
[11:36:44] * elukey lunch
[11:52:05] <wikibugs>	 Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[12:04:59] <wikibugs>	 Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye executed with errors: - dse-k8s-...
[12:21:39] <wikibugs>	 Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[12:38:54] <wikibugs>	 Machine-Learning-Team, Data-Engineering, Observability-Logging: centrallog1002: failed to start kafkatee - https://phabricator.wikimedia.org/T330654 (jbond)
[13:28:33] <wikibugs>	 Machine-Learning-Team, Language-Team, serviceops-radar: Hosting machine request for machine translation - https://phabricator.wikimedia.org/T329971 (JMeybohm)
[13:29:06] * klausman lunch
[13:55:06] <wikibugs>	 Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 11th round of wikis - https://phabricator.wikimedia.org/T308136 (kevinbazira) 21/22 models were trained successfully in the 11th round of wikis.  The Northern Luri Wikipedia (lrcwiki) pipeline did not complete succe...
[13:57:02] <wikibugs>	 Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 11th round of wikis - https://phabricator.wikimedia.org/T308136 (kevinbazira)
[14:14:50] <wikibugs>	 Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye
[14:23:53] <chrisalbon>	 Goood morning all!
[14:23:58] <chrisalbon>	 I’m back!
[14:24:02] <chrisalbon>	 I missed you all
[14:26:11] <elukey>	 o/
[14:26:13] <elukey>	 good morning :)
[14:33:29] <klausman>	 Heyo Chris, great that you're back
[14:39:17] <wikibugs>	 Machine-Learning-Team: Upgrade the ml-etcd clusters to bullseye and PKI - https://phabricator.wikimedia.org/T330662 (elukey)
[14:41:51] <elukey>	 PASS: 17 requests sent to inference-staging.svc.codfw.wmnet. All assertions passed.
[14:41:54] <elukey>	 \o/
[14:42:27] <elukey>	 klausman: if you are ok I'd start reimaging etcd nodes for mlserve, one node at a time
[14:42:33] <elukey>	 with the remove/add member procedure
[14:45:20] <wikibugs>	 Machine-Learning-Team: Upgrade the ml-etcd clusters to bullseye and PKI - https://phabricator.wikimedia.org/T330662 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by elukey@cumin1001 for host ml-etcd2001.codfw.wmnet with OS bullseye
[14:45:24] <elukey>	 (started with ml-etcd2001)
[14:49:36] <wikibugs>	 Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye completed: - dse-k8s-worker1006...
[14:50:26] <klausman>	 elukey: SGTM. Are you going to tmux it?
[14:51:31] <elukey>	 klausman: I started with one under my user, so you can probably attach sudoing as me, sorry didn't think about it
[14:51:53] <klausman>	 np
[14:52:12] <klausman>	 Just shoulder-surfing :)
[14:52:47] <elukey>	 ok also to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/892462 ?
[14:53:08] <elukey>	 we don't really need it 
[14:53:18] <elukey>	 the flag should move from "new" to "existing"
[14:54:00] <klausman>	 LGTM
[14:54:15] <wikibugs>	 Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye
[15:00:35] <klausman>	 So to recap, the procedure is: remove cluster member by hand, reimage machine, re-admit member?
[15:01:26] <elukey>	 I usually kick off the reimage (so the node is listed as down in etcdctl), remove and add
[15:01:34] <klausman>	 ah, ack
[15:01:42] <elukey>	 in this way the node in theory should be allowed to have a fresh raft log
[15:03:28] <klausman>	 puppet run failed
[15:03:48] <elukey>	 yes my fault
[15:04:33] <klausman>	 What happened?
[15:05:01] <klausman>	 (that RE in the logs is utterly impenetrable...)
[15:05:24] <elukey>	 you can always use install_console from cumin1001 to get a tty and inspect
[15:05:38] <elukey>	 so after https://gerrit.wikimedia.org/r/c/operations/dns/+/889661 I didn't change the related puppet setting
[15:05:53] <klausman>	 ah, so cert for wrong name?
[15:06:12] <elukey>	 yeah
[15:06:16] <elukey>	 I was convinced that I did it
[15:06:27] <elukey>	 anyway, sending a patch
[15:08:24] <elukey>	 this should do it https://gerrit.wikimedia.org/r/c/operations/puppet/+/892466
[15:08:52] <klausman>	 lgtm
[15:09:59] <wikibugs>	 Machine-Learning-Team, Data-Engineering, Observability-Logging: centrallog1002: failed to start kafkatee - https://phabricator.wikimedia.org/T330654 (fgiunchedi) Thank you for the heads up @jbond, cc @andrea.denisse
[15:10:07] <elukey>	 thanks :)
[15:10:22] <elukey>	 klausman: if you want to do eqiad we can split
[15:10:45] <klausman>	 Let's wait and see if this first one works out. Don't want to take down both sites
[15:11:22] <klausman>	 If it does, sure, will do
[15:11:22] <elukey>	 sure sure
[15:17:28] <elukey>	 puppet succeeded on 2001
[15:18:04] <elukey>	 let's see if etcdctl is fine after the last reboot
[15:19:04] <klausman>	 Was 2001 the master before you removed it?
[15:19:42] <elukey>	 didn't check
[15:31:01] <elukey>	 ah snap now I remember why the change wasn't merged
[15:31:19] <elukey>	 so ml-etcd200[2,3] are on buster, so they don't have the new SAN
[15:31:27] <elukey>	 and 2001 now complains about it
[15:31:36] <elukey>	 could not get cluster response from https://ml-etcd2003.codfw.wmnet:2380: Get "https://ml-etcd2003.codfw.wmnet:2380/members": x509: certificate is valid for ml-etcd2003.codfw.wmnet, _etcd-server-ssl._tcp.ml-etcd.codfw.wmnet, not ml-etcd.codfw.wmnet
[15:31:53] <klausman>	 aah!
[15:32:14] <klausman>	 Hmm. I don't see an easy way around it except adding both old and new SANs for a time
[15:35:33] <elukey>	 weird one indeed
[15:35:54] <wikibugs>	 Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye completed: - dse-k8s-worker1007 (...
[15:35:58] <wikibugs>	 Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye executed with errors: - dse-k8s-w...
[15:36:03] <elukey>	 well we just need one more SAN
[15:36:17] <elukey>	 ml-etcd.codfw.wmnet, but we'd need to add it to the cergen cert
[15:38:00] <elukey>	 otherwise I can downtime and just reimage all the nodes at once for the k8s upgrade
[15:38:05] <elukey>	 probably easier
[15:38:28] <klausman>	 Yeah, agreed.
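The x509 failure quoted at 15:31:36 comes down to a name lookup against the certificate's SAN list: 2001 dials its peers as `ml-etcd.codfw.wmnet`, and the buster certificates do not carry that name. A minimal sketch of that check, using only the names from the error message; `name_in_sans` is a hypothetical helper that does exact, case-insensitive matching and deliberately ignores wildcards, so it is simpler than a real TLS verifier.

```python
def name_in_sans(name: str, sans: list[str]) -> bool:
    """Return True if `name` exactly matches one of the DNS SANs.

    Hypothetical helper: case-insensitive, trailing dot ignored,
    no wildcard handling.
    """
    wanted = name.lower().rstrip(".")
    return any(wanted == san.lower().rstrip(".") for san in sans)


# SANs listed in the error for ml-etcd2003 (still on buster).
buster_sans = [
    "ml-etcd2003.codfw.wmnet",
    "_etcd-server-ssl._tcp.ml-etcd.codfw.wmnet",
]

# The name the reimaged node expects to find.
print(name_in_sans("ml-etcd.codfw.wmnet", buster_sans))  # False

# With the one extra SAN discussed in the chat, the check passes.
print(name_in_sans("ml-etcd.codfw.wmnet", buster_sans + ["ml-etcd.codfw.wmnet"]))  # True
```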
[15:43:55] <wikibugs>	 Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1005.eqiad.wmnet with OS bullseye
[15:44:25] <wikibugs>	 Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye
[15:46:06] <klausman>	 elukey: so are you wiping 2002 and 2003 today? or waiting 'til tomorrow?
[15:47:50] <wikibugs>	 Machine-Learning-Team: Upgrade ML clusters to Kubernetes 1.23 - https://phabricator.wikimedia.org/T324542 (elukey)
[15:47:56] <wikibugs>	 Machine-Learning-Team: Upgrade the ml-etcd clusters to bullseye and PKI - https://phabricator.wikimedia.org/T330662 (elukey) Open→Declined Tried with 2001 but failed to make it work. The new etcd version, on bullseye, requires a new TLS san in every etcd daemon's certificate to be able to run leader...
[15:48:24] <elukey>	 klausman: either tomorrow or later on in the week, I am going to prep the ml-serve-codfw's upgrade plan with code reviews
[15:48:34] <elukey>	 the cluster is not really used so we can take a slower pace
[15:48:48] <elukey>	 and it works with two etcd nodes 
[15:48:51] <klausman>	 sounds good. 
[15:49:14] <elukey>	 not ideal but at this stage we can experiment without too many regrets :)
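Why two nodes is "not ideal" can be read off the usual Raft majority rule, where a cluster of n voting members needs floor(n/2) + 1 of them to make progress. A small sketch of that arithmetic; the log does not say whether 2001 is still registered as a member, so both cases are shown, and the function names are made up for illustration.

```python
def quorum(members: int) -> int:
    """Majority needed for a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("members must be >= 1")
    return members // 2 + 1


def failures_tolerated(members: int, healthy: int) -> int:
    """How many more healthy members can fail before quorum is lost."""
    return max(healthy - quorum(members), 0)


# Case A (assumption): 2001 is still a registered member but down.
print(quorum(3), failures_tolerated(3, healthy=2))  # 2 0

# Case B (assumption): 2001 was removed, leaving a 2-member cluster.
print(quorum(2), failures_tolerated(2, healthy=2))  # 2 0
```

In both cases the two remaining nodes keep the cluster working, but neither can be lost.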
[15:56:29] <wikibugs>	 Machine-Learning-Team: Upgrade the ml-etcd clusters to bullseye and PKI - https://phabricator.wikimedia.org/T330662 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by elukey@cumin1001 for host ml-etcd2001.codfw.wmnet with OS bullseye executed with errors: - ml-etcd2001 (**FAIL**)   - Downt...
[15:56:58] <wikibugs>	 Machine-Learning-Team: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (elukey)
[15:57:15] <wikibugs>	 Machine-Learning-Team: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (elukey)
[16:04:40] <elukey>	 all code reviews out!
[16:04:45] <elukey>	 it was easier than expected after staging
[16:05:05] <elukey>	 chrisalbon: ml-staging-codfw is on k8s 1.23 and it didn't explode so far
[16:06:42] <klausman>	 I want to congratulate you now, but I fear I might wake Mr. Murphy. So I'll do it tomorrow :D
[16:06:52] <chrisalbon>	 Yessssss
[16:07:17] <chrisalbon>	 I see amazing progress has happened while I've been training to be a better manager
[16:07:39] <elukey>	 klausman: s/you/us :)
[16:08:09] <elukey>	 chrisalbon: we are not saying in any way that your absence boosted the team productivity by 200%
[16:08:12] <elukey>	 :D :D :D
[16:08:14] * elukey is joking
[16:08:38] <klausman>	 elukey: ok, fine, _us_ :D
[16:08:44] <elukey>	 trying to collect the last things to do before declaring Lift Wing MVP
[16:08:50] <elukey>	 I think we are close
[16:09:16] <chrisalbon>	 oh wow you all have been busy
[16:10:17] <elukey>	 finally a lot of things are coming together
[16:26:19] <wikibugs>	 Machine-Learning-Team, Data-Engineering, Data-Persistence, Infrastructure-Foundations, and 8 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (MPhamWMF)
[16:47:32] <wikibugs>	 Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1005.eqiad.wmnet with OS bullseye completed: - dse-k8s-worker1005 (...
[16:54:14] <klausman>	 heading out now \o seeya tomorrow
[17:08:12] <elukey>	 o/
[17:38:30] <wikibugs>	 Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye completed: - dse-k8s-worker1008 (...
[17:38:33] <wikibugs>	 Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye executed with errors: - dse-k8s-w...
[17:59:46] <wikibugs>	 Machine-Learning-Team, ORES, Advanced-Search, All-and-every-Wikisource, and 74 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (Jdlrobson)
[18:05:46] <wikibugs>	 Machine-Learning-Team, Shared-Data-Infrastructure, Epic, Patch-For-Review: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (elukey) The DSE cluster is on k8s 1.23! I deployed everything up to istio/cfssl, we'll do more as soon as we need. There seems to be an issue with hosts...
[18:06:53] <elukey>	 DSE cluster on k8s 1.23 as well :)
[18:06:55] <elukey>	 going afk folks!
[18:06:59] <elukey>	 have a nice rest of the day
[18:07:00] <wikibugs>	 Machine-Learning-Team, Shared-Data-Infrastructure, Epic, Patch-For-Review: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (elukey) Stalled→Open p:Triage→Medium
[18:10:14] <aiko>	 bye luca! :)
[19:55:02] <wikibugs>	 Machine-Learning-Team, ORES, Advanced-Search, All-and-every-Wikisource, and 74 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (Krinkle)
[23:21:24] <wikibugs>	 Machine-Learning-Team, Data-Engineering, Observability-Logging: centrallog1002: failed to start kafkatee - https://phabricator.wikimedia.org/T330654 (andrea.denisse) Open→In progress a:andrea.denisse