[08:14:53] hello folks
[08:15:07] I am again trying to upgrade the staging cluster to 1.23
[08:16:34] Machine-Learning-Team: Review ORES traffic to better understand Lift Wing's requirements - https://phabricator.wikimedia.org/T325763 (elukey)
[08:16:36] Machine-Learning-Team: Migrate ORES clients to LiftWing - https://phabricator.wikimedia.org/T312518 (elukey)
[08:24:28] Machine-Learning-Team, Epic: WikiGPT Experiment - https://phabricator.wikimedia.org/T328494 (kevinbazira)
[08:28:25] (PS2) Elukey: events: support multiple source events [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888190 (https://phabricator.wikimedia.org/T328576)
[08:28:49] found all things needed in page_change to create a revision-score event
[08:35:50] (PS3) Elukey: events: support multiple source events [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888190 (https://phabricator.wikimedia.org/T328576)
[09:09:57] elukey: \o will you be running it in a shared tmux again?
[09:11:43] klausman: o/ yes yes there is one on cumin1001 running!
[09:12:00] so far ml-staging-etcd2001 reimaged
[09:12:03] 2002 under way
[09:17:38] attached, let me know if I should(n't) resize the terminal
[09:18:15] elukey: is the events patch ready for review?
[09:19:05] is there any staging/sandbox kafka where we can test stuff?
[09:20:52] (PS4) Elukey: events: support multiple source events [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888190 (https://phabricator.wikimedia.org/T328576)
[09:21:20] isaranto: it is now, I tested locally but not with page_change events (since I can't find any for the moment)
[09:21:25] we don't have staging kafka
[09:23:34] klausman: etcd on 2002 doesn't come up due to some raft error, weird
[09:23:50] taking a look
[09:25:00] and how do u test the events locally? do u do it via ssh tunneling?
[09:25:27] isaranto: I use https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/KServe#Run_KServe_locally_via_Docker
[09:25:28] elukey: looks like a certificate error
[09:26:43] elukey: 🤦‍♂️ yeah thanks, got confused
[09:27:00] klausman: mmm I think that is a red herring, 2003 is down
[09:27:02] I see
[09:27:03] tocommit(318398820) is out of range [lastIndex(3)]. Was the raft log corrupted, truncated, or lost?
[09:27:16] Ah. Did we do the wipe etc fs thing?
[09:27:43] nono all etcds are down + puppet disabled before starting, then we reimage
[09:27:48] one by one
[09:28:35] ok, so that isn't the problem.
[09:28:49] it seems this one https://github.com/etcd-io/etcd/issues/13509
[09:30:02] ah no wait snap
[09:30:24] puppet is not disabled on 2003, etcd is up
[09:30:51] So cert error after all?
[09:34:49] I think it is a weird state of the cluster
[09:36:49] trying to remove 2002 via etcdctl explicitly
[09:36:57] and run puppet on it to see how it goes
[09:37:29] ack
[09:38:59] keeping an eye on the logs
[09:40:10] elukey: there is another weirdness
[09:40:23] Feb 13 09:40:16 ml-staging-etcd2003 etcd[23875]: the local etcd version 3.2.26 is not up-to-date
[09:40:25] Feb 13 09:40:16 ml-staging-etcd2003 etcd[23875]: member fce0f93975c27096 has a higher version 3.3.25
[09:42:40] yes this is because there is a bug in the cookbook, I have to fix it
[09:42:46] puppet is not disabled properly
[09:42:47] Ah, roger.
[09:42:47] sigh
[09:42:56] want me to do that (disable puppet)?
[09:43:11] nono I am going to abort the cookbook, fix and then restart
[09:43:18] Alright.
[09:46:19] klausman: basically https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/888640/
[09:47:16] ah, one-character fixes are best fixes.
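The message quoted at 09:27:03 is the text raft emits when a follower is asked to commit up to an index that lies beyond the end of its own log. The sketch below restates that check in Python to show why a freshly reimaged node (lastIndex 3) cannot accept a commit index of 318398820 from the old cluster. The function name and signature are hypothetical; etcd's real implementation is in Go and panics rather than raising.

```python
def commit_to(last_index: int, committed: int, to_commit: int) -> int:
    """Return the new commit index for a follower's log.

    Illustrative only: mirrors the sanity check behind the log message,
    not etcd's actual code.
    """
    if to_commit <= committed:
        # Nothing new to commit; the commit index never moves backwards.
        return committed
    if to_commit > last_index:
        # The leader believes we hold entries that our log does not contain.
        raise RuntimeError(
            f"tocommit({to_commit}) is out of range [lastIndex({last_index})]. "
            "Was the raft log corrupted, truncated, or lost?"
        )
    return to_commit


if __name__ == "__main__":
    print(commit_to(last_index=10, committed=3, to_commit=5))
    try:
        commit_to(last_index=3, committed=0, to_commit=318398820)
    except RuntimeError as err:
        print(err)
```

With the values from the log, the second call raises with the same wording as the etcd message, which is consistent with an empty node joining peers that still remember the previous cluster state.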
[09:49:18] Machine-Learning-Team: [WikiGPT] Use moderation API from OpenAI - https://phabricator.wikimedia.org/T329058 (isarantopoulos) We are using the moderation endpoint provided by OpenAI to filter the response if it is classified to belong in any of the following categories: - hate - hate/threatening - self-...
[09:54:28] klausman: updated with a little tweak
[09:57:25] lgtm
[09:57:59] (CR) Elukey: "Tested revscoring-predictor locally with revision-create, seems working fine! I need to find an example of page_change to use for testing " [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888190 (https://phabricator.wikimedia.org/T328576) (owner: Elukey)
[10:00:20] thanks!
[10:00:27] I'll wait for Riccardo's +1 before proceeding
[10:20:34] I found page_change events, they are on jumbo
[10:25:47] (PS5) Elukey: events: support multiple source events [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888190 (https://phabricator.wikimedia.org/T328576)
[10:26:24] (CR) Elukey: "Found a page_change event, they are on jumbo for the moment, not on main. I tested locally with page change as well, it seems working, lem" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888190 (https://phabricator.wikimedia.org/T328576) (owner: Elukey)
[10:30:02] ack!
[10:38:40] klausman: merged the fix! But at this point I'll wait for this afternoon to start, what do you think?
[10:38:57] sgtm, I am already feeling a bit hungry :)
[11:06:19] Machine-Learning-Team, DBA, Data-Engineering-Planning, Data-Persistence, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (Marostegui)
[11:07:54] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-staging-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T327767 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6d18f34a-0466-4d80-a764-16205fe28f4b) set by elukey@cumin1001 for 4:00:00 on 7 host(s) and their services...
[11:08:12] Machine-Learning-Team, DBA, Data-Engineering-Planning, Data-Persistence, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (Marostegui)
[11:12:56] Machine-Learning-Team, Patch-For-Review: Implement new mediawiki.revision-score streams with Lift Wing - https://phabricator.wikimedia.org/T328576 (elukey) @Ottomata I tried to add support for `page_change` in Lift Wing, it shouldn't be hard :) As far as I can see all the info that we need to create a re...
[11:16:59] Machine-Learning-Team, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Marostegui)
[11:47:50] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 6th round of wikis - https://phabricator.wikimedia.org/T304550 (Sgs) I ran this script for adding the link-recommendation task type and populating the excluded sections: `lang=bash PHAB=T304550...
[11:56:11] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), Patch-For-Review, User-notice: Deploy "add a link" to 6th round of wikis - https://phabricator.wikimedia.org/T304550 (Sgs)
[12:22:59] (CR) Ilias Sarantopoulos: [C: +1] "Nice work! I tested it locally and works great!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888190 (https://phabricator.wikimedia.org/T328576) (owner: Elukey)
[13:23:27] (CR) Klausman: [C: +1] events: support multiple source events [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888190 (https://phabricator.wikimedia.org/T328576) (owner: Elukey)
[13:56:49] Machine-Learning-Team, Patch-For-Review: Implement new mediawiki.revision-score streams with Lift Wing - https://phabricator.wikimedia.org/T328576 (Ottomata) Awesome! Now if only we can move to the new data model too! :D But naw, that is probably for newer ML streams, right? These you are just trying...
[14:01:40] Machine-Learning-Team, Patch-For-Review: Implement new mediawiki.revision-score streams with Lift Wing - https://phabricator.wikimedia.org/T328576 (elukey) @Ottomata I tried `kafkacat -C -t eqiad.rc1.mediawiki.page_change -b localhost:9092` on kafka-main1001 and I don't get any event, meanwhile if I do i...
[14:01:53] klausman_:
[14:02:01] going to attempt another upgrade :)
[14:02:06] Alright, I'm game
[14:03:20] what the... already failed, with a strange error on the admin mgs
[14:03:22] what in the...
[14:04:08] It _looks_ almost like a Python upgrade broke cumin?
[14:04:25] ah yeah of course disabled() doesn't take the same arg parameter as disabled()
[14:04:28] * elukey cries in a corner
[14:04:34] Ah!
[14:04:44] well, that should be easily fixable.
[14:05:18] mmm in theory yes https://doc.wikimedia.org/spicerack/master/api/spicerack.puppet.html#spicerack.puppet.PuppetHosts.disable
[14:07:18] review incoming...
[14:07:19] sigh
[14:07:31] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/888706/
[14:07:55] lgtm!
[14:08:14] danke :)
[14:14:08] ok etcd2001 under reimage
[14:15:40] ack
[14:36:06] Machine-Learning-Team, Patch-For-Review: Implement new mediawiki.revision-score streams with Lift Wing - https://phabricator.wikimedia.org/T328576 (Ottomata) OHH!!! YOU are right! We are producing the rc1s to eventgate-analytics-external, I forgot. Sorry about that. Yes, the intention is to move to e...
[14:36:24] elukey: so if page_change is not in kafka main... it's hard for you to test with change prop eh?
[14:36:27] is that right?
[14:37:17] ottomata: yeah I can probably send some patches to make it work but I'd rather not :D
[14:37:46] I was trying to figure out why the streams are in jumbo and not in main though
[14:55:39] elukey: btw, run-puppet-agent on 2001 keeps failing because systemd thinks etcd is failing
[14:56:21] yeah I was checking
[14:57:01] at least now it is not up elsewhere :D
[14:57:54] I bet that with a single node we need to do something
[14:58:26] You mean something --assume-its-fine?
[14:58:46] How does this then even work on initial setup?
[14:58:56] https://wikitech.wikimedia.org/wiki/Etcd#Bootstrapping_an_etcd_cluster
[14:59:02] I don't find anything specific
[14:59:04] Machine-Learning-Team: Upgrade ML clusters to Kubernetes 1.23 - https://phabricator.wikimedia.org/T324542 (JMeybohm)
[15:00:52] so systemctl says the service is "activating". I suspect it needs to touch a file somewhere or something to be considered started?
[15:02:17] It uses Type=notify:
[15:02:19] Behavior of notify is similar to exec; however, it is expected that the service sends a "READY=1" notification message via sd_notify(3) or an equivalent call when it has finished starting up. systemd will proceed with starting follow-up units after this notification message has been sent. If this option is used, NotifyAccess= (see below) should be set to open access to the notification socket
[15:02:19] so far my impression is that 2001 is trying to run a leader election with the other nodes failing
[15:02:21] provided by systemd. If NotifyAccess= is missing or set to none, it will be forcibly set to main.
[15:03:45] Still, this wouldn't work on initial setup, either
[15:04:53] do we maybe need the --initial* flags?
[15:05:20] what flags do you have in mind?
[15:05:47] if you look at man etcd, there are some --initial flags in the clustering section
[15:06:07] --initial-cluster-state seems promising
[15:06:23] But I dunno how they would be used in normal setup
[15:08:12] The other option I could see is somehow getting NRPE to start (which is not starting due to a dep on etcd) and hoping that that will make Puppet sufficiently happy
[15:09:45] we do have ETCD_INITIAL_CLUSTER_STATE="new" in /etc/default/etcd
[15:11:42] Hurm.
[15:12:53] so in the early logs I see
[15:12:53] Feb 13 14:37:30 ml-staging-etcd2001 etcd[34856]: got bootstrap from DNS for etcd-server at 0=https://ml-staging-etcd2002.codfw.wmnet:2380
[15:12:56] Feb 13 14:37:30 ml-staging-etcd2001 etcd[34856]: got bootstrap from DNS for etcd-server at 1=https://ml-staging-etcd2003.codfw.wmnet:2380
[15:12:59] Feb 13 14:37:30 ml-staging-etcd2001 etcd[34856]: got bootstrap from DNS for etcd-server at ml-staging-etcd2001=https://ml-staging-etcd2001.codfw.wmnet:2380
[15:13:02] because we use discovery records
[15:13:20] I suspect that etcd is trying to contact the rest of the nodes failing
[15:13:25] So you think etcd now "knows too much already"?
[15:13:56] The discovery records for 2002 and 2003 should be gone now, right?
[15:14:47] we have some static config in the dns repo
[15:14:52] SRV records
[15:15:02] But that would be the case for an initial setup as well
[15:26:19] ah now it works
[15:26:56] I had to disable the _DISCOVERY_ bits in the /etc/default/etcd config
[15:30:34] I'll wait to see if the cookbook recovers
[15:35:31] klausman: reimaging 2002
[15:35:35] let's see if the trick worked
[15:38:06] ah! so it doesn't even try, until it has someone to talk to
[15:41:07] I think so, my impression is that with SRV records it tries a leader election or similar
[15:41:18] Machine-Learning-Team: Fix WikiGPT copy link feature mobile view - https://phabricator.wikimedia.org/T329528 (kevinbazira)
[15:41:32] Should we add that change to what the cookbook does?
[15:42:07] definitely, it may need another ask_confirmation step or similar to merge a puppet change
[15:59:33] Ok, the two etcd's are running, but they disagree on the cluster ID
[16:04:50] ok so 2002 is not starting
[16:05:01] the "new" state is not ok, I had to change it to existing
[16:05:03] and
[16:05:04] etcd[3292]: could not get cluster response from https://ml-staging-etcd2001.codfw.wmnet:2380: Get "https://ml-staging-etcd2001.codfw.wmnet:2380/members": x509: certificate is valid for ml-staging-etcd2003.codfw.wmnet, _etcd-server-ssl._tcp.ml_staging_etcd.codfw.wmnet, ml-staging-etcd2001.codfw.wmnet, ml-staging-etcd2002.codfw.wmnet, not ml_staging_etcd.codfw.wmnet
[16:12:06] going afk for a bit, check later
[16:12:09] Machine-Learning-Team, Add-Link, Growth-Team: Fix Armenian sentence tokenization bug in the link recommendation algorithm - https://phabricator.wikimedia.org/T327371 (MGerlach) In short: I could resolve the issue when upgrading wikitextparser to version 0.51.1 (I previously used 0.45.1). I could re...
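The x509 error at 16:05:04 says the name being verified (ml_staging_etcd.codfw.wmnet) is not among the certificate's subject alternative names, even though the individual etcd hostnames are. A simplified Python sketch of that comparison follows; the helper names are hypothetical and the matching rules are deliberately reduced compared with what a real TLS library does. The SAN list is copied from the log line.

```python
from typing import Iterable

# Subject alternative names as listed in the error message above.
SANS = [
    "ml-staging-etcd2003.codfw.wmnet",
    "_etcd-server-ssl._tcp.ml_staging_etcd.codfw.wmnet",
    "ml-staging-etcd2001.codfw.wmnet",
    "ml-staging-etcd2002.codfw.wmnet",
]


def matches_san(hostname: str, san: str) -> bool:
    """Case-insensitive match; a leading '*.' covers exactly one label."""
    hostname = hostname.lower().rstrip(".")
    san = san.lower().rstrip(".")
    if san.startswith("*."):
        head, _, rest = hostname.partition(".")
        return bool(head) and rest == san[2:]
    return hostname == san


def valid_for(hostname: str, sans: Iterable[str]) -> bool:
    """True if any SAN entry covers the requested hostname."""
    return any(matches_san(hostname, san) for san in sans)


if __name__ == "__main__":
    for name in ("ml-staging-etcd2001.codfw.wmnet", "ml_staging_etcd.codfw.wmnet"):
        print(name, valid_for(name, SANS))
```

Under these assumptions the node hostname passes and the bare discovery domain fails, which matches the shape of the error: the certificate carries the SRV record name but not the domain itself.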
[16:15:04] I'll keep an eye on things
[16:22:18] Machine-Learning-Team, DBA, Data-Engineering-Planning, Data-Persistence, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (Gehel)
[16:26:08] Machine-Learning-Team, CirrusSearch, Discovery-Search: Add outlink topic model predictions to CirrusSearch indices - https://phabricator.wikimedia.org/T328276 (Gehel) p:Triage→High
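The systemd text quoted at 15:02:19 explains why the unit sat in "activating": with Type=notify, systemd waits until the service sends READY=1 on the socket named by the NOTIFY_SOCKET environment variable. A minimal Python sketch of that handshake from the service side is below. The function name mirrors sd_notify(3) but this is an illustration, not the code etcd uses, and whether etcd was withholding the notification while waiting for its peers is only the hypothesis from the discussion above.

```python
import os
import socket


def sd_notify(state: str = "READY=1") -> bool:
    """Send a notification datagram to systemd, if a notify socket is set.

    Returns False when NOTIFY_SOCKET is absent (not running under a
    Type=notify unit), True once the datagram has been sent.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading '@' denotes a Linux abstract-namespace socket.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), path)
    return True


if __name__ == "__main__":
    # Outside systemd this prints False, because NOTIFY_SOCKET is unset.
    print(sd_notify())
```

Until a service makes a call like this, `systemctl status` keeps reporting "activating", and units that depend on it are not started.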