[10:45:29] !log tools cleaning up old resourcequotas
[10:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:57:07] !log tools cleaning old maintain-kubeusers configmaps
[10:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[12:28:30] !log admin add all existing eqiad1 cloudvirts to new network-linuxbridge aggregate T364458
[12:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:28:38] T364458: Create new g4 flavors to support hypervisor migration from Linuxbridge to OVS Neutron agents - https://phabricator.wikimedia.org/T364458
[13:30:06] !log admin pin all existing eqiad1 flavors to linuxbridge hypervisors T364458
[13:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:30:12] T364458: Create new g4 flavors to support hypervisor migration from Linuxbridge to OVS Neutron agents - https://phabricator.wikimedia.org/T364458
[15:04:48] there are some issues happening on the ceph cluster, investigating, some VMs might be slow to do some io
[15:06:56] * bd808 looks to see what's making stashbot sad
[15:19:05] it looks like there might be connection issues at the moment? I can’t connect to tools via HTTPS nor SSH
[15:19:05] same
[15:19:05] (also bridgebot isn’t bridging IRC⇔Telegram, or if it is it’s being very delayed)
[15:19:05] https://admin.toolforge.org/ is 404
[15:19:05] cloud VPS also affected as far as I can tell
[15:19:05] Hi, is toolforge down?
[15:19:05] ya
[15:19:05] dcaro - looks like it has become a full outage?
[15:19:05] yes, it's cascading
[15:19:05] i was using citation bot but it didn't work
[15:19:05] !status Cloud outage
[15:19:05] oh - no bot
[15:19:05] JJMC89 What does that mean?
[15:19:05] There is an ongoing WMCS outage right now
[15:19:05] !log admin setting ceph cluster as noout + norebalance to avoid ceph recalculating shifts continuously
[15:19:05] thx bd808
[15:19:05] thx all
[15:24:17] Hi all! Seeing an incredible amount of logs from ceph-osd: "slow request osd_op ... currently delayed"
[15:24:17] cwhite: active outage, folks are investigating
[15:24:17] Got it, thanks!
[15:24:17] i think toolforge works now
[15:24:17] tested citation bot: works
[15:24:19] Tested toolforge.org: it redirects properly
[15:26:06] bd808 could you update the notice here?
[15:26:58] oh no
[15:26:58] nevermind
[15:26:58] does not work now
[15:28:45] Myrealnamm-alt: folks are working on the problem. I will update the topic as they verify things. The best thing you can do for the moment is to let them work.
[15:28:56] ok, thanks
[15:35:41] i'm logged in again. thanks for fixing. bye.
[15:42:21] hm, https://versions.toolforge.org/ still isn’t loading for me…
[15:43:03] Lucas_WMDE: is that your tool? If not, I can see about restarting it.
[15:43:15] no, just an arbitrary tool
[15:43:23] bridgebot also isn’t back yet, I can look into that one a bit
[15:43:24] I've been trying to delete a stashbot pod for about 5 minutes now
[15:43:48] I would expect things to be laggy while ceph rebalances and recovers, dcaro is that expected?
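
The host-aggregate and flavor-pinning steps logged above ([12:28:30] and [13:30:06]) map onto standard OpenStack CLI operations. A minimal sketch, assuming an aggregate named network-linuxbridge; the metadata key, hypervisor hostname, and flavor name below are placeholders, not the exact values used in eqiad1:

```bash
# Sketch only: property key, hostname, and flavor name are assumptions.
# Create the aggregate and tag it with metadata the scheduler can match on.
openstack aggregate create network-linuxbridge
openstack aggregate set --property network-agent=linuxbridge network-linuxbridge

# Add each existing hypervisor to the aggregate.
openstack aggregate add host network-linuxbridge cloudvirt1031

# Pin a flavor to hosts in the aggregate; with AggregateInstanceExtraSpecsFilter
# enabled, instances of this flavor only schedule onto matching hosts.
openstack flavor set \
  --property aggregate_instance_extra_specs:network-agent=linuxbridge \
  g3.cores1.ram2.disk20
```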
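
The noout + norebalance step logged at [15:19:05] uses standard Ceph cluster flags: noout stops the cluster from marking unresponsive OSDs out (which would trigger data migration), and norebalance stops it from continuously recalculating placement while operators investigate. Roughly:

```bash
# Freeze recovery churn while the root cause is investigated.
ceph osd set noout
ceph osd set norebalance

# Watch cluster state and the slow/stuck ops mentioned in the channel.
ceph status
ceph health detail

# Once the cluster is healthy again, remember to clear both flags.
ceph osd unset norebalance
ceph osd unset noout
```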
[15:43:48] (on the other hand, https://lexeme-forms.toolforge.org/, one of my tools, *is* working)
[15:44:01] (so some things are working after all)
[15:44:32] based on my past experience, Kubernetes will take a bit to catch up processing all the things that happened or were supposed to happen during the outage
[15:44:32] will look, it might be processes that got stuck on NFS
[15:44:41] so pod deletions etc might take a while to go through
[15:45:00] persondata: Your webservice of type php7.4 is running on backend kubernetes … but wget hangs …
[15:45:08] a restart (toolforge jobs restart or toolforge webservice restart) would force it to get unblocked
[15:46:01] !log bd808@tools-bastion-12 tools.bridgebot Restarted bridgebot job
[15:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bridgebot/SAL
[15:46:07] bridgebot is definitely seeing IRC messages
[15:46:10] ok :)
[15:46:31] yay, bridgebot is back!
[15:47:21] Only 10 people there... (re @lucaswerkmeister: I think it’s still a decent place for such questions, there’s a lot of people lurking in there and occasional activity too)
[15:48:01] I see 233 o_O
[15:48:15] you’re on Libera Chat, right? not Freenode?
[15:48:49] Freenode
[15:49:03] we moved away from Freenode, two years ago or so I believe
[15:49:13] all Wikimedia IRC is on https://libera.chat/ now
[15:49:38] correction, three years (see “hostile takeover” at https://en.wikipedia.org/wiki/Freenode)
[15:51:08] I am using HydraIRC and that does not have "Libera" on it
[15:51:27] Lucaswerkmeister and Yetkin, here is what I see:
[15:51:28] [11:48:02] I see 233 o_O
[15:51:29] [11:48:15] you’re on Libera Chat, right? not Freenode?
[15:51:29] [11:48:49] Freenode
[15:51:30] if you message, it will go through a bot
[15:51:31] lucaswerkmeister, taavi is restarting worker nodes which should resolve lingering NFS issues and reschedule pods. iirc it takes quite a while to go through the whole set so you may see things unstick gradually.
[15:52:31] andrewbogott: alright, thanks
[15:52:54] Yetkin, Libera is available by typing https://web.libera.chat into your internet browser
[15:55:03] !log restarting mon service on cloudcephmon1002 to try to release the 2 stuck ops left
[15:55:03] dcaro: Unknown project "restarting"
[15:55:12] !log admin restarting mon service on cloudcephmon1002 to try to release the 2 stuck ops left
[15:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:55:42] Myrealnamm-alt: thanks 😊
[15:55:44] there’s also some more information at https://meta.wikimedia.org/wiki/IRC/Migrating_to_Libera_Chat, if that helps
[16:00:35] lucaswerkmeister, what are you using? You also appear as forwarded messages by a bot
[16:02:23] the channel is bridged between Telegram and IRC
[16:02:53] https://t.me/wmcloudirc
[16:35:52] Feel free to write something up on the schedule-deployment workboard @lucaswerkmeister. I'm not sure if WikimediaDebug is the right tool to hack to update those, but it would be neat if something did. One possibility would be to add an endpoint to schedule-deployment that `scap backport` could notify when it finishes with a patch. (re @lucaswerkmeister: bd808: tiny idea
[16:35:52] for sched
[16:35:52] ule-deployment – the WikimediaDebug extension could, when it’s installed, automatically modify this me...)
[16:37:05] matterbridge could do a better job of long message splitting...
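
The unblocking advice at [15:45:08] and the bridgebot restart logged at [15:46:01] use the Toolforge CLI. A hedged sketch, run from a Toolforge bastion, where the tool and job names below are placeholders:

```bash
# Switch to the tool account (tool name is a placeholder).
become bridgebot

# Restart a continuous job that may be wedged on a stale NFS handle.
toolforge jobs restart bridgebot

# Or, for a tool whose webservice reports "running" but never answers:
toolforge webservice restart
```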
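
The mon restart logged at [15:55:12] would typically be a systemd unit restart on the mon host itself. A sketch, assuming the conventional ceph-mon@<hostname> unit naming:

```bash
# On cloudcephmon1002: restart the monitor daemon to release the stuck ops.
sudo systemctl restart ceph-mon@cloudcephmon1002

# Verify the mon has rejoined quorum afterwards.
sudo ceph quorum_status --format json-pretty
```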
[16:41:37] !logs tools rebooting tools-sgebastion-10 as it's stuck on nfs
[17:25:03] PSA: when you type `:sort` in a vim buffer instead of `:split` you get very different results than you expected.
[17:35:06] bd808: filed T367213 for schedule-deployment WikimediaDebug integration
[17:35:44] I’m not sure what you had in mind with “one possibility would be…” – adding checkmarks to the calendar to indicate changes that were actually deployed?
[17:35:51] (I never bother to do that tbh ^^)
[17:38:26] @lucaswerkmeister: heh. Yeah, I somehow thought you were talking about a completely different thing.
[17:39:12] I was thinking about "closing the loop" by showing what was actually deployed as you note
[17:40:33] heh, I see
[18:27:28] is there a document about code review / merge guidelines for operations/puppet? writing one for another org, wanna steal
[18:33:33] https://www.mediawiki.org/wiki/Gerrit/Code_review
[18:34:06] hmm, probably not really what you wanted
[18:34:30] some of it is: this section: https://www.mediawiki.org/wiki/Gerrit/Code_review#Complete_the_review
[18:51:52] mutante ya, i primarily want to hear about the fact that in ops/puppet, you are still expected to merge your own patches (after review) because the act of merging is an act of deployment
[18:52:03] unlike say mediawiki/mediawiki
[20:24:17] yuvipanda: there is one sentence here, saying "+2 also means I will deploy this".. but I guess that's it: https://wikitech.wikimedia.org/wiki/Puppet#Updating_operations/puppet
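
The "closing the loop" idea from [16:35:52] and [17:39:12] (an endpoint on schedule-deployment that `scap backport` could notify once a patch is actually deployed) might look something like the call below. Everything here is hypothetical: the URL, path, and payload fields do not exist and only illustrate the shape of the integration:

```bash
# Hypothetical endpoint and payload; nothing like this exists today.
# scap backport could POST here after it finishes syncing a patch.
curl -X POST https://schedule-deployment.toolforge.org/api/deployed \
  -H 'Content-Type: application/json' \
  -d '{"change_id": "<gerrit change>", "task": "T367213", "status": "deployed"}'
```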