[07:16:27] gitlab needs a short maintenance reboot in 45 minutes, at 8:00 UTC
[08:07:04] GitLab maintenance done
[09:03:57] gerrit will also need a short maintenance reboot, at 11:00 UTC
[09:32:35] Hello. Is there a deployer around who could merge/deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/1249932 ?
[09:33:01] It tidies up a now-absent periodic job
[09:35:40] arnaudb: Is gerrit ok?
[09:35:54] It's brutally slow for me
[09:40:38] checking claime
[09:40:58] oof indeed so slow
[09:41:08] I'll revert the ATS change
[09:41:46] (can confirm just got a 502 on gerrit)
[09:46:01] a workaround if you need to push/pull is to swap the https origin with the ssh one temporarily
[09:46:47] I've sent the fix, waiting for the UI to allow me to +2 and submit: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1253396
[09:48:34] waiting for the submit button to load :/
[09:49:44] arnaudb: https://wikitech.wikimedia.org/wiki/Puppet#Gerrit_is_down_(and_requires_a_puppet_change_to_put_it_back)
[09:51:13] I'm missing the submit button wtf
[09:51:21] XioNoX: gerrit is available-ish
[09:51:45] also it's an ATS change which is a bit harder manually :(
[09:52:57] I'll work around the UI and hit the submit via ssh, hold on
[09:53:32] arnaudb: if you can document that after it's fixed, that would be awesome
[09:54:50] arnaudb: I was able to hit submit (I think)
[09:55:10] yes I can confirm
[09:55:18] ssh -p29418 arnaudb@gerrit.wikimedia.org gerrit review 1253396,1 --submit
[09:55:18] error: fatal: change is merged
[09:55:31] are you running puppet merge? or should I?
[09:55:35] on it
[09:55:40] great
[09:56:01] `Fetching new commits from: https://gerrit.wikimedia.org/r/labs/private` → it'll take a while imho
[09:56:33] Hmm puppet merge uses https fetch? What's the rationale?
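The SSH-origin workaround mentioned at 09:46 can be sketched like this. It is a minimal illustration: the repository path and the SSH URL format are assumptions extrapolated from the `ssh -p29418 ... gerrit review` command quoted above, and may differ for your checkout.

```shell
# Create a throwaway repo to demonstrate swapping the origin URL.
repo=$(mktemp -d)
git -C "$repo" init -q

# Normal HTTPS origin (illustrative repository path):
git -C "$repo" remote add origin https://gerrit.wikimedia.org/r/operations/puppet

# Temporarily switch to SSH while the HTTPS frontend is degraded:
git -C "$repo" remote set-url origin "ssh://${USER:-you}@gerrit.wikimedia.org:29418/operations/puppet"
git -C "$repo" remote get-url origin

# Switch back once Gerrit recovers:
git -C "$repo" remote set-url origin https://gerrit.wikimedia.org/r/operations/puppet
git -C "$repo" remote get-url origin
```

Pushes and pulls over the SSH remote bypass the CDN/HTTPS frontend entirely, which is why it works while the UI is timing out.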
[09:56:50] we should probably switch it to use the discovery record
[09:57:18] change has been fetched, merging in progress
[09:57:34] the labs fetch has failed btw
[09:59:38] merge over
[10:01:16] vgutierrez: should I trigger a puppet-agent run on the cp* hosts?
[10:13:04] arnaudb: I think so, do it very slowly and in batches please, so the recovery will be quicker
[10:13:06] I would say yes
[10:13:21] Maybe -b 5
[10:14:23] for sanity check https://www.irccloud.com/pastebin/Ro0LwBlc/sanity.txt
[10:15:21] elukey claime lgty?
[10:15:59] arnaudb: cp-text
[10:16:04] will go faster
[10:16:50] `$ sudo cumin 'A:cp-text' 'run-puppet-agent' -b 5`
[10:17:06] 63 hosts in progress
[10:18:19] thanks :)
[10:20:33] gerrit is much faster for me again, I seem to be lucky with the cp server
[10:20:47] <_joe_> for the next time, go with -b 30
[10:20:55] <_joe_> with puppet runs
[10:21:07] ack
[10:22:08] _joe_: well to be fair we asked to go slowly since it's cp nodes
[10:22:29] fixed for me as well
[10:22:43] monitoring is sending recoveries
[10:34:17] arnaudb: what was the rush re-enabling connection reuse?
[10:34:18] :?
[10:35:37] there was no rush, I've rolled the change on spare and replica with no issue, so I figured it was safe for the primary, and then I was surprised by the UI performance degradation
[10:36:21] the caveat is that neither of the other instances has the UI enabled, I'll debug that issue on the spare instance
[10:36:40] from ATS PoV gerrit became slower
[10:40:24] <_joe_> arnaudb: I guess the takeaway is that you need a better way to test gerrit CDN changes
[10:40:41] <_joe_> because ofc a system that gets no traffic doesn't have the same properties as one that does
[10:41:12] 100% ↑
[10:41:46] _joe_: I have an approach for that specific bug with the spare instance but we need a longer term solution
[10:42:12] <_joe_> like pontoon?
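For readers unfamiliar with cumin's `-b`, the batching semantics being debated above (`-b 5` vs `-b 30`) amount to: act on at most N hosts at a time, so a bad change cannot hit the whole fleet at once. The sketch below is a conceptual stand-in, not cumin itself; the host names are made up, and real cumin additionally runs each batch in parallel and waits for it to finish (or sleeps) before starting the next.

```shell
# Conceptual sketch of batched execution over a host list.
run_in_batches() {
  batch=$1; shift
  while [ "$#" -gt 0 ]; do
    n=0
    for h in "$@"; do
      [ "$n" -ge "$batch" ] && break
      echo "run-puppet-agent on $h"   # placeholder for the real remote call
      n=$((n+1))
    done
    echo "-- batch of $n done --"
    shift "$n"
  done
}

# Seven imaginary cache hosts, batches of five:
run_in_batches 5 cp1001 cp1002 cp1003 cp1004 cp1005 cp1006 cp1007
```

The trade-off discussed in the channel is exactly the batch size: smaller batches recover faster if something goes wrong, larger batches finish the fleet sooner.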
[10:42:19] * _joe_ tags godog in
[10:42:22] (puppet has been run on all cp-text instances)
[10:43:53] I've also created T420184 to discuss the fetch endpoint, because we might have similar issues now that gerrit is behind the CDN
[10:43:53] T420184: Edit puppet-merge to use gerrit.discovery.wmnet instead of gerrit.wikimedia.org? - https://phabricator.wikimedia.org/T420184
[10:44:56] Heads-up oncallers, we are about to reimage and upgrade the druid-public cluster (druid10[09-13]) - There will be downtime for some AQS endpoints etc for the next 30-60 minutes.
[10:50:20] claime: doc updated https://w.wiki/JoMy
[10:52:27] reminder: gerrit will be rebooted at 11:00 UTC
[10:54:58] arnaudb: <3 tyvm
[10:56:42] <_joe_> btullis: isn't AQS used in some public APIs used by e.g. the mobile apps?
[10:57:23] Yes, it was unavoidable. We have announced this maintenance window well ahead of time.
[11:07:04] <_joe_> ok ok
[11:07:21] <_joe_> I wanted to make sure you had actually notified the more impacted people :D
[11:13:01] Thanks _joe_ <3
[11:14:37] ;;/30
[11:14:43] >_<
[11:22:43] that's not how you quit vi ;p
[11:24:03] <_joe_> Emperor: aren't you amazed they don't even need to press the pedal?
[11:25:07] <_joe_> (ofc no one needs a pedal to leave emacs either; you just need to type "sudo halt")
[11:26:43] isn't it `claude close vim for me` nowadays? :-P
[11:27:30] <_joe_> volans: the joke is you still use an editor yourself
[11:27:37] <_joe_> so old fashioned
[11:28:11] I also have an IDE running :D
[11:28:36] <_joe_> where you only use the tab to chat with your coding agent, I hope
[11:30:04] _joe_: I think we're not meant to refer to junior staff as "coding agents"
[11:31:45] <_joe_> I was just having a chat with a former colleague who works in the industry and told me his bosses have put a ban on hiring junior devs
[11:32:26] <_joe_> because of, no one worries about global warming, why would our industry worry about when the current generations of developers will retire?
[11:32:32] <_joe_> *ofc
[11:34:15] 🤬
[11:34:18] Emperor: https://youraislopboresme.com/
[11:35:07] _joe_: Yeah, the "AI will replace jobs" is a convenient figleaf for "I think the world is gonna end so why bother teaching"
[11:35:33] Which is such a betrayal of the oldest human tradition of transmitting knowledge it makes me sick if I think about it for too long
[11:36:00] <_joe_> yeah I was just pointing out the trend
[11:36:28] <_joe_> OTOH, I've already encountered new volunteers who write code only ai-assisted, and run away the moment I ask them not to :)
[11:36:33] (and I say human but tbh, most animals do it...)
[11:36:40] <_joe_> so it's a double-edged sword
[11:36:56] also phabricator updates
[11:37:39] <_joe_> volans: I could partially understand that for someone with not great control of written english
[11:37:42] _joe_: Is it better than "Here's an AI generated phab comment on how I would do this" "Please don't use AI generated comments" "Ok but is it still the right approach and can I vibecode the actual implementation"
[11:38:09] _joe_: that ^^
[11:38:35] <_joe_> claime: yeah... we're just saying humans are inherently lazy and will take short-term convenience over long-term self-interest?
[11:38:54] <_joe_> what will we discover next, that water can be warmed up to be hot?
[11:39:41] I've a startling revelation about the religious leanings of the Bishop of Rome...
[11:40:39] T419967 as an example
[11:40:40] T419967: Add --min-uptime to cookbooks - https://phabricator.wikimedia.org/T419967
[11:40:47] <_joe_> volans: I'm aware
[11:41:11] <_joe_> there's been talk in #engineering-all on slack about having a policy
[11:41:22] I'm not sure it's just inherent laziness. There's a nihilistic component being fed by people who want us to stop thinking so they can sell us interfaces to computers to think for us.
[11:41:30] for now we have Andre :D
[11:41:43] <_joe_> I don't have time to start working on one, but I think it's quite important
[11:42:06] "Sure, our kids don't know how to do anything, but for a time, we generated so much shareholder value"
[11:42:14] my views on GenAI made LWN the other week, I gather
[11:42:36] <_joe_> claime: that's one side of the coin, sure
[11:44:16] Re Phab comments written by AI: I brought this up with my manager and yeah I was told that WMF is working on some AI guidelines (some Slack discussion), so... I'm waiting
[11:44:22] (personally I'm afraid that random AI slop will at some point just DoS us humans)
[11:45:42] I've also heard counter-arguments like "folks who don't speak English well may use AI to feel more confident"; now I start wondering if I should wear a mask in video calls if I start to not like my face (if I got the argument correctly)
[11:45:46] (very legit worry since it's already happening for some projects like curl)
[11:45:55] yeah I saw that post by the curl folks
[11:57:48] don't get me wrong, I think that's a legitimate worry (and I think it is ok to not always show a face on cam), but the fix is to make sure we are open to speaking with people with bad english, or not making cam mandatory in all cases
[11:58:10] I speak terrible english but you still speak to me <3
[12:39:11] I've created T420205 for a cleanup, a good first task for a new engineer (it is a single sed line and a single patch, but requires some testing)
[12:39:13] T420205: Remove deprecated Type=simple from custom systemd units - https://phabricator.wikimedia.org/T420205
[12:58:27] claime: re https://youraislopboresme.com, somewhat ironically, that website seems to be (or, at least, seems like it was) itself a potentially ai-generated wraparound of https://youraislopbores.me/ :p
[12:58:27] https://www.reddit.com/r/YourAISlopBoresMe/comments/1rp9e7t/this_website_popping_up_instead_of_the_real_one/
[12:59:04] A_smart_kitten: Aaah I got got by autocomplete
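T420205's "single sed line" presumably looks something like the sketch below. This is hypothetical: the actual patch, file paths, and sed expression are not quoted in the channel. The idea is that an explicit `Type=simple` is redundant for a service unit with `ExecStart` (it is the systemd default), so deleting the line should be behaviour-preserving.

```shell
# Hypothetical illustration: strip explicit "Type=simple" lines from a
# unit file. The real change would target the puppet-managed templates.
unit=$(mktemp)
printf '[Service]\nType=simple\nExecStart=/bin/true\n' > "$unit"

sed -i '/^Type=simple$/d' "$unit"

cat "$unit"
```

As the task note says, the testing is the real work: confirming the affected units behave identically after the line is dropped.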
[12:59:11] very very sorry
[12:59:16] :P
[12:59:22] no worries :p
[12:59:29] (and very ironic)
[12:59:35] (don't you think)
[13:06:33] is alert1002 entirely happy? my roll-reboot cookbook is spending a lot of time waiting for hosts to be downtimed
[13:08:34] host graphs look ok
[13:09:58] It would also seem like it's not actually downtiming the host down alerts
[13:10:32] ISTR the roll-reboot cookbook eventually gives up waiting for the downtime to be applied (12 attempts, 10s apart)
[13:46:36] said cookbook also keeps suffering from icinga recovery taking unusually long
[14:05:43] ^ same
[14:05:51] 09:51:44 < sukhe> I am wondering if it is due to the number of the reboots underway
[14:11:15] <_joe_> andre: re - mask... I wouldn't see a problem :P
[14:11:22] ehehe
[14:12:19] <_joe_> my point was - what matters is if the comment makes sense in the context of that phab task or if it's just generative slop
[14:12:53] <_joe_> if someone is just asking a chatbot "can you translate 'xyz' to corporate english for me?" that's ok
[14:13:03] <_joe_> better than if they used google translate ofc
[14:13:40] <_joe_> LLMs are useful; the problem is people tend to think it's a magical technology that possesses actual intelligence
[14:13:47] <_joe_> and so delegate thinking to it
[14:14:00] <_joe_> which makes almost as much sense as delegating your thinking to a bash script
[14:14:37] "Go away, or I will replace you with a very small shell script" <-- insults from the 90s that come back into vogue
[14:16:44] I think I agree with all you wrote
[14:18:50] also, icinga is telling me I'm "Not Authorized" to reschedule checks
[14:25:01] Emperor: modules/icinga/files/cgi.cfg has you enabled
[14:25:29] but if you log into Icinga with e.g.
mvernon in lowercase, the UID lookup is case-insensitive per LDAP
[14:26:04] so the SSO login will work, but the Icinga-internal CGI will be case-sensitive and deem mvernon not allowed
[14:26:23] so if you log in as MVernon it should allow you to reschedule checks
[14:27:02] <_joe_> https://media1.tenor.com/m/MT20XW_B_S8AAAAd/lindsay-ellis-no-thanks.gif
[14:27:04] Hm, LMS if I can find a logout button
[14:31:05] https://idp.wikimedia.org/logout
[14:31:54] Yeah, got there in the end, thanks. The case thing is a bit of a UI wrinkle. Thanks :)
[14:46:32] now I can reschedule checks for "now", it just doesn't have any effect. I'm guessing too many things are in flight at once, but it is definitely slowing down my rebooting efforts
[15:08:38] I think it might be related to what Riccardo found in #wikimedia-observability, IIRC if there's an error in the Icinga config, Icinga commands fail to be processed
[15:10:44] that's only a warning though, the config is "ok"
[15:14:35] ah, ok
[15:28:37] I'm also having issues with the downtime cookbook: https://phabricator.wikimedia.org/P89866
[15:28:55] the weird thing is that after 12 retries it said DONE (PASS) but the silence was not created in icinga
[16:19:53] howdy. it looks like we are seeing the "passive checks are awol" issue for the frack hosts in icinga. it's been related to T196336 and usually requires an icinga restart.
[16:19:53] T196336: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336
[16:20:21] we are in the process of switching to alertmanager but just haven't gotten there yet.
[16:21:13] there may be a chunk of warning messages to clear from the mail queues related to it also.
[16:22:06] dhinus: it gives up after 12 retries
[16:22:38] claime: shouldn't the cookbook end with a FAIL?
[16:23:21] dhinus: *shrugs* :P
[16:31:55] claime: I'm an optimist and seeing PASS I hoped the 12th attempt was successful...
obvs it wasn't :)
[16:32:27] tbh looking at the output it does say "Some hosts are not yet downtimed"
[16:49:45] I'll try giving icinga a restart
[16:53:26] be aware of config changes that break, like recently removed contacts that are still group members as part of offboarding
[16:56:18] herron: thanks.
[16:56:48] we are hopefully only a week or so away from transitioning off of icinga.
[16:57:35] !log contint2002 - rebooting
[16:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:01:08] herron: icinga is looking good again. thanks!
[17:01:38] dwisehaupt: glad to hear! also great news re transitioning away!
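The downtime-polling behaviour discussed above (give up after 12 retries, yet the run was labeled DONE (PASS) even though the silence was never created) can be sketched as a retry loop that correctly surfaces exhaustion as a failure. This is an illustration of the expected semantics, not the actual cookbook code; `check_downtimed` is a stand-in for the real Icinga query, and the 10-second sleep is commented out so the sketch runs instantly.

```shell
# Poll until the host shows as downtimed, or give up after $1 attempts.
# Exhaustion is reported as FAIL -- per the discussion above, a run
# should not report PASS when the downtime was never applied.
wait_for_downtime() {
  attempts=$1
  i=1
  while [ "$i" -le "$attempts" ]; do
    if check_downtimed; then
      echo "PASS"
      return 0
    fi
    # sleep 10   # the real cookbook reportedly waits 10s between attempts
    i=$((i+1))
  done
  echo "FAIL: not downtimed after $attempts attempts"
  return 1
}

# Stand-in query: this host never gets its downtime applied.
check_downtimed() { false; }
wait_for_downtime 12 || true
```

Mapping the exhausted-retries branch to a non-zero exit (rather than swallowing it) is exactly what would have turned the misleading DONE (PASS) in P89866 into a visible FAIL.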