[01:01:22] denisse: Thanks :) Looking good at https://performance.wikimedia.org/xhgui/. I've deployed the upgrade right away as well, and that's now live too. The main difference people might notice is the new "server name" column on the home page; otherwise, pretty much the same as before in the frontend. In the backend, it's now ready for PHP 8 and has all the latest dependencies.
[01:02:08] Krinkle: Amazing progress, thank you for the change and also for the upstream patch!! :)
[01:05:44] denisse: could you make T340713 public now, or should it remain private?
[01:12:47] Krinkle: I think I don't have the access rights to make it public, asking for it to be made public..
[01:19:17] Okay :)
[09:29:32] hello on-callers, I deployed eventgate-main for https://phabricator.wikimedia.org/T338357. In theory nothing bad should happen, but we increased batching for eventgate-main's kafka producer, so if you see anything weird reported for the job queues please ping me
[10:57:11] moritzm: (or anybody else) are you aware of any issues with clang, the C++ API and bookworm?
clang-13 fails to compile katran on bookworm but works on bullseye; this is the error I'm getting: https://www.irccloud.com/pastebin/Mw9l2dLl/
[10:57:53] same cmake version being used, same -std=gnu++17 passed to clang
[11:04:56] that's how BpfLoader.cpp is being compiled, if you're curious: https://www.irccloud.com/pastebin/hkoDMTcr/
[11:29:20] hmmh, I'm not aware of any general issue, and there are various applications which use LLVM over GCC to build C++ (most notably Chromium)
[11:29:49] maybe this is a case of clang 15 being stricter/more conformant while 13 used to be more lenient
[11:30:12] This is clang 13 vs clang 13
[11:30:28] Same issue with clang-14 in bookworm
[11:31:01] ah, the LLVM backport that was added for Firefox ESR, right
[11:31:24] (since natively buster only has 11)
[11:32:52] maybe double-check that the packages provided by https://tracker.debian.org/pkg/llvm-defaults all point to the right version
[11:33:53] But I have no obvious "this must be X" hint, unfortunately
[13:12:27] moritzm: the compiler options are the same, so yeah.. I'm gonna get the bookworm host up and running and get a list of packages
[13:13:25] ack, sounds good
[14:00:43] FYI alerts.git will be getting deployed to all k8s-monitoring promethei; there might be additional alerts firing (cfr https://gerrit.wikimedia.org/r/c/operations/puppet/+/937079)
[14:01:39] jhathaway urandom ^ (as next oncallers)
[14:01:58] godog: thanks
[14:02:20] godog: fun!
[14:02:28] jhathaway: sure np!
[14:02:49] urandom: inorite?
[14:03:03] duesen: fyi, we're moving to the next phase of migrating jobqueue metrics to histograms: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/937090
[14:04:02] I've identified key panels and alerts in the comments and I'll fix those after merging, but searching through all grafana dashboards/alerts isn't feasible unfortunately. So if you have any other stuff you know of, please let me know
[14:37:28] arturo: andrewbogott: hi!
heads-up that I will be upgrading pdns-rec 4.6.2 to 4.8.4 this week since 4.6 is EOL and no longer receiving updates
[14:37:47] no action required from you as I will take care of the deployment, but since it affects the cloud hosts, I thought you should know
[14:38:22] sukhe: you'll also be upgrading on cloudservices hosts?
[14:38:43] I am happy to if you'd like, but I can also skip those and let you do it
[14:38:50] whatever works, but yes, my plan was to upgrade everywhere
[14:39:05] https://debmonitor.wikimedia.org/packages/pdns-recursor
[14:40:17] I think it's fine for you to do it as long as I'm awake when you upgrade :)
[14:40:22] yep, totally fair
[14:40:39] I plan to update the rec dns hosts first, so when I get to yours, I will ping you
[14:40:42] thanks!
[14:40:47] most likely Thursday
[14:40:48] sounds good, thanks for the warning
[14:40:53] Thursday should be good
[14:40:57] ok!
[15:19:15] akosiaris: thanks for the heads up, I have added a comment
[15:31:54] mutante: resolve T255132 and assign to Mor?
[15:31:54] T255132: Better handling of memcached service - https://phabricator.wikimedia.org/T255132
[15:32:31] jynus: since the only thing that kept me from it was that he asked for your ok.. sure :) thanks
[15:32:47] thanks
[15:33:11] clicks
[15:33:17] already done :)
[16:10:30] The last Puppet run was at Tue Jul 11 09:32:38 UTC 2023 (397 minutes ago). Puppet is disabled. roll out 93627
[16:10:44] jbond: do you have an ETA on that?
:)
[16:11:44] (copy paste missed one char, it's actually 936273)
[16:13:19] vgutierrez: which host? i thought i had re-enabled it everywhere already
[16:13:26] cp6001
[16:13:43] hmm, maybe i missed drmrs
[16:13:46] jbond: yep
[16:13:47] https://thanos.wikimedia.org/graph?g0.deduplicate=1&g0.expr=puppet_agent_enabled%20%3D%3D%200&g0.max_source_resolution=0s&g0.partial_response=0&g0.range_input=1h&g0.stacked=0&g0.store_matches=%5B%5D&g0.tab=1
[16:13:52] drmrs seems to have it disabled
[16:14:09] re-enabling now, sorry about that
[16:14:34] cp[6001,6003-6016].drmrs.wmnet are impacted by your puppet message
[16:14:57] vgutierrez: should all be enabled now, sorry about that
[16:15:19] and we've got cp6002 with puppet disabled without any good reason at all
[16:15:29] fabfur: is that you?
[16:15:43] no
[16:16:14] The last Puppet run was at Mon Jul 10 14:09:02 UTC 2023 (1567 minutes ago). Puppet is disabled
[16:16:16] yikes
[16:17:01] without reason
[16:18:18] Jul 10 14:32:53 cp6002 puppet-agent[1238025]: Disabling Puppet.
[16:20:17] do you think it's safe to enable it?
[16:20:41] cumin1001 issued the cmd, apparently
[16:20:57] cause.. Jul 10 14:32:51 cp6002 sshd[1237948]: Accepted publickey for root from 2620:0:861:103:10:64:32:25 port 55404 ssh2: ED25519 SHA256:/YROuxwrYx4/kq1z+mFO7kFnjGLX5ebvk10b9fQoh5c
[16:21:00] 2023-07-10 14:32:44,174 [INFO 4073695 cumin.transports.clustershell.ClusterShellWorker.execute] Executing commands [cumin.transports.Command('disable-puppet \'acmechief maintenance - vgutierrez\'')] on '196' hosts
[16:21:27] volans: sure... but that comes with a proper reason
[16:21:29] if it was that, there should be the same maintenance reason
[16:22:34] other hosts in drmrs complain on -operations about a recent puppet run
[16:22:40] volans: or something farted and the reason got lost?
[16:22:56] that's weird indeed
[16:23:24] and I remember not getting any errors on disable-puppet and enable-puppet
[16:24:09] 100.0% (196/196) success ratio
[16:24:45] yep
[16:24:56] but I don't see any other disable-puppet cmd logged on cumin1001 yesterday
[16:25:06] around that time
[16:25:44] so I'm re-enabling it if that's ok with you, volans
[16:25:50] unless you wanna do further debugging
[16:26:01] give me a sec
[16:26:03] sure
[16:26:04] are you re-enabling also on the other hosts?
[16:26:13] fabfur: that's already the case
[16:26:28] puppet-enabled only reports it disabled on cp6002 for the whole A:cp
[16:26:50] so it's Icinga that is misbehaving?
[16:27:07] nope, I think j.bond disabled it for a while
[16:27:12] ah ok
[16:27:14] and now it's recovering
[16:27:43] the fact that those are in drmrs too is a coincidence then
[16:27:52] apparently :)
[16:27:57] vgutierrez: go ahead for me
[16:28:00] volans: ack
[16:30:02] fabfur: your new ssh key just made it to cp6002 ;P
[16:30:22] finally :)
[16:30:27] all good there
[16:31:49] fwiw the lock file was created at 2023-07-10 14:38:42.011402732
[16:45:56] vgutierrez: I think it could have been either a bug or a race condition; from the timing, the lock file was created at the time of the re-enable
[16:46:12] 2023-07-10 14:38:33,577 ... .Command('enable-puppet \'acmechief maintenance - vgutierrez\'')] on '196' hosts
[16:46:36] 2023-07-10 14:38:46,054 ... Completed command 'enable-puppet 'acmechief maintenance - vgutierrez''
[16:46:56] volans: like I issued the enable-puppet before disable-puppet finished?
[16:47:03] that didn't happen BTW
[16:47:20] no no, a race in the enable-puppet
[16:47:39] Tue 2023-07-11 16:38:00 UTC 9min ago puppet-agent-timer.timer
[16:47:55] same minute...
[16:48:15] so I guess yesterday the puppet run was at the same time
[16:51:48] vgutierrez: this is "fun"...
[16:51:49] Jul 10 14:38:41 cp6002 puppet-agent[1241130]: Enabling Puppet.
[16:51:50] Jul 10 14:38:42 cp6002 puppet-agent[1241133]: Skipping run of Puppet configuration client; administratively disabled (Reason: '');
[16:52:33] ok, so double-check with puppet-enabled afterwards
[16:52:34] :)
[16:53:00] we should probably have the enable script do that, to be sure
[22:13:06] akosiaris: regarding the potential ingest/packet loss problem for k8s services (following up from the hackathon), any update on what or where we think the cause is?
[22:13:18] I'm seeing another service that may be suffering from the same thing, ref T341634
[22:13:19] T341634: MediaWiki frequently receives HTTP 500 from AQS (via PageViewInfo extension) - https://phabricator.wikimedia.org/T341634
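[Editor's sketch] For the llvm-defaults question at 11:32 — whether the unversioned clang/clang++ wrappers on the bookworm host actually resolve to the expected version — something like the following could confirm where each wrapper points. The helper name `resolve_tool` is made up for illustration; only `command -v` and `readlink -f` are standard tooling:

```shell
#!/bin/sh
# Resolve an unversioned compiler wrapper to the real binary it points at,
# to double-check the llvm-defaults symlinks mentioned in the discussion.
resolve_tool() {
    # `command -v` finds the wrapper on PATH; readlink -f follows symlinks.
    path=$(command -v "$1" 2>/dev/null) || { echo "$1: not installed" >&2; return 1; }
    echo "$1 -> $(readlink -f "$path")"
}

# Example usage on the build host (output depends on the installed packages):
#   resolve_tool clang
#   resolve_tool clang++
#   clang --version | head -n1
```

Comparing the output of this on the bullseye and bookworm hosts would quickly rule the symlink theory in or out.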
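[Editor's sketch] The race discussed at 16:45–16:53 — enable-puppet reporting success while a concurrently scheduled agent run re-creates the disable lock in the same minute — is exactly what the 16:53 suggestion ("have the enable script do that") would catch. A minimal verify-and-retry wrapper could look like this; the function name, retry count, and lock-file handling are illustrative assumptions, not the actual WMF scripts (where `rm -f` stands in for `puppet agent --enable`):

```shell
#!/bin/sh
# Hypothetical sketch: enable the puppet agent, then verify the disable lock
# really stayed gone, retrying once in case a racing agent run re-created it.
enable_and_verify() {
    lock=$1                            # path to the agent's disable lock file
    attempt=1
    while [ "$attempt" -le 2 ]; do
        rm -f "$lock"                  # stand-in for `puppet agent --enable`
        sleep 1                        # window in which a racing run could re-lock
        if [ ! -e "$lock" ]; then
            echo "puppet enabled (attempt $attempt)"
            return 0
        fi
        echo "lock re-appeared after attempt $attempt, retrying" >&2
        attempt=$((attempt + 1))
    done
    echo "puppet still disabled after retries" >&2
    return 1
}
```

The key point is the post-check: success of the enable command alone is not proof the agent is enabled, as the 14:38:41/14:38:42 journal lines above show.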