[06:36:33] <Krinkle>	 swfrench-wmf: this might be nothing but looking at `ps aux` during a scap run on beta I noticed:
[06:36:53] <Krinkle>	 ```
[06:36:53] <Krinkle>	 jenkins … /bin/sh -c sudo -u www-data -n PHP="php7.4" -- /usr/bin/scap cdb-json-refresh --directory="/srv/mediawiki-staging/php-master/cache/l10n" --threads=2  
[06:36:53] <Krinkle>	 root     … sudo -u www-data -n PHP=php7.4 -- /usr/bin/scap cdb-json-refresh --directory=/srv/mediawiki-staging/php-master/cache/l10n --threads=2
[06:36:53] <Krinkle>	 ```
[06:37:36] <Krinkle>	 on the bright side, `php7.4: command not found` so it's not relyng/trying on it working.
[06:37:49] <Krinkle>	 but at the same time, it's sus :)
[06:38:06] <_joe_>	 Krinkle: that seems like a beta-related issue as scap isn't really using the same codepaths in that env and prod
[06:38:12] <_joe_>	 but yes it's not sus, it's wrong :)
[06:38:43] <_joe_>	 oh, I'm fixing https://phabricator.wikimedia.org/T404826#11303839, apologies, that was my fault
[06:40:43] <Krinkle>	 _joe_: ack, I've been carrying https://gerrit.wikimedia.org/r/c/operations/puppet/+/1198375 on top of puppet things so that PCC/VTC pass
[06:41:12] <Krinkle>	 for beta I was lucky to get this weeks' worth of changes applied just before reload stopped working :)
[06:41:43] <Krinkle>	 m-dot is now redirecting everywhere to standard
[06:41:51] <Krinkle>	 preparing to dropping purges from MW as we speak
[06:41:59] <_joe_>	 Krinkle: you can apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1198424 instead now
[06:42:26] <Krinkle>	 yeah saw, I migth do a puppet change next week to remove some feature flags but for this week I'm all done
[06:43:56] <_joe_>	 I think feature flags are something to keep if we want to be able to bring up test environments with some flexibility btw. My bad for doing a patch in hasste and not adding it
[06:48:13] <Krinkle>	 I meant the temporary feature flags for mobile redirects rollout
[06:49:10] <Krinkle>	 once the purges are gone we won't want to make it easy to start serving MW directly from m-dot. not that it's hard either, it's synth(302) vs setting two headers (Host + X-Subdomain:M)
[06:57:48] <_joe_>	 ah yes fair sorry
[07:10:50] <Krinkle>	 oops, looks like we've been redirecting PURGE as well the last 8 hours.
[07:11:13] <Krinkle>	 they'll be gone in a minute but I regret not thinking about that
[07:11:20] <Krinkle>	 https://grafana.wikimedia.org/d/000000464/varnish-aggregate-client-status-code?orgId=1&from=now-24h&to=now&timezone=utc&var-site=codfw&var-site=eqiad&var-site=eqsin&var-site=esams&var-site=ulsfo&var-site=drmrs&var-cache_type=varnish-text&var-status_type=3&var-method=PURGE
[07:15:20] <_joe_>	 oh so we were purging twice or just responing 301?
[07:17:33] <Krinkle>	 responding 307, but yeah I assume purged doesn't follow a redirect
[07:17:58] <Krinkle>	 I don't knwo whether the purge still "worked" or not but there is nothing for it to purge anyway 
[07:19:06] <Krinkle>	 until yesterday, MW purges both variant URLs, then earier today we started redirecting m-dot to standard with 30x at which point the 20% of purges that are m-dot purged.go received HTTP 307 instead of HTTP 204. And as of 5min, those purges are gone :)
[07:19:29] <Krinkle>	 on the bright side, it gave us 8 hours of excellent telemetry on which portin of purges we're shedding
[07:21:28] <Krinkle>	 at peak today we did 316K PURGE/s of which 128K were MW m-dot and 185K were other.
[07:21:54] <Krinkle>	 assuming 128K of 185K were MW desktop
[07:22:47] <Krinkle>	 that meant at that peak we had 40%/40% MW and 20%/other (e.g. changeprop?)
[07:22:59] <Krinkle>	 but most times a day, the MW portion is smaller
[07:23:37] <Krinkle>	 e.g 100K/s with 20K MW (10K m-dot, 10K desktop) and 80K other services
[07:24:20] <Krinkle>	 that's quite a lot aye?
[07:25:44] <Krinkle>	 was expecting a bigger dip in the "normal" rate, but instead it's mainly going to cut the peaks. I wonder what all that baseline non-MW purges are
[07:40:49] <Krinkle>	 yeah, nearly all for PURGE /api/rest_v1*
[07:41:56] <Krinkle>	 plus there's a bot alphabetically purging every page on en.wiktionary.org at a slow pace
[08:07:25] <_joe_>	 klausman, dpogorzelski: I don't think https://gerrit.wikimedia.org/r/c/operations/alerts/+/1198321's rationale really coincides with the reason we have that check, which is to make sure we can do a full deployment of the current state of the git repo at any time if we need to e.g. rebuild a cluster
[08:07:57] <_joe_>	 I don't think we need to go full CD, but it makes sense not to leave admin changes unapplied for a longer stretch of time
[08:08:30] <_joe_>	 I also get you don't want a noisy alert, so maybe a workflow change might work better in this case?
[08:08:42] <klausman>	 For the ML k8s, the thing is that about every 2-3 days we get an alert about updated IPs in the netpol section that don't really make a difference for our use
[08:09:41] <klausman>	 I think that in an ideal world, safe changes to the netpol stuf should just be pushed, but when I brought it up in the k8s SIG, it was mentioned that in the past there had been bugs that wiped a whole IP list and thus caused breakage
[08:09:51] <klausman>	 So "safe change" might be dificult to evaluate
[08:14:46] <_joe_>	 so the solution would be to break down network policies into common parts and wikikube-only parts?
[08:15:23] <_joe_>	 I'm trying to think of the best solution, it might very well be there isn't one
[08:16:01] <dpogorzelski>	 "alert about updated IPs in the netpol section that don't really make a difference for our use" after years of alerts in other contexts I ended up leaving only those that required immediate intervention
[08:17:14] <dpogorzelski>	 there's some middle ground there for potentially growing problems over time but otherwise it becomes INFO noise
[08:17:21] <_joe_>	 dpogorzelski: I agree that isnignificant alerts should be avoided, so I was trying to solve the issue rather than lagging alerting
[08:17:52] <klausman>	 Thing is _sometimes_ there is useful stuff that needs pushing, like updates to the scarped k8s metric labels. And of course the problem Joe mentioned that if you don't regularly push, changes accrue and you suddenly have a large pile of things that you have to push alongside the small change you wanted to make.
[08:18:24] <dpogorzelski>	 aye, personally i do prefer regular pushes for that reason
[08:18:32] <klausman>	 So far, we haven't had a situation where that resulted in a risky push or outage, but I'd rather not wait until we do :)
[08:19:40] <dpogorzelski>	 with multiple changes at once, if something breaks, you also don't know immediately which one broke things, so there's also that
[08:20:34] <klausman>	 I think the tl;dr is: unless we find a way to evaluate (partial) changes as safe, we'll have to find a balance in alerting frequency. 
[08:23:16] <elukey>	 personally I think 1w is an ok time for that check, I've seen it left unattended so many times for other clusters when checking the karma ui
[08:24:02] <elukey>	 https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=admin_ng
[08:24:34] <elukey>	 but yeah we want to make sure it is not left unattended for too much time
[08:25:07] <klausman>	 Oh, and of course our change only affects the ML clusters, so if other k8s's want more frewuent alerts, that's already the case
[08:25:25] <_joe_>	 I don't have direct experience with that cluster, if 1 week is reasonable to spot differences that's great, but my point about the netpol probably needing a split stands 
[08:25:55] <elukey>	 yep yep that for sure
[08:26:20] <klausman>	 Agreed
[08:29:02] <_joe_>	 now, ofc doing so with helm and its warped abstraction on top of k8s manifests will be a pain
[11:02:51] <akosiaris>	 !sing
[11:02:51] <sirenbot>	 Never gonna give you up
[11:02:52] <sirenbot>	 Never gonna let you down
[11:02:53] <sirenbot>	 Never gonna run around and desert you
[11:02:53] <sirenbot>	 Never gonna make you cry
[11:02:54] <sirenbot>	 Never gonna say goodbye
[11:02:55] <sirenbot>	 Never gonna tell a lie and hurt you
[11:03:06] <Raine>	 oh no
[11:03:56] <akosiaris>	 LoL 
[11:04:01] <Raine>	 that the new handoff jingle? :-D
[11:04:29] <akosiaris>	 Trying to wake up the bot actually, but I like how you think
[11:04:38] <akosiaris>	 Anyway 
[11:04:41] <Raine>	 I see :-D
[11:04:46] <akosiaris>	 Handoff: no pages to report
[11:04:53] <akosiaris>	 Have a nice shift!
[11:05:04] <akosiaris>	 An utterly uneventful one
[11:05:49] <Raine>	 I've been assured that sirenbot will never make me cry, so uneventful seems likely :-D
[11:05:55] <Raine>	 thanks :-D
[11:08:45] * claime goes crashing into the rocks
[11:08:52] <claime>	 Who made sirenbot sing :'(
[11:46:06] <Emperor>	 :)
[11:53:26] <jynus>	 _joe_ Raine : I am running 2 test transfers, once cross dc, the other in eqiad FYI. They should not cause issues, but in case they do, they are managed from root screen on cumin1002 (a single control^c would clean them)
[11:53:47] <Raine>	 ack, thanks jynus!
[11:54:16] <jynus>	 we do backups like those all the time, so volume should not be an issue, but it is a new software version, so one never knows
[12:59:32] <godog>	 FYI I've temporarily enabled "pause before reboot" for d-i on apt1002 and puppet disabled
[14:36:18] <swfrench-wmf>	 Krinkle: thanks for flagging! so, the good news is that `cdb-json-refresh` doesn't run php at all under the hood (all python). what's interesting about that command-line is that scap forwards the `PHP` env var from the environment - i.e., `PHP` is set to `php7.4` in whatever context scap was invoked
[14:54:05] <Krinkle>	 swfrench-wmf: right so could be a scap.ini, or Jenkins/zuul.
[16:02:01] <andrewbogott>	 mutante, Amir1, Krinkle, I randomly selected you from a list of codesearch admins.  O
[16:02:06] <andrewbogott>	 Is there a quick fix for "Wikimedia\Codesearch\ApiUnavailable: Hound request failed: Failed to connect to 172.17.0.1 port 3002 after 0 ms: Couldn't connect to server " ?
[16:02:18] <Krinkle>	 andrewbogott: looking..
[16:03:09] <andrewbogott>	 thx
[16:03:22] <taavi>	 T408218 / T408221, most likely
[16:03:23] <stashbot>	 T408218: Codesearch down/unavailable (2025-10-24) - https://phabricator.wikimedia.org/T408218
[16:03:23] <stashbot>	 T408221: CodeSearch seems to be out of disk space - https://phabricator.wikimedia.org/T408221
[16:07:23] <Krinkle>	 oh there you go
[16:18:30] <mutante>	 !log codesearch9.codesearch truncate -s 0 /var/log/account/pacct -> disk space from 100% used to 37% used T408221 T408218
[16:18:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:36] <stashbot>	 T408221: CodeSearch seems to be out of disk space - https://phabricator.wikimedia.org/T408221
[16:18:36] <stashbot>	 T408218: Codesearch down/unavailable (2025-10-24) - https://phabricator.wikimedia.org/T408218
[16:19:18] <mutante>	 andrewbogott: 12 GB was used by /var/log/account/pacct - process account. it was full of just git commands from September 3 and before
[16:19:24] <mutante>	 accounting
[16:19:51] <mutante>	 truncate instead of rm  keeps it open for the process to write to 
[16:20:08] <andrewbogott>	 huh, I see pacct run rampant like that now and then but I don't know why/when it decides to write so much
[16:21:21] <mutante>	 it might miss a logrotate.d snippet to rotate it 
[16:21:46] <mutante>	 i tried adding one but removing it again as it's not puppetized
[16:25:01] <mutante>	 !log codesearch9.codesearch systemctl restart hound-operations T408218
[16:25:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:06] <stashbot>	 T408218: Codesearch down/unavailable (2025-10-24) - https://phabricator.wikimedia.org/T408218
[16:25:30] <mutante>	 https://codesearch.wmcloud.org/_health/  shows it is slowly coming up again
[16:34:01] <Krinkle>	 thanks mutante 
[16:35:43] <mutante>	 thanks for handling the tickets
[16:35:43] <Krinkle>	 worth a follow-up task perhaps?
[16:40:37] <mutante>	 please don't apply "who touched it last" decision tree. it creates a negative incentive.
[16:42:21] <brett>	 or *do* apply it but just always point it to mutante :)
[16:45:36] <andrewbogott>	 I wasn't pinging people who touched it last, just people whose nicks autocompleted and were on the admin list :)
[18:11:35] <swfrench-wmf>	 is it possible for PCC's view of puppetdb (i.e., nodes that exist) to get out of sync with production?
[18:11:35] <swfrench-wmf>	 context: I'm seeing compilation failures [0] in [1] related to es1027, which was decommed earlier this week [2]. these failures do not happen in production.
[18:11:35] <swfrench-wmf>	 [0] https://puppet-compiler.wmflabs.org/output/1178657/5187/deploy1003.eqiad.wmnet/prod.deploy1003.eqiad.wmnet.err
[18:11:35] <swfrench-wmf>	 [1] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/functions/kubernetes/deployment_server/mariadb_external_storage_ips.pp
[18:11:35] <swfrench-wmf>	 [2] https://phabricator.wikimedia.org/T407595
[18:13:05] <rzl>	 swfrench-wmf: hm, are you hip to https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Updating_nodes already?
[18:13:39] <swfrench-wmf>	 ah, TIL!
[18:13:59] <rzl>	 ("a recent update" was January 2022, just because I was curious)
[18:21:45] <mutante>	 it is supposed to happen automatically though. it used to be manual thing before that.
[18:39:16] * swfrench-wmf is puzzled
[18:39:51] <swfrench-wmf>	 the most recent upload_puppet_facts and pcc_facts_processor runs appear to have succeeded without issue
[18:40:02] <sukhe>	 I have had mixed results with that tbh
[18:40:17] <sukhe>	 meaning that you would expect a manual update to fix stuff but it really doesn't
[18:40:24] <cdanis>	 swfrench-wmf: I think there might be specifically a problem with decommed nodes
[18:40:42] <swfrench-wmf>	 ah, that's possible
[18:41:59] <cdanis>	 ah!
[18:42:01] <cdanis>	 swfrench-wmf: https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Purging_nodes heh
[18:43:12] <cdanis>	 I think you can just ssh to the pcc puppetdb host and check that directory mentioned in the rsync command there
[18:47:07] <swfrench-wmf>	 cdanis: so, what's curious is that `var/lib/catalog-differ/puppet/yaml/production/yaml/facts/` contains no facts export file for es1027
[18:48:01] <cdanis>	 hmm
[18:48:07] <cdanis>	 oh
[18:49:45] <cdanis>	 okay so that is what the instructions are saying about step 2 vs step 3
[18:50:55] <cdanis>	 I don't know offhand, sorry Scott :\
[18:52:58] <swfrench-wmf>	 no worries at all! didn't mean to drag other folks down this rabbit hole :)
[18:53:04] <swfrench-wmf>	 thanks in any case!
[19:02:35] <swfrench-wmf>	 I wish the very last bullet point in [0] was a bit clearer if there's a way to purge the node from PCC's puppetdb directly (i.e., vs. just being a time window since the last time the node appeared in a sync from production).
[19:02:35] <swfrench-wmf>	 [0] https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Purging_nodes
[19:03:17] <cdanis>	 I suspect that the author took it for granted that people knew how to 'normally' purge nodes from puppetdb heh
[19:04:46] <swfrench-wmf>	 heh, yeah there's a gulf between "what I know how to do with `puppet node [...]` in production when I decom something" vs. "how to I project that onto whatever the heck is going on here" :)
[19:05:06] <swfrench-wmf>	 s/how to/how do/
[19:05:31] <cdanis>	 yeah
[19:06:50] <mutante>	 how about trying "puppet node deactivate es1027.eqiad.wmnet" 
[19:07:51] <cdanis>	 mutante: you need to do it against the pcc puppetmaster in wmcs
[19:08:18] <cdanis>	 swfrench-wmf: we could reduce profile::puppetdb::node_ttl I guess
[19:11:53] <swfrench-wmf>	 cdanis: I guess that would be an option, yeah. I'd feel more comfortable if there was just a documented mechanism to do the equivalent of clean / deactivate against the PCC puppetdb instance :)
[19:12:26] <swfrench-wmf>	 to be fair, it seems like the way this works now is probably fine for the vast majority of use cases
[19:12:59] <sukhe>	 my experience has been that most of the times for even new hosts, running the manual update never works
[19:13:07] <swfrench-wmf>	 the PQL in mariadb_external_storage_ips ... not so much
[19:13:25] <cdanis>	 there are a few use cases like that
[19:13:30] <cdanis>	 it's not uncommon
[19:13:30] <sukhe>	 and then there is probably a timer somewhere that I have never bothered to figure out that kicks in and then the facts are happy
[19:13:39] <sukhe>	 I have never bothered to look because it doesn't happen with all hosts
[19:13:42] <cdanis>	 cumin, netbox, a few others I can't remember at least
[19:13:52] <sukhe>	 and timing issues need a lot of time to debug so I just nature take its course :P
[19:15:08] <swfrench-wmf>	 I do wonder if other use cases like this are "easier" in that you're not assuming referential consistency between two distinct systems (i.e., puppetdb and DNS)
[19:15:37] <cdanis>	 there might not be an easy way to do it aside from deleting rows from the postgresql lol
[19:16:25] * swfrench-wmf nods
[19:16:58] <cdanis>	 there's not an actual puppetmaster here, there's just the puppet db, I think
[19:18:24] <mutante>	 andrewbogott: can you ssh root@pcc-db1002.puppet-diffs.eqiad1.wikimedia.cloud 
[19:19:00] <mutante>	 for some reason I have issues with my ssh config now.. but if I could.. I would offer to run that "puppet node deactivate" command there
[19:19:15] <andrewbogott>	 mutante: yes. Want me to do something while I'm there?
[19:19:40] <mutante>	 puppet node deactivate es1027.eqiad.wmnet
[19:19:43] <mutante>	 might help swfrench-wmf 
[19:20:13] <andrewbogott>	 ok, I haven't read the backscroll yet but I'll run it!
[19:20:33] <mutante>	 what we want is that the host disappears from puppetdb. it has been decom'ed
[19:20:45] <swfrench-wmf>	 andrewbogott: we're puzzling about how to get a node out of pcc's puppetdb
[19:20:47] <andrewbogott>	 it doesn't do much.
[19:20:50] <andrewbogott>	 https://www.irccloud.com/pastebin/IQFlSDFj/
[19:22:03] <mutante>	 self-signed certificate .. more rabbit hole?
[19:22:20] <mutante>	 well, that's just the warning though
[19:22:49] <mutante>	 andrewbogott: well, thanks for trying 
[19:23:27] <andrewbogott>	 sure. Sorry, I don't know a ton about how puppetdb population works there
[19:23:37] <swfrench-wmf>	 thanks for trying!