[07:35:56] It seems mw-experimental is missing several weeks worth of /srv/mediawiki changes. According to https://wikitech.wikimedia.org/wiki/Mw-experimental these are updated on every scap run.
[07:36:13] krinkle@wikikube-worker-exp1001:/srv/mediawiki$ ls -l
[07:36:13] Oct 8 18:40 php-1.45.0-wmf.22
[07:36:13] Oct 14 03:02 php-1.45.0-wmf.23
[07:36:18] There is no wmf.24 or wmf.25
[07:37:01] I ran `helmfile -e eqiad -i apply` just now from a deployment server, which did have a 7 day diff, but that doesn't explain a 2 week gap, and only changes the container, not the /srv/mediawiki mount, right?
[07:37:20] - name: mediawiki-pinkllama-app
[07:37:20] - image: docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-debug:2025-10-21-102842-publish-81
[07:37:20] + image: docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-debug:2025-10-28-201802-publish-81
[07:38:14] interesting, the one in codfw does have the latest code
[07:38:37] maybe scap is only syncing to the current/primary dc and missing awareness of the other one?
[08:45:16] <_joe_> Emperor: https://gitlab.wikimedia.org/repos/sre/conftool/-/merge_requests/98 unless we've missed some important step, conftool 6 should be ready to import from apt-staging
[08:45:24] <_joe_> you might need to add it to reprepro though
[08:46:32] ack, thanks.
[08:47:34] <_joe_> yep, all exported AFAICT https://apt-staging.wikimedia.org/wikimedia-staging/pool/main/c/conftool/
[08:48:26] <_joe_> anyone remembers which is the repo where I have to add repositories to be indexed by codesearch?
[08:53:41] https://gerrit.wikimedia.org/r/plugins/gitiles/labs/codesearch/+/refs/heads/master/write_config.py
[09:06:35] hi, there is an error in https://spiderpig.wikimedia.org/jobs/820
[09:06:55] "WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/mediawiki-dumps-legacy-deploy-dse-k8s-eqiad.config"
[09:08:17] kostajh: Not an error, it's fine
[09:08:35] ok
[09:11:14] claime: it seems to have interfered with the backport https://spiderpig.wikimedia.org/jobs/820
[09:11:37] but I can't tell for sure
[09:11:44] <_joe_> kostajh: I would not think it did
[09:11:45] I do see `Finished sync-prod-k8s (duration: 06m 00s)`
[09:11:56] Followed by `09:10:23 backport failed: Command '['/usr/bin/scap', 'sync-world', '--pause-after-testserver-sync',`
[09:12:09] you're confusing that warning with the error directly below it
[09:12:22] Yeah
[09:12:40] Something went wrong with the actual release deployment for dumps
[09:13:10] ok. Well, IMO it's confusing. But as long as you're not worried, then I won't be either :)
[09:13:25] brouberol: btullis ^^ Can you check why mw deployments fail on dumps?
[09:14:09] I should be OK to continue with another backport, then?
[09:14:53] It will fail the same
[09:15:04] looks like the k8s api server for dse-k8s is in trouble?
[09:15:14] The connection to the server dse-k8s-ctrl.svc.eqiad.wmnet:6443 was refused - did you specify the right host or port?
[09:16:21] manually curling that from deploy2002 works fine though?
[09:16:39] I'm away this week, sorry.
[09:16:46] Yeah may have been a blip, kubectl works
[09:16:57] I'll try and redeploy manually before you try another backport kostajh
[09:17:11] claime: ok, thanks
[09:19:33] kostajh: good to go
[09:20:17] claime: thank you!
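
A side note on the dse-k8s blip above: when kubectl reports "connection refused" but a manual curl from a deploy host works, it can help to probe the apiserver endpoint at the TCP/TLS level to rule out a transient refusal. A minimal Python sketch; the host and port come from the error message quoted above, running it from a deployment host (e.g. deploy2002) is assumed, and it only checks reachability, not certificates or authentication:

    #!/usr/bin/env python3
    """Probe a Kubernetes apiserver endpoint at the TCP/TLS level."""
    import socket
    import ssl

    HOST = "dse-k8s-ctrl.svc.eqiad.wmnet"  # endpoint from the kubectl error above
    PORT = 6443

    def probe(host: str, port: int, timeout: float = 5.0) -> None:
        # Plain TCP connect first: a "connection refused" fails right here.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            print(f"TCP connect to {host}:{port} OK")
            # The apiserver speaks TLS; handshake without verifying the cluster
            # CA, since we only care about reachability, not identity.
            ctx = ssl.create_default_context()
            ctx.check_hostname = False
            ctx.verify_mode = ssl.CERT_NONE
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                print(f"TLS handshake OK ({tls.version()})")

    if __name__ == "__main__":
        try:
            probe(HOST, PORT)
        except OSError as exc:  # covers ConnectionRefusedError and timeouts
            print(f"probe failed: {exc}")

If this succeeds while kubectl still fails, the problem is more likely kubeconfig or apiserver-side than network-level, which matches the "may have been a blip" conclusion below.
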
[09:30:01] claime: sorry, I'm only getting this now
[09:30:07] reading scrollback, 2s
[09:30:08] brouberol: all good mate
[09:30:14] no real problem, just a blip
[09:30:50] blip blop
[09:31:56] We have been invaded by the robots and brouberol is the first victim
[09:36:06] brouberol has been assimilated.
[09:36:16] all what remains is yaml now
[09:36:29] yeesh, I thought the Borg were bad enough, but yaml?
[09:36:52] "Someone set up us the whitespace"
[09:43:52] we're entering un-chart-ed territory. It's not just empty space. It's whitespace.
[09:45:07] no alpha channel?
[10:09:14] hey folks, I just depooled maps in codfw to move all traffic to the new eqiad stack, and test it for a day or two
[10:09:29] no issues registered so far with the new stack, everything looks good
[10:10:16] task with all the info: https://phabricator.wikimedia.org/T381565
[10:10:36] In theory we should be able to nuke the old nodes next week if everything goes as planned
[13:44:42] puppet PCC builds are currently broken for prod (and change) with `Error while evaluating a Function Call, DNS lookup failed for es1030.eqiad.wmnet` https://puppet-compiler.wmflabs.org/output/1199778/7495/deploy1003.eqiad.wmnet/prod.deploy1003.eqiad.wmnet.err
[13:45:23] this is coming from a puppet function that's called `ipresolve`, which is failing to resolve a given hostname (presumably for a host that was recently terminated)
[13:45:47] I'm running the PQL query on the puppetdb host, and I'm not seeing es1030.eqiad.wmnet as part of the output
[13:46:40] cf https://phabricator.wikimedia.org/P84345
[13:47:31] which is leaving me scratching my head a bit
[13:49:36] yeah certainly interesting given 1030 was decommed as well
[13:49:42] and is also not in site.pp
[13:50:01] and not even in the PQL response
[13:50:10] yep, that one being more direct
[13:50:26] could it be coming from the ci private repo, somehow?
[13:51:03] remember that pcc has its own puppetdb instance that is refreshed (iirc) daily with a cronjob from the live instance
[13:51:10] ooh, TIL
[13:51:18] host was decommed on the 22nd though.
[13:51:44] https://sal.toolforge.org/log/7eEqDJoB8tZ8Ohr04VRf
[13:52:09] (actually, puppet PCC builds *for changes on the deployment server* are broken. Not site-wide.)
[13:53:54] ipresolve($mariadb_hostname, 4)
[13:54:02] $res = wmflib::puppetdb_query($pql)
[13:54:34] so it's quite clear that this is returning es1030 for some reason and then the resolve returns an NXDOMAIN (that bit being expected)
[13:54:54] if somehow the CI puppetdb was not refreshed, that could explain why
[13:55:15] btw I wrote that function, so I'll gladly take the blame there
[13:55:33] yeah that seems to be the most rational explanation right now, given that the PCC output
[13:57:32] taavi: would you happen to know more about how the sync is performed?
[13:57:38] brouberol: can you try running the PQL on pcc-db1002.puppet-diffs.eqiad1.wikimedia.cloud
[13:57:51] brouberol: should be https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Manually_update_cloud
[13:58:07] not a lot more than what is written on https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Updating_nodes
[13:59:06] somehow I can't ssh on the node
[14:00:38] ah nvm, pebkac
[14:02:06] brouberol@puppetserver1001:~$ sudo systemctl status upload_puppet_facts.service
[14:02:06] ...
[14:02:06] Oct 29 12:10:40 puppetserver1001 systemd[1]: upload_puppet_facts.service: Deactivated successfully.
[14:02:06] Oct 29 12:10:40 puppetserver1001 systemd[1]: Finished upload_puppet_facts.service - Upload facts export to puppet compiler.
[14:02:06] Oct 29 12:10:40 puppetserver1001 systemd[1]: upload_puppet_facts.service: Consumed 3min 54.939s CPU time.
[14:02:25] so that seems to be working, every day
[14:03:20] yeah otherwise we would have noticed that elsewhere
[14:03:29] can you run https://phabricator.wikimedia.org/P84345 on that node and see the output you get?
[14:07:46] I can't seem to be able to reach pcc-db1002.puppet-diffs.eqiad1.wikimedia.cloud either on port 22 or 8080 from puppetserver1001 or puppetdb1003
[14:09:18] sigh, nevermind me. My daughter is playing full volume in my office, it's not helping. I could connect from my laptop
[14:09:28] brouberol: cloud and prod don't connect that way yeah
[14:09:32] and, we discussed a very similar issue in this channel last Friday
[14:09:48] I think the conclusion we reached was that, once decommed, hosts take 7 days to age out of the PCC puppetdb
[14:10:02] (which can affect compilation of other hosts via exported resources ofc)
[14:10:19] is that in some cases or all cases of decomm?
[14:10:33] because I would think that we would see more instances of that (PCC failures) if it was in all cases
[14:10:36] I can confirm that running the same query against the cloud puppetdb returns es1030
[14:11:03] See https://phabricator.wikimedia.org/P84345#338876
[14:11:53] ok so that's clearly it
[14:12:33] given that cdanis says this happened last week as well (I vaguely recall that conversation), clearly something is missing in this update
[14:12:52] in theory you could merge the change and prod should be fine and PCC should catch up, but yeah, this is not ideal
[14:13:21] let's try manually updating cloud and see if that helps?
[14:13:27] I am running it
[14:13:49] hmm that snippet needs an update clearly
[14:14:02] unrelated to the pcc issue, but I wonder if the code could use the node IP address from puppetdb instead of resolving the DNS name?
[14:14:33] the expiration thing is a puppetdb feature, not something we implemented in the update scripts
[14:15:43] sukhe: so the issue is when other nodes that you are trying to compile then collect exported resources from the hosts that no longer exist
[14:16:02] this is *mostly* rare -- the most common case of cross-host exported resources used to be icinga alerts or nrpe or something
[14:16:17] but there's still cases where it happens esp when members of a cluster all want each other's IP addresses or other data
[14:16:24] and reference puppetdb for that
[14:17:10] taavi: that may be more problematic because a puppetdb lookup could work whereas resolve() would fail as the auth servers are supposed to be the authoritative source for the IP, not puppetdb
[14:17:36] cdanis: ok thanks. I think this could warrant a task, given that this is the second instance in a short time
[14:17:39] the Friday thing and now
[14:18:01] taavi: I'm actually looking into this atm, good call
[14:19:44] sukhe: let's dump puppetdb to /etc/hosts and run dnsmasq on the wmcs pcc compilation hosts
[14:19:44] brouberol: IMO that can lead you to more corner cases, depending on what you are trying to do.
it's always safest to use DNS
[14:19:57] cdanis: :]
[14:20:21] it could be a puppetdb_query.py | jq one liner
[14:20:25] and it would work :D
[14:20:33] you just need an excuse to do jq
[14:21:31] sukhe: https://github.com/noperator/sol from a kindred spirit
[14:22:19] cdanis_one_liners | sol
[14:23:24] <_joe_> the problem of populating puppetdb for pcc is just that we have small VMs to run it. We could refresh daily if we had more power, but I'll defer to jhathaway for that
[14:23:45] _joe_: it'd be enough if we could push a deletion to pccpuppetdb upon a decom
[14:24:06] yeah that seems to be the thing that is missing, somehow now
[14:24:08] <_joe_> cdanis: we ofc can
[14:24:09] it used to work just fine
[14:24:10] I didn't get as far as looking at the postgresql schema
[14:24:19] <_joe_> sukhe: what worked just fine?
[14:24:30] sukhe: idk I think it's always worked this way, it's just that more things use exported resources nowadays
[14:24:33] host deletion and pcc puppet db being updated
[14:24:46] <_joe_> no it always updated weekly on the weekend
[14:24:56] <_joe_> because during the update, pcc doesn't really work well
[14:25:16] _joe_: how much more resources are we talking about?
[14:25:19] that was deep lore I didn't know
[14:25:36] Instances
[14:25:38] Used 9 of 9
[14:25:40] VCPUs
[14:25:42] Used 20 of 26
[14:25:44] RAM
[14:25:46] Used 40GB of 52.1GB
[14:25:56] <_joe_> taavi: I wouldn't know, last time I looked we were on puppet 5 :P
[14:26:28] <_joe_> but I'm sure that with a larger puppetdb or multiple instances we could populate every day and swap in the most updated db
[14:26:32] <_joe_> stuff like that
[14:26:32] I can look at running it daily, I don't recall running any benchmarks on what horse power is needed
[14:26:53] taavi: would you like to be a member of the `puppet-diffs` horizon project or would granting that to you be superfluous? 😇
[14:27:02] oh wait, you are already
[14:30:15] cdanis: I can see everything in horizon already (plus this is public at https://openstack-browser.toolforge.org/project/puppet-diffs regardless)
[14:35:35] I finally got a working PQL query to get the ipv6/4 for a given fqdn. Is it worth pursuing, to take DNS and these kinds of discrepancies out of the way?
[14:36:53] brouberol: there's some tunnelencabulator code you can reuse for manipulating /etc/hosts
[14:40:01] okay, in seriousness brouberol -- https://www.puppet.com/docs/puppetdb/7/api/admin/v1/cmd#delete-version-1
[14:40:22] FTR the query is `inventory[certname,facts.networking.ip,facts.networking.ip6] { certname ~ 'es' and certname ~ 'eqiad.wmnet' and resources { type = 'Class' and title = 'Role::Mariadb::Core' } }`
[14:42:26] with an `order by certname` for good measure
[14:43:12] can you try that delete command against the same endpoint as you're querying?
[14:43:18] brouberol: have you looked at our existing functions to see if there is anything useful? (things like wmflib::class::ips or wmflib::resource::ips or the more generic for querying puppetdb
[14:43:33] if that's something you need to puppetize
[14:45:57] at the time, yes, but (massive IIRC) given that many of the mysql servers have the `mariadb::core` role, and I only wanted a subset of them, PQL ended up being the way to go
[14:46:02] ^ volans
[14:46:20] k
[14:46:31] cdanis: I was querying puppetdb1003 at the time. I gather I should try this against the pcc puppetdb ?
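
To make the suggestion above concrete: PuppetDB answers PQL on its /pdb/query/v4 endpoint, and the "delete" command linked above goes to the /pdb/admin/v1/cmd endpoint. Below is a minimal Python sketch of querying the PCC copy and asking it to forget a stale certname. The hostname, port 8080, the PQL query and the es103x certnames are taken from this conversation; whether the admin endpoint is actually enabled and reachable without auth on that instance is an assumption, and you would only ever want to run the delete against the PCC copy, never the production puppetdb:

    #!/usr/bin/env python3
    """Query PuppetDB with PQL and delete stale certnames via the admin API."""
    import requests

    # PCC's own PuppetDB copy (hostname/port from the conversation above).
    PUPPETDB = "http://pcc-db1002.puppet-diffs.eqiad1.wikimedia.cloud:8080"

    # The PQL query posted above, with the suggested ordering folded in.
    PQL = (
        "inventory[certname,facts.networking.ip,facts.networking.ip6] { "
        "certname ~ 'es' and certname ~ 'eqiad.wmnet' and "
        "resources { type = 'Class' and title = 'Role::Mariadb::Core' } "
        "order by certname }"
    )

    def pql_query(query: str) -> list:
        """POST a PQL query to the v4 query endpoint and return the parsed rows."""
        resp = requests.post(f"{PUPPETDB}/pdb/query/v4", json={"query": query}, timeout=30)
        resp.raise_for_status()
        return resp.json()

    def delete_node(certname: str) -> dict:
        """Ask PuppetDB to forget a node ("delete" admin command, version 1)."""
        cmd = {"command": "delete", "version": 1, "payload": {"certname": certname}}
        resp = requests.post(f"{PUPPETDB}/pdb/admin/v1/cmd", json=cmd, timeout=30)
        resp.raise_for_status()
        return resp.json()  # e.g. {"deleted": "es1030.eqiad.wmnet"}

    if __name__ == "__main__":
        for row in pql_query(PQL):
            print(row)
        # Hosts already decommissioned in production but still lingering in PCC:
        for stale in ("es1030.eqiad.wmnet", "es1031.eqiad.wmnet", "es1032.eqiad.wmnet"):
            print(delete_node(stale))

The `{ "deleted": ... }` response quoted a few lines below is exactly what that admin command returns on success.
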
[14:46:44] brouberol: yes
[14:46:49] trying
[14:48:24] { "deleted" : "es1030.eqiad.wmnet" }
[14:48:42] and it's no longer returned by the PQL query now
[14:51:17] I removed es1031 and 1032 as well, as they were no longer present in the output when querying prod
[14:51:39] so first off thanks
[14:51:57] trigger your PCC run again just to be sure, but, cool
[14:54:02] second, I think it's worth removing DNS from the equation. For context, these functions are called to define IP/ports behind the external-services k8s Service resources, used by calico to define network policies. If we remove node X from production, it might stick around in the PQL output from the PCC puppetdb, meaning its ip will take up to 7d to be
[14:54:02] removed from the firewall rules, which is probably better than impeding our ability to ship changes safely during that time. WDYT?
[14:55:06] the change would be pretty minor on my end, and the DX improvement might be worth it
[15:03:56] (I remove es2026.codfw.wmnet as well)
[15:03:59] *removed
[15:04:39] brouberol: I guess it's up to you and your requirements but where I was coming from is that if a host is decommed, you will most certainly (famous last words?) get NXDOMAIN from the auth servers
[15:04:45] whereas puppetdb and the decom and all that layers, I am not sure
[15:04:48] but yeah, of course your call
[15:05:49] I can confirm that with these deletions from PCC puppetDB, PCC is now as happy as a bee
[15:06:20] cool
[15:06:47] sukhe: thanks. I think it makes sense to tolerate "stragglers" instead of failing PCC when we delete a host, in that particular context
[15:07:08] I'll create a ticket. Thanks taavi for the suggestion
[15:07:46] we probably should implement some sort of mechanism where prod decomms trigger pcc-pdb deletions
[15:09:00] indeed
[15:16:52] T408706
[15:16:53] T408706: MariaDB host decommissioning causes PCC failures for the deployment hosts - https://phabricator.wikimedia.org/T408706
[15:39:29] nice, PCC shows a NOOP
[17:37:18] possibly dumb question: did something change about netbox API access? it used to work, now I'm getting a 403 and I can't figure out why
[17:40:39] Raine: set a user agent
[17:40:42] possibly
[17:41:01] oh
[17:41:03] yeah
[17:41:12] well that'd suck, I don't think pynetbox can do that
[17:41:43] You can set headers
[17:41:54] https://pynetbox.readthedocs.io/en/latest/advanced.html
[17:42:10] right
[17:46:37] I think something like "self.http_session.headers.update({"User-Agent": user_agent})" will make it default for all following requests.
[17:46:51] yep, that's it, thanks a ton claime <3
[17:47:14] mutante: the docs suggest a slightly different way, I suppose they're equivalent
[17:47:36] thank you too <3
[17:47:48] it's working in my python console now :D
[17:47:49] probably, yea. just listen to those docs
[17:47:53] great
[17:57:05] claime: no good deed goes unpunished, https://gitlab.wikimedia.org/repos/sre/serviceops-kitchensink/-/merge_requests/25 pretty please if you're still around :D
[17:57:33] Oh please no don't distract me from writing ats lua please lord no
[17:57:49] XD
[17:58:04] You have merge rights?
[17:58:22] looks like it
[17:58:24] thank you <3
[18:25:18] swfrench-wmf: php8.3 is coming along nicely I see. 50% of cookied mw-web already!
[18:25:28] did you guys do something specific to make ssh agent work with the yubikey on Debian? I can login but have to type the passphrase each time (and the docs told me it's less important but I should still set one). I can ssh-add it but "agent refused operation".
That issue seemed to be that it has no GUI to ask for the user presence click and installing ssh-askpass was supposed to fix that (though I
[18:25:34] dont want GUI).. but it did not.
[18:26:33] Krinkle: yeah! I'm (very) cautiously optimistic as we chug along :)
[18:27:55] as we saw last time, the actual fraction of traffic migrated is well less than the enrollment fraction, but still - a sizable amount of traffic on 8.3 at this point
[19:18:19] the SSH agent should handle it like every other key, it worked for me out of the box
[19:19:06] if you have multiple SSH agents, maybe your SSH_AUTH_SOCK variable doesn't point to the correct socket?
[19:23:04] mutante: how did you create your key?
[19:23:09] did you create a resident key?
[19:24:04] that's one of the ways that you can wind up needing to type the passphrase each time
[19:24:18] cdanis: I followed this https://wikitech.wikimedia.org/wiki/Yubikey-SSH-FIDO#1._Generate_new_SSH_key_pair(s)
[19:24:34] from the mail about moving to FIDO-backed
[19:24:35] what binary is on the other side of your ssh-agent socket?
[19:26:29] the socket is at /run/user/1000/keyring/ssh and /usr/bin/ssh-add is how I added it to the agent
[19:27:59] mutante: sudo fuser -av /run/user/1000/keyring/ssh
[19:28:02] sign_and_send_pubkey: signing failed for ED25519-SK is the actual error
[19:28:27] gnome-keyring-d
[19:28:32] there's your problem :)
[19:28:37] I am on MATE desktop (gnome2 continued)
[19:29:06] if you want to check, just `eval $(ssh-agent)` in a shell and then try adding it (and then ssh'ing to prod)
[19:31:40] arrg.. now that I installed ssh-askpass ..it pops up a GUI window hidden behind every other open window.. that asks for the user presence click
[19:31:56] but other than that.. yes I can login. thank you
[19:32:00] for me it pops to the front heh
[19:32:32] so I did have to install ssh-askpass though..right?
[19:32:45] or it failed because there was no GUI for that
[19:32:51] tries
[19:33:37] idk I'm on recent kde plasma on trixie and everything just worked
[19:33:58] yea, so if I remove that package.. then it just sits there and does nothing
[19:34:10] wish I would just get asked without GUI
[19:34:29] so when I googled about this it told me installing ssh-askpass is the fix.. and it was
[19:34:41] just that the second part was I did not see the window
[19:34:58] mutante: perhaps you would prefer ssh-askpass-fullscreen
[19:35:45] aha! oh.. there is also ssh-askpass-gnome
[19:36:14] what I really want is that to be a TUI, but ok:)
[19:37:19] aah! ssh-askpass-gnome works better for me than ssh-askpass. it doesnt do the background window thing:)
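
Closing the loop on the earlier netbox 403: pynetbox routes all its HTTP calls through a requests session exposed as http_session, so the header update quoted at 17:46 becomes the default for every subsequent request. A minimal sketch, assuming the 403 really is about a missing User-Agent as suggested in the conversation; the URL, token and User-Agent string here are placeholders, not the real Wikimedia values:

    #!/usr/bin/env python3
    """Give a pynetbox client a default User-Agent for every request."""
    import pynetbox

    # Placeholders: point these at the real Netbox instance and a valid API token.
    NETBOX_URL = "https://netbox.example.org"
    NETBOX_TOKEN = "xxxxxxxx"

    nb = pynetbox.api(NETBOX_URL, token=NETBOX_TOKEN)

    # pynetbox sends all HTTP traffic through a requests.Session exposed as
    # http_session; updating its headers makes the User-Agent stick for all
    # following requests, as quoted in the channel above.
    nb.http_session.headers.update(
        {"User-Agent": "example-netbox-script/1.0 (your-team@example.org)"}
    )

    # Any call from here on carries the header, e.g. the status endpoint:
    print(nb.status())

The pynetbox docs linked above show an equivalent route of passing a pre-configured requests.Session to the client; both approaches end up setting headers on the same session object.
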