[08:55:19] morning
[09:09:52] o/
[09:27:14] o/
[11:18:23] * dcaro lunch
[11:56:49] GM
[13:49:58] dhinus: andrewbogot.t fyi. just merged an alert to check toolsadmin.wikimedia.org, it's set as 'page' so if you see it firing let me know (it should not, but well, stuff happens)
[14:00:41] dcaro: ack
[15:12:55] ok!
[16:54:20] I have been messing off and on with a wikitech-static alternative -- I'd be interested in someone here poking at it and telling me if it is obviously useless. https://wts.wmcloud.org/wiki/Main_Page.html
[16:54:56] The search bar on the Main Page should work (although the search results are super ugly). The search bar on any other page just links back to live wikitech so isn't really of interest.
[16:55:22] So far I haven't really found things that I can't get from that site that I can get from existing https://wikitech-static.wikimedia.org/wiki/Main_Page
[16:58:20] dcaro: ^ is basically the only thing I would've had to say during our checkin
[17:00:24] bd808: if you have clear evidence of non-dns network interruptions (which I think you do) can you add that to T374830? At your leisure since I'm trying to take a break from obsessing over that one.
[17:00:24] T374830: Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup failures, often for our own hosts such as gerrit.wikimedia.org - https://phabricator.wikimedia.org/T374830
[17:00:34] andrewbogott: the search bar redirects me to the real wikitech
[17:01:12] dcaro: on the main page there's a different, uglier-looking search bar that uses lunr
[17:01:24] that's the only search bar of interest for the moment
[17:01:39] ah, got it
[17:02:06] trying to search...
[17:02:06] https://usercontent.irccloud-cdn.com/file/PmEEezDm/image.png
[17:02:11] If this becomes a 'real' product I'd either dupe the front page search bar everywhere or maybe just remove the other search bars.
[17:02:21] Oh yeah, the index is huge :) Give it a few more seconds.
[17:02:56] oh, finished, yep it seems to give all results at once
[17:02:56] I pruned a whole lot of pages out of the index, could probably prune out more...
[17:03:26] hmm, repeated too
[17:03:43] oh? What did you search for?
[17:03:44] some at least, not all
[17:04:03] Might be that the search string just appears multiple times on the page? Although I would sort of hope lunr would be smarter than that
[17:04:20] ah, not repeated no, different html file, but same content?
[17:04:51] oh, maybe how redirects are 'staticified'?
[17:05:00] could be, yeah
[17:05:20] https://usercontent.irccloud-cdn.com/file/ss122Wrv/image.png
[17:06:00] yep, I bet that's because of a 'this page has moved'.
[17:06:06] I keep thinking that having the site be totally flat html w/out any php running in the background is a feature in case of disaster but that might be silly.
[17:06:18] yep, it's a redirect
[17:06:35] https://usercontent.irccloud-cdn.com/file/kr9pyLyx/image.png
[17:06:55] it would be pretty easy to strip duplicate page titles from the search results if that happens a lot.
[17:06:58] hmm, weird redirect though, I would have expected it the other way around (unrelated to the search xd)
[17:07:06] ah, no, it's ok
[17:07:22] 'this effort previously regarded as previous is now regarded as current, soon to be previous'
[17:07:38] hahahahaa
[17:08:11] there's a few of those in my search result at least (`build service`)
[17:08:58] like 5 entries for `Help:Toolforge/Quickstart - Wikitech` xd, moved around a lot
[17:11:21] some of the links (history, discussion, etc.) point to wikitech, that's expected right?
[17:11:44] oh, the urls are actually to the live wikitech?
[17:11:49] seems weird.
[17:12:04] can you give me an example?
[17:12:05] not all, only the history/source ones
[17:12:17] I don't hate the pure static site idea, but I do think it is a slightly different solution than a mirror of Wikitech if the page titles/urls change beyond just the hostname.
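A minimal sketch of the "strip duplicate page titles from the search results" idea from 17:06:55, written in Python for illustration only (the actual site searches client-side with lunr). The result shape, a list of dicts with `title` and `ref` keys, and the function name `dedupe_by_title` are assumptions, not anything taken from the wts code.

```python
def dedupe_by_title(results):
    """Keep only the first (highest-ranked) hit for each page title.

    `results` is assumed to be a list of dicts already sorted by rank,
    each with a "title" key. Redirect stubs that were staticified into
    separate HTML files share a title, so later copies are dropped.
    """
    seen_titles = set()
    unique = []
    for hit in results:
        title = hit["title"]
        if title in seen_titles:
            continue  # a duplicate produced by a moved/redirected page
        seen_titles.add(title)
        unique.append(hit)
    return unique


if __name__ == "__main__":
    # Hypothetical hits: two files carrying the same page title.
    hits = [
        {"title": "Help:Toolforge/Quickstart - Wikitech", "ref": "a.html"},
        {"title": "Help:Toolforge/Quickstart - Wikitech", "ref": "b.html"},
        {"title": "Help:Toolforge/Build Service - Wikitech", "ref": "c.html"},
    ]
    for hit in dedupe_by_title(hits):
        print(hit["ref"], hit["title"])
```

Keeping the first occurrence preserves whatever ranking the search engine produced; it does not decide which of the duplicate files is the canonical one.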
[17:12:19] click 'view history' on any page
[17:12:43] oh, yes! Sorry, I thought you meant search links
[17:12:58] yeah, when duping the site I tried to exclude things that wouldn't be useful for actual troubleshooting.
[17:17:01] Did the search actually find you the things you were looking for? After the wait, I mean?
[17:17:51] so far yes, maybe some results I did not find relevant, but the relevant ones are there too (and the non-relevant ones kinda made sense as they match the words)
[17:18:27] Most of the ones I ran got me the page I was looking for as the first result, which is better than actual wikitech search :)
[17:18:49] (which just suggests that we should really dump wikitech to a single flat file and use ctrl-f I think)
[17:19:45] andrewbogott: the goal is to make wikitech-static //better// than wikitech :D
[17:20:16] It will definitely not be faster!
[17:20:50] you can always claim "accurate is better than fast" :D
[17:22:07] * dhinus offline
[17:23:40] I have not compared it to wikitech, let me see
[17:24:24] hmm, nope, `toolforge python` gives me the one I want on wikitech as first, but 9th on wts
[17:24:35] oh dang
[17:30:00] There is some search ranking tuning on Wikitech that might cause that. https://wikitech.wikimedia.org/wiki/MediaWiki:Cirrussearch-boost-templates makes pages with {{Toolforge nav}} or {{Cloud VPS nav}} 300% better when ranking.
[17:30:25] ohhh, sweet
[17:30:51] * andrewbogott wonders if lunr can get hints like that
[17:33:40] it seems it has some boosting mechanism, not sure if you can do the same though, seems to apply only to search terms
[18:17:29] * dcaro off
[18:17:30] cya!
[18:35:48] andrewbogott: in the deployment-prep alert stuff, I think these puppet stale cert alerts -- https://prometheus-alerts.wmcloud.org/?q=project%3Ddeployment-prep&q=alertname%3DPuppetStaleCertificates -- are there because the Prometheus collector that they trigger off of is gone. The /var/lib/prometheus/node.d/openstack_stale_puppet_certs.prom data file is timestamped 2024-07-17.
[18:36:14] I am going to delete that data file unless you have an objection
[18:36:19] no, please do
[18:36:36] that exporter not being there sounds like a bug
[18:36:51] it does unless it was replaced by a new exporter...
[18:37:55] It may have been moved to a new module? I see ::prometheus::node_openstack_stale_puppet_certs in the puppet tree
[18:38:58] There is also ::profile::openstack::base::puppetmaster::stale_certs_exporter which seems to just be a profile wrapper for the former
[18:40:50] if there's no exporter at all wouldn't we get alerts about missing metrics? Or is that not a thing that exists in alertmanager?
[18:41:04] (I mean, on other VMs without the stale file)
[18:41:22] hmmm.. it looks like it should be there actually. The deployment-prep prefix Puppet has `role::puppetserver::cloud_vps_project` which itself wraps `profile::openstack::base::puppetmaster::stale_certs_exporter`
[18:43:39] * andrewbogott asks cumin where there is or isn't /usr/local/sbin/prometheus-openstack-stale-puppet-certs
[18:44:44] it seems like prometheus_openstack_stale_puppet_certs.service is in failed state
[18:44:51] there are errors in journalctl from the timer that runs the script.
[18:45:10] stuff like "keystoneauth1.exceptions.http.NotFound: Could not find project: maps-experiments."
[18:46:21] * bd808 will make a bug
[18:47:34] If project names are hardcoded someplace it's not in operations/puppet
[18:47:52] bd808: what VM has the error about maps-experiments?
[18:48:32] andrewbogott: deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud
[18:48:37] huh
[18:48:40] T383153
[18:48:41] T383153: prometheus-openstack-stale-puppet-certs crashing on deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T383153
[18:48:48] why is there a cert for a maps-experiments VM on the deployment-prep puppetserver?
[18:50:22] that project seems to be gone too?
[18:50:43] yeah, i found the same and so revoked the cert
[18:51:01] maybe in pre-history someone pointed their VM at that puppetserver and messed with security groups to make it possible
[18:51:07] could someone have pointed maps-experiments vms at the beta cluster puppetserver?
[18:51:26] yes
[18:52:01] still saying 'Retrying mwopenstackclients.Clients.novaclient in 8.972793087020957 seconds as it raised NotFound: Could not find project: maps-experiments.'
[18:52:04] is there another one?
[18:52:27] there are several, I will clean them up
[18:53:04] oh wait, just one, my grep was too wide
[18:53:35] just needed a better clean command
[18:54:03] Jan 07 18:53:37 deployment-puppetserver-1 systemd[1]: prometheus_openstack_stale_puppet_certs.service: Deactivated successfully.
[18:54:19] it'll be a long time until my fingers are trained to type 'puppetserver ca'
[18:54:54] same, and i've used puppet 6+ for several years at home at this point
[18:55:29] now there are a whole bunch of leaked certs, but they look legit
[18:55:53] `puppetmaster_stale_cert{cert_instance="deployment-mediawiki81",cert_name="deployment-mediawiki81.deployment-prep.eqiad1.wikimedia.cloud",cert_project="deployment-prep"} 1.0`
[18:57:36] Safe for me to run the magic one-liner?
[18:57:46] (which deletes all certs not associated with an existing VM)
[18:58:10] andrewbogott: I was just about to, just documenting things in the bug first
[18:58:27] ok, I'll leave you to it (but I'm inspired to run it in some other places)
[18:58:47] * bd808 runs `/usr/local/sbin/clean-stale-puppet-certs --clean`
[19:00:43] oh, has that replaced the much-longer "for host in $(grep -o 'cert_name="[^"]*' /var/lib/prometheus/node.d/openstack_stale_puppet_certs.prom | cut -d'"' -f2); do ping -c1 -w1 "$host" && { echo "SKIPPING: $host is alive"; continue; }; puppetserver ca clean --certname "$host"; done" ?
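The `grep -o 'cert_name="[^"]*' ... | cut -d'"' -f2` step of the one-liner quoted at 19:00:43 can be sketched in Python as below. This is an illustration only, not the contents of `clean-stale-puppet-certs`; the function name `stale_cert_names` is hypothetical, and the sample line is the metric pasted at 18:55:53.

```python
import re

# Same pattern the shell one-liner greps for: the value of the
# cert_name label on each puppetmaster_stale_cert sample.
CERT_NAME_RE = re.compile(r'cert_name="([^"]*)"')


def stale_cert_names(prom_text):
    """Return the cert_name label values found in .prom file text."""
    names = []
    for line in prom_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        match = CERT_NAME_RE.search(line)
        if match:
            names.append(match.group(1))
    return names


if __name__ == "__main__":
    sample = (
        'puppetmaster_stale_cert{cert_instance="deployment-mediawiki81",'
        'cert_name="deployment-mediawiki81.deployment-prep.eqiad1.wikimedia.cloud",'
        'cert_project="deployment-prep"} 1.0\n'
    )
    print(stale_cert_names(sample))
```

The sketch only extracts names; whether a given cert should actually be cleaned is a separate decision.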
[19:01:28] andrewbogott: it ships with the puppet module
[19:02:30] it looks like that script does not bother with the ping check
[19:03:01] so could wipe out the cert for a host that has had puppet disabled for a long time
[19:03:11] it parses the output of /usr/local/sbin/prometheus-openstack-stale-puppet-certs and runs `puppetserver ca clean` for each cert
[19:03:41] yeah, it probably could. Maybe somebody should fix up the script from the Puppet module?
[19:03:54] it should not, since it checks if openstack thinks the vm exists or not
[19:04:07] oh that's better than a ping test
[19:04:55] https://github.com/wikimedia/operations-puppet/blob/production/modules/prometheus/files/usr/local/bin/clean-stale-puppet-certs.py
[19:05:34] heh. you wrote it initially andrewbogott :)
[19:06:00] "We /probably/ don't want to fully automate this cleanup but this is a lot better than typing 'puppet cert clean' 1000 times." in the original commit
[19:06:00] In that case: what a terrible idea!
[19:06:47] I'll update https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates
[19:08:58] andrewbogott: if you get bored, you could try to understand what elukey wrote on T383096 about the old Cergen certs
[19:09:00] T383096: Multiple kafka Cergen certs expired in beta cluster - https://phabricator.wikimedia.org/T383096
[19:10:55] I think he's saying that we don't need any of them anymore. But it also looks like he might return and finish the task himself
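On the ping check versus the OpenStack check discussed at 19:02-19:04: a rough sketch of why asking "does this VM exist?" is safer than asking "does this host answer ping?". This is not the logic of `clean-stale-puppet-certs.py`; the real OpenStack lookup is replaced here by a plain set of known VM names, and `certs_to_clean` is a hypothetical helper.

```python
def certs_to_clean(cert_names, existing_vm_fqdns):
    """Return certs whose hostname matches no existing VM.

    `existing_vm_fqdns` stands in for whatever OpenStack reports as the
    current instances. A VM that is powered off, firewalled, or has had
    puppet disabled for a long time still exists, so its cert is kept,
    whereas a ping test would wrongly mark it as gone.
    """
    existing = set(existing_vm_fqdns)
    return [name for name in cert_names if name not in existing]


if __name__ == "__main__":
    # Hypothetical hostnames for illustration.
    certs = ["vm-a.example.wikimedia.cloud", "vm-gone.example.wikimedia.cloud"]
    vms = {"vm-a.example.wikimedia.cloud"}
    for name in certs_to_clean(certs, vms):
        print("would clean:", name)
```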