[08:14:04] <_joe_> Krinkle: 1 - we shouldn't use grafana for alerts 2 - please don't change things for a production critical monitoring tool over the weekend unless it's needed as an emergency
[09:16:02] Krinkle: haven't seen that one before, would you mind opening a task to #sre-observability ?
[12:39:32] godog: We've had an alert on our alert manager dashboard for a few weeks, "Puppet CA certificate pontoon-puppetdb-01.monitoring.eqiad.wmflabs is about to expire in 23d 3h 0m 58s" -- obviously I can ignore it (and it shouldn't really be on our dashboard anyway...)
[12:39:47] but is that something you might want to rotate? Or would you like me to try?
[12:53:51] andrewbogott: thank you I'll take a look
[12:54:12] andrewbogott: 'monitoring' project doesn't exist anymore
[12:54:35] godog: well that's interesting!
[12:54:51] indeed
[12:56:05] there was an incident a while back where some projects got removed without their included resources being cleared first, I wonder if this is one of those...
[12:56:11] if so I'll hunt & destroy :)
[13:01:10] sounds good
[13:06:45] huh, nope, that VM was properly destroyed. The mystery deepens!
[14:40:14] o/ x-posting from slack for awareness that the BE for ondemand excimer profiling appears to have bought the farm - https://phabricator.wikimedia.org/T384836
[14:43:37] <_joe_> mszabo: yeah those services are mostly "unowned" atm, we'll take a look
[14:44:06] _joe_: thanks! performance.wikimedia.org is up and running so this might be something that only affects internal requests
[14:47:08] thanks mszabo
[14:50:49] _joe_: cdanis: https://github.com/wikimedia/operations-puppet/commit/8f2a66c3bc3c10eda727cd611d78194b285cf175 looks suspicious
[14:51:17] the assumption therein (that the BE is purely accessed via cdn) definitely is false
[14:51:43] <_joe_> mszabo: yup I just found the rule in nftables
[14:54:12] <_joe_> mszabo: reverting, thanks for reporting the bug <3
[14:54:32] <_joe_> We can think about whether we can actually restrict access here at a later time
[14:55:52] thanks!
[14:56:51] yeah it would probably need to allow all hosts that might be serving user-facing mw workloads at the very least for ondemand profiling to work
[15:03:01] <_joe_> mszabo: well AIUI this should've broken all excimer profiling
[15:03:26] <_joe_> ah no wait, it shouldn't have
[15:04:43] yea the SVG files use a different flow via a redis queue and onhost worker IIRC
[15:04:50] <_joe_> yes
[15:04:57] godog: ack, filed T384840
[15:04:59] <_joe_> my mind somehow wanted to forget
[15:05:00] T384840: Unable to edit/delete Grafana alert - https://phabricator.wikimedia.org/T384840
[15:09:02] Krinkle: thank you
[15:19:54] _joe_: works now, thanks!
[15:50:32] cdanis: If I was to add a blackbox check for this endpoint, where should I start to look?
[15:51:07] mz`: great question, https://wikitech.wikimedia.org/wiki/Prometheus#Network_probes_%28blackbox_exporter%29 probably
[15:51:12] mszabo: ^
[15:52:11] mszabo: modules/profile/manifests/microsites/monitoring.pp in the puppet repo has a couple of simple examples for http checks to copy from
[15:53:22] thanks!
[15:53:32] modules/alertmanager/templates/alertmanager.yml.erb is where you would configure what should actually happen if it alerts, like who to notify and how
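
For readers following along: alertmanager.yml.erb is an ERB template that renders to a standard upstream Alertmanager config. A minimal sketch of that upstream format, with an invented `team` label value and made-up receiver names (not WMF's actual configuration):

```yaml
# Minimal sketch of the upstream Alertmanager config format that a template
# like alertmanager.yml.erb renders to. The "team" label value and receiver
# names below are invented for illustration; WMF's real values differ.
route:
  receiver: default            # fallback if no sub-route matches
  routes:
    - matchers:
        - team="perf"          # hypothetical label attached by the alerting rule
      receiver: perf-team-email
receivers:
  - name: default
    email_configs:
      - to: sre@example.org        # placeholder address
  - name: perf-team-email
    email_configs:
      - to: perf-team@example.org  # placeholder address
```

The matching label (`team` here) is whatever the alerting rule attaches, so the rule definitions and this routing have to agree.
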
[15:55:47] if you can get away with using the checks in the `service::catalog` then please do, however
[15:56:23] <_joe_> I'm pretty sure we already have alerts for performance.w.org
[15:56:36] <_joe_> but well not probes from the internal networks
[15:58:05] yeah, so he might not be able to get away with it
[16:08:25] also blackbox probes really test whether the service works, not network access, probes are run from prometheus hosts which have access to all ports
[16:09:54] hmm
[16:12:16] I'll add a note to the docs above about this
[16:15:15] good point godog and I'm not sure how best to add a check that (say) mw-debug pods can manually export profiles
[16:16:54] ah okay
[16:17:12] mmhh would that failure already show up in some mw / excimer metrics ?
[16:17:36] this was an ondemand-only failure so it did cause a log, but only when someone (me) explicitly tried to profile something
[16:18:00] PHP Warning: ExcimerUI server error: Failed to connect to performance.discovery.wmnet port 443: Connection timed out
[16:18:12] godog: so, apparently the automated profile collection takes a totally different path
[16:18:21] the ondemand manual one does a POST directly
[16:18:44] yea, automated profiles get sent to redis and processed by a smorgasbord of onhost workers
[16:19:34] interesting, ok, yeah not sure offhand how to best properly alert
[16:20:04] mszabo: is there anything in the HTTP response from MW as to whether or not posting the profile worked?
[16:23:53] gotta go, will catch up later/tomorrow
[16:34:38] cdanis: in the response no, just a log
[16:35:38] maybe we could add a response header on a failure (although now that I know mediawiki a bit I can imagine reasons why that would be hard/impossible)
[16:35:47] or we could watch for those log messages, that's possible too
[16:39:45] I think it's fine as is, if someone tries to do ondemand profiling and it fails, they can use the req ID to find relevant logs
[16:41:54] eh I do think we should have some sort of alert for this flow not working, especially given it's different from the usual
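
A sketch of what such an alert could look like, in the standard upstream Prometheus rule-file format. The job name, probe URL, thresholds, and labels here are assumptions, not WMF's actual config, and per godog's caveat above a rule like this only tests service health from the prometheus hosts, not reachability from mw-debug pods:

```yaml
# Hypothetical Prometheus alerting rule for a blackbox HTTP probe of the
# ondemand excimer backend. probe_success is the standard metric exported by
# blackbox_exporter; the job name, probe URL, and threshold are assumptions.
groups:
  - name: excimer_ondemand
    rules:
      - alert: ExcimerOndemandBackendDown
        expr: probe_success{job="blackbox", instance="https://performance.discovery.wmnet/excimer"} == 0
        for: 5m                # require sustained failure to avoid flapping
        labels:
          severity: warning
          team: perf           # hypothetical; must match the Alertmanager route
        annotations:
          summary: "HTTP probe of the ondemand excimer profiling backend is failing"
```
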
[18:17:42] greetings! FYI for wider awareness, ~1% of enrollable external clients are now serving on PHP 8.1. T383845 has the latest. I'll be keeping an eye on this throughout the day today.
[18:17:42] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[18:18:40] swfrench-wmf: congrats
[18:32:12] {◕ ◡ ◕}
[18:33:16] \o/
[18:33:28] 8.4 in a few months? :)
[18:34:00] mszabo: thank you very much for your help with some of those deprecation errors
[18:34:20] yw, fortunately these have been relatively uncomplicated thus far
[18:34:47] :)
[18:35:39] but also, if this procedure for PHP updates goes fairly smoothly, it should be something we can easily reuse for the next one, yeah
[18:35:57] I suppose this is also a nice benefit of mw-on-k8s
[18:36:10] less drudgery involved in preparing a new version
[18:45:06] indeed, once we figured out exactly _how_ we wanted to make this "mw-on-k8s shaped", the rest of the infrastructure bits have been / should be fairly straightforward :)
[18:52:14] nice work team!
[20:30:34] Anyone know how to easily rename an apt component?
[20:30:36] So far it's looking like I have to re-upload everything :/
[20:31:59] brett: doubt that is possible; fwiw, I also just reuploaded once when I did that
[20:32:34] booo
[20:33:01] maybe search Phab to see if you can spot something?
[20:34:13] I've been unsuccessful, sadly
[20:34:52] pita
[20:49:27] :]
[21:21:46] probably a question for moritz-m or E-mperor when they get back
[21:39:27] turns out I don't need to do that after all... derp
[21:39:29] But thank you :)