[08:14:04] <_joe_> Krinkle: 1 - we shouldn't use grafana for alerts 2 - please don't change things for a production critical monitoring tool over the weekend unless it's needed as an emergency
[09:16:02] Krinkle: haven't seen that one before, would you mind opening a task to #sre-observability ?
[12:39:32] godog: We've had an alert on our alert manager dashboard for a few weeks, "Puppet CA certificate pontoon-puppetdb-01.monitoring.eqiad.wmflabs is about to expire in 23d 3h 0m 58s" -- obviously I can ignore it (and it shouldn't really be on our dashboard anyway...)
[12:39:47] but is that something you might want to rotate? Or would you like me to try?
[12:53:51] andrewbogott: thank you I'll take a look
[12:54:12] andrewbogott: 'monitoring' project doesn't exist anymore
[12:54:35] godog: well that's interesting!
[12:54:51] indeed
[12:56:05] there was an incident a while back where some projects got removed without their included resources being cleared first, I wonder if this is one of those...
[12:56:11] if so I'll hunt & destroy :)
[13:01:10] sounds good
[13:06:45] huh, nope, that VM was properly destroyed. The mystery deepens!
[14:40:14] o/ x-posting from slack for awareness that the BE for ondemand excimer profiling appears to have bought the farm - https://phabricator.wikimedia.org/T384836
[14:43:37] <_joe_> mszabo: yeah those services are mostly "unowned" atm, we'll take a look
[14:44:06] _joe_: thanks! performance.wikimedia.org is up and running so this might be something that only affects internal requests
[14:47:08] thanks mszabo
[14:50:49] _joe_: cdanis: https://github.com/wikimedia/operations-puppet/commit/8f2a66c3bc3c10eda727cd611d78194b285cf175 looks suspicious
[14:51:17] the assumption therein (that the BE is purely accessed via cdn) definitely is false
[14:51:43] <_joe_> mszabo: yup I just found the rule in nftables
[14:54:12] <_joe_> mszabo: reverting, thanks for reporting the bug <3
[14:54:32] <_joe_> We can think about whether we can actually restrict access here at a later time
[14:55:52] thanks!
[14:56:51] yeah it would probably need to allow all hosts that might be serving user-facing mw workloads at the very least for ondemand profiling to work
[15:03:01] <_joe_> mszabo: well AIUI this should've broken all excimer profiling
[15:03:26] <_joe_> ah no wait, it shouldn't have
[15:04:43] yea the SVG files use a different flow via a redis queue and onhost worker IIRC
[15:04:50] <_joe_> yes
[15:04:57] godog: ack, filed T384840
[15:04:59] <_joe_> my mind somehow wanted to forget
[15:05:00] T384840: Unable to edit/delete Grafana alert - https://phabricator.wikimedia.org/T384840
[15:09:02] Krinkle: thank you
[15:19:54] _joe_: works now, thanks!
[15:50:32] cdanis: If I was to add a blackbox check for this endpoint, where should I start to look?
[15:51:07] mz`: great question, https://wikitech.wikimedia.org/wiki/Prometheus#Network_probes_%28blackbox_exporter%29 probably
[15:51:12] mszabo: ^
[15:52:11] mszabo: modules/profile/manifests/microsites/monitoring.pp in the puppet repo has a couple of simple examples for http checks to copy from
[15:53:22] thanks!
[15:53:32] modules/alertmanager/templates/alertmanager.yml.erb is where you would configure what should actually happen if it alerts, like who to notify and how
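
For readers following along: alertmanager.yml.erb is an ERB template that renders to a standard upstream Alertmanager config. A minimal sketch of that upstream format, with an invented `team` label value and made-up receiver names (not WMF's actual configuration):

```yaml
# Minimal sketch of the upstream Alertmanager config format that a template
# like alertmanager.yml.erb renders to. The "team" label value and receiver
# names below are invented for illustration; WMF's real values differ.
route:
  receiver: default            # fallback if no sub-route matches
  routes:
    - matchers:
        - team="perf"          # hypothetical label attached by the alerting rule
      receiver: perf-team-email
receivers:
  - name: default
    email_configs:
      - to: sre@example.org        # placeholder address
  - name: perf-team-email
    email_configs:
      - to: perf-team@example.org  # placeholder address
```

The matching label (`team` here) is whatever the alerting rule attaches, so the rule definitions and this routing have to agree.
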
[15:55:47] if you can get away with using the checks in the `service::catalog` then please do, however
[15:56:23] <_joe_> I'm pretty sure we already have alerts for performance.w.org
[15:56:36] <_joe_> but well not probes from the internal networks
[15:58:05] yeah, so he might not be able to get away with it
[16:08:25] also blackbox probes really test whether the service works, not network access, probes are run from prometheus hosts which have access to all ports
[16:09:54] hmm
[16:12:16] I'll add a note to the docs above about this
[16:15:15] good point godog and I'm not sure how best to add a check that (say) mw-debug pods can manually export profiles
[16:16:54] ah okay
[16:17:12] mmhh would that failure already show up in some mw / excimer metrics ?
[16:17:36] this was an ondemand-only failure so it did cause a log, but only when someone (me) explicitly tried to profile something
[16:18:00] PHP Warning: ExcimerUI server error: Failed to connect to performance.discovery.wmnet port 443: Connection timed out
[16:18:12] godog: so, apparently the automated profile collection takes a totally different path
[16:18:21] the ondemand manual one does a POST directly
[16:18:44] yea, automated profiles get sent to redis and processed by a smorgasbord of onhost workers
[16:19:34] interesting, ok, yeah not sure offhand how to best properly alert
[16:20:04] mszabo: is there anything in the HTTP response from MW as to whether or not posting the profile worked?
[16:23:53] gotta go, will catch up later/tomorrow
[16:34:38] cdanis: in the response no, just a log
[16:35:38] maybe we could add a response header on a failure (although now that I know mediawiki a bit I can imagine reasons why that would be hard/impossible)
[16:35:47] or we could watch for those log messages, that's possible too
[16:39:45] I think it's fine as is, if someone tries to do ondemand profiling and it fails, they can use the req ID to find relevant logs
[16:41:54] eh I do think we should have some sort of alert for this flow not working, especially given it's different from the usual
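
A sketch of what such an alert could look like, in the standard upstream Prometheus rule-file format. The job name, probe URL, thresholds, and labels here are assumptions, not WMF's actual config, and per godog's caveat above a rule like this only tests service health from the prometheus hosts, not reachability from mw-debug pods:

```yaml
# Hypothetical Prometheus alerting rule for a blackbox HTTP probe of the
# ondemand excimer backend. probe_success is the standard metric exported by
# blackbox_exporter; the job name, probe URL, and threshold are assumptions.
groups:
  - name: excimer_ondemand
    rules:
      - alert: ExcimerOndemandBackendDown
        expr: probe_success{job="blackbox", instance="https://performance.discovery.wmnet/excimer"} == 0
        for: 5m                # require sustained failure to avoid flapping
        labels:
          severity: warning
          team: perf           # hypothetical; must match the Alertmanager route
        annotations:
          summary: "HTTP probe of the ondemand excimer profiling backend is failing"
```
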
[18:17:42] greetings! FYI for wider awareness, ~1% of enrollable external clients are now serving on PHP 8.1. T383845 has the latest. I'll be keeping an eye on this throughout the day today.
[18:17:42] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[18:18:40] swfrench-wmf: congrats
[18:32:12] {◕ ◡ ◕}
[18:33:16] \o/
[18:33:28] 8.4 in a few months? :)
[18:34:00] mszabo: thank you very much for your help with some of those deprecation errors
[18:34:20] yw, fortunately these have been relatively uncomplicated thus far
[18:34:47] :)
[18:35:39] but also, if this procedure for PHP updates goes fairly smoothly, it should be something we can easily reuse for the next one, yeah
[18:35:57] I suppose this is also a nice benefit of mw-on-k8s
[18:36:10] less drudgery involved in preparing a new version
[18:45:06] indeed, once we figured out exactly _how_ we wanted to make this "mw-on-k8s shaped", the rest of the infrastructure bits have been / should be fairly straightforward :)
[18:52:14] nice work team!
[20:30:34] Anyone know how to easily rename an apt component?
[20:30:36] So far it's looking like I have to re-upload everything :/
[20:31:59] brett: doubt that is possible; fwiw, I also just reuploaded once when I did that
[20:32:34] booo
[20:33:01] maybe search Phab to see if you can spot something?
[20:34:13] I've been unsuccessful, sadly
[20:34:52] pita
[20:49:27] :]
[21:21:46] probably a question for moritz-m or E-mperor when they get back
[21:39:27] turns out I don't need to do that after all... derp
[21:39:29] But thank you :)