[01:33:38] <wikibugs>	 10serviceops, 10Content-Transform-Team-WIP, 10Page Content Service, 10RESTBase Sunsetting, 10Patch-For-Review: Update mobileapps k8s deployment chart for Cassandra credentials - https://phabricator.wikimedia.org/T350507 (10Eevans) >>! In T350507#9468669, @Jgiannelos wrote: > @Eevans Sine things are movin...
[01:35:10] <wikibugs>	 10serviceops, 10SRE: scap not installed on mw1486.eqiad.wmnet which breaks deployment: /usr/bin/scap: No such file or directory - https://phabricator.wikimedia.org/T355622 (10Mstyles)
[08:45:54] <wikibugs>	 10serviceops, 10MW-on-K8s, 10SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10Clement_Goubert)
[08:45:58] <wikibugs>	 10serviceops, 10SRE: scap not installed on mw1486.eqiad.wmnet which breaks deployment: /usr/bin/scap: No such file or directory - https://phabricator.wikimedia.org/T355622 (10Clement_Goubert) 05Open→03Resolved a:03Clement_Goubert Scap deployments have been running fine following the proxy replacement. Re...
[09:04:58] <godog>	 hi folks, I'm looking for a quick review re: the removal of graphite mw alerts in favor of prometheus https://gerrit.wikimedia.org/r/c/operations/alerts/+/991007?usp=dashboard and https://gerrit.wikimedia.org/r/c/operations/puppet/+/991008?usp=dashboard
[09:27:05] <claime>	 godog: I'm not sure I'm reading it correctly. The old alert was over 30% failure in 15 minutes or 50/min, the new one is 20/min over 2 minutes ?
[09:32:06] <godog>	 claime: for the old one I'm reading it as "looking back 15 minutes at least 30% of datapoints are > 50" but yes I've also eyeballed the graph on what might be reasonable
[09:33:50] <claime>	 godog: ah, gotcha, anyways we can tweak it if it's too noisy 
[09:34:18] <godog>	 yeah easy enough
[09:34:37] <godog>	 the dashboard panel is already on the new metric btw
[09:35:15] <claime>	 ack
[09:35:18] <claime>	 gg :)
[09:35:37] <claime>	 +1'd
[09:35:42] <godog>	 neato, thank you claime 
[09:35:48] * godog sends wikilove claime's way
[09:35:58] <claime>	 <3
[10:06:55] <wikibugs>	 10serviceops, 10Data-Engineering, 10Wikidata, 10Wikidata-Termbox, and 2 others: Migrate Termbox SSR from Node 16 to 18 - https://phabricator.wikimedia.org/T355685 (10akosiaris) The deeper reason behind most of this mess is the probably the uniqueness of the `test` release. There is no other environment whe...
[10:33:09] <nemo-yiannis>	 Hi! We want to switchover outgoing traffic from PCS to RESTBase parsoid and point it to MW parsoid. While double checking things with my team we wanted to make sure we have a way to invalidate storage on RESTBase in case something goes wrong so we dont corrupt cassandra content. Back in the Peter Pchelco had a script to read events from kafka and invalidate entries from RESTBase cassandra.
[10:33:29] <nemo-yiannis>	 Does anyone know if it still exists or how to do something similar?
[10:36:29] <akosiaris>	 doesn't ring a bell
[10:36:51] <akosiaris>	 looking into his home dirs on various hosts doesn't return anything either
[10:41:16] <nemo-yiannis>	 OK thanks. I will check on old phabricator tickets to see if i find something.
[10:53:53] <nemo-yiannis>	 Hm, i think it was this: https://github.com/wikimedia/restbase/pull/1297/files
[11:00:08] <akosiaris>	 this piece of code, seems so simple and yet I am not sure I understand what it does
[11:11:43] <akosiaris>	 it mutates the response object to add cache-control: no-cache if Etag isn't in the timeframe specified? 
[11:12:12] <akosiaris>	 how is that purging things? 
[11:13:01] <nemo-yiannis>	 it get from tid the timestamp and if it falls in the timewindow we want to purge instead of fetching from storage it triggers pregeneration
[11:13:04] <nemo-yiannis>	 (i think)
[11:14:05] <akosiaris>	 ah, so it's not "purging" per se, but just bypassing "storing/caching"
[11:14:08] <nemo-yiannis>	 yes
[11:26:32] <akosiaris>	 it doesn't look like this does what you described wanting to do though. This would (hopefully) bypass caching, thus avoiding polluting cassandra content, while you want to be able to recover from a polluted cassandra content situation?
[11:34:52] <nemo-yiannis>	 True this will not fix the content in cassandra but instead serve fresh content for that time window
[11:39:08] <nemo-yiannis>	 Maybe we can just query cassandra based on that tid and drop the rows directly?
[11:39:20] <nemo-yiannis>	 I will check in with data persistence folks
[11:39:58] <akosiaris>	 if you know the tids, yes that would be possible
[12:18:36] <wikibugs>	 10serviceops, 10Data-Engineering, 10Wikidata, 10Wikidata-Termbox, and 2 others: Migrate Termbox SSR from Node 16 to 18 - https://phabricator.wikimedia.org/T355685 (10Lucas_Werkmeister_WMDE) >>! In T355685#9484091, @akosiaris wrote: > My high level suggestion would be to re-evaluate if the `test` helm relea...
[12:42:33] <wikibugs>	 10serviceops, 10MW-on-K8s, 10SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10hnowlan)
[12:43:38] <wikibugs>	 10serviceops, 10MW-on-K8s, 10SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10hnowlan)
[14:13:49] <_joe_>	 nemo-yiannis, akosiaris IIRC calls to restbase URLs with the cache-control: no-cache header do invalidate cache in restbase
[14:44:54] <nemo-yiannis>	 yeah, what i dont know is if we have a way to send a request with `cache-control: no-cache` for every pregeneration event that changeprop sent in case there was something wrong with the change and we store corrupted content
[14:48:06] <hnowlan>	 Petr definitely had some tool for this but I never used it myself :/ 
[14:59:16] <wikibugs>	 10serviceops, 10MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2427.codfw.wmnet with OS bullseye
[14:59:19] <wikibugs>	 10serviceops, 10MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2430.codfw.wmnet with OS bullseye
[15:00:29] <wikibugs>	 10serviceops, 10MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2446.codfw.wmnet with OS bullseye
[15:29:04] <wikibugs>	 10serviceops, 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[15:40:10] <wikibugs>	 10serviceops, 10MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2427.codfw.wmnet with OS bullseye completed: - mw2427 (**PASS**)   - Downtimed on...
[15:42:21] <wikibugs>	 10serviceops, 10MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2430.codfw.wmnet with OS bullseye completed: - mw2430 (**PASS**)   - Downtimed on...
[15:45:09] <wikibugs>	 10serviceops, 10MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2446.codfw.wmnet with OS bullseye completed: - mw2446 (**PASS**)   - Downtimed on...
[16:18:25] <wikibugs>	 10serviceops, 10Commons, 10MediaWiki-File-management, 10SRE, and 4 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (10Bawolff) Just trying to think up solutions - if thumbor gives a 429, could varnish instead send an (u...
[16:31:54] <wikibugs>	 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, 10Release-Engineering-Team (Seen): Move 40% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T355532 (10Clement_Goubert)
[17:14:24] <wikibugs>	 10serviceops, 10MW-on-K8s, 10SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10hnowlan)