[01:33:38] 10serviceops, 10Content-Transform-Team-WIP, 10Page Content Service, 10RESTBase Sunsetting, 10Patch-For-Review: Update mobileapps k8s deployment chart for Cassandra credentials - https://phabricator.wikimedia.org/T350507 (10Eevans) >>! In T350507#9468669, @Jgiannelos wrote: > @Eevans Since things are movin...
[01:35:10] 10serviceops, 10SRE: scap not installed on mw1486.eqiad.wmnet which breaks deployment: /usr/bin/scap: No such file or directory - https://phabricator.wikimedia.org/T355622 (10Mstyles)
[08:45:54] 10serviceops, 10MW-on-K8s, 10SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10Clement_Goubert)
[08:45:58] 10serviceops, 10SRE: scap not installed on mw1486.eqiad.wmnet which breaks deployment: /usr/bin/scap: No such file or directory - https://phabricator.wikimedia.org/T355622 (10Clement_Goubert) 05Open→03Resolved a:03Clement_Goubert Scap deployments have been running fine following the proxy replacement. Re...
[09:04:58] hi folks, I'm looking for a quick review re: the removal of graphite mw alerts in favor of prometheus https://gerrit.wikimedia.org/r/c/operations/alerts/+/991007?usp=dashboard and https://gerrit.wikimedia.org/r/c/operations/puppet/+/991008?usp=dashboard
[09:27:05] godog: I'm not sure I'm reading it correctly. The old alert was over 30% failure in 15 minutes or 50/min, the new one is 20/min over 2 minutes?
[09:32:06] claime: for the old one I'm reading it as "looking back 15 minutes, at least 30% of datapoints are > 50", but yes, I've also eyeballed the graph for what might be reasonable
[09:33:50] godog: ah, gotcha. anyway, we can tweak it if it's too noisy
[09:34:18] yeah, easy enough
[09:34:37] the dashboard panel is already on the new metric btw
[09:35:15] ack
[09:35:18] gg :)
[09:35:37] +1'd
[09:35:42] neato, thank you claime
[09:35:48] * godog sends wikilove claime's way
[09:35:58] <3
[10:06:55] 10serviceops, 10Data-Engineering, 10Wikidata, 10Wikidata-Termbox, and 2 others: Migrate Termbox SSR from Node 16 to 18 - https://phabricator.wikimedia.org/T355685 (10akosiaris) The deeper reason behind most of this mess is probably the uniqueness of the `test` release. There is no other environment whe...
[10:33:09] Hi! We want to switch over PCS's outgoing traffic from RESTBase parsoid to MW parsoid. While double-checking things with my team, we wanted to make sure we have a way to invalidate storage on RESTBase in case something goes wrong, so we don't corrupt cassandra content. Back in the day, Petr Pchelko had a script to read events from kafka and invalidate entries from RESTBase cassandra.
[10:33:29] Does anyone know if it still exists, or how to do something similar?
[10:36:29] doesn't ring a bell
[10:36:51] looking into his home dirs on various hosts doesn't return anything either
[10:41:16] OK thanks. I will check old phabricator tickets to see if I find something.
[10:53:53] Hm, I think it was this: https://github.com/wikimedia/restbase/pull/1297/files
[11:00:08] this piece of code seems so simple, and yet I am not sure I understand what it does
[11:11:43] it mutates the response object to add cache-control: no-cache if the ETag isn't in the timeframe specified?
[11:12:12] how is that purging things?
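(For reference on the exchange below: the tid embedded in a RESTBase ETag is a time-based v1 UUID, so the timestamp of a stored render can be recovered from it. A minimal Python sketch of that recovery follows; RESTBase itself is Node.js, so this only mirrors the idea, and the names and window logic are illustrative:)

```python
import uuid
from datetime import datetime, timedelta, timezone

# v1 UUIDs count time in 100 ns intervals since the Gregorian
# epoch (1582-10-15), not the Unix epoch.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)

def tid_timestamp(tid: str) -> datetime:
    """Recover the wall-clock timestamp from a time-based (v1) tid."""
    u = uuid.UUID(tid)
    if u.version != 1:
        raise ValueError("tid is expected to be a v1 (time-based) UUID")
    return GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)

def in_purge_window(tid: str, start: datetime, end: datetime) -> bool:
    # If the stored render falls inside the suspect window, the
    # middleware triggers pregeneration instead of serving from storage.
    return start <= tid_timestamp(tid) <= end
```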
[11:13:01] it gets the timestamp from the tid, and if it falls in the time window we want to purge, it triggers pregeneration instead of fetching from storage
[11:13:04] (I think)
[11:14:05] ah, so it's not "purging" per se, but just bypassing "storing/caching"
[11:14:08] yes
[11:26:32] it doesn't look like this does what you described wanting to do, though. This would (hopefully) bypass caching, thus avoiding polluting cassandra content, while you want to be able to recover from an already-polluted cassandra content situation?
[11:34:52] True, this will not fix the content in cassandra, but instead serve fresh content for that time window
[11:39:08] Maybe we can just query cassandra based on that tid and drop the rows directly?
[11:39:20] I will check in with data persistence folks
[11:39:58] if you know the tids, yes, that would be possible
[12:18:36] 10serviceops, 10Data-Engineering, 10Wikidata, 10Wikidata-Termbox, and 2 others: Migrate Termbox SSR from Node 16 to 18 - https://phabricator.wikimedia.org/T355685 (10Lucas_Werkmeister_WMDE) >>! In T355685#9484091, @akosiaris wrote: > My high level suggestion would be to re-evaluate if the `test` helm relea...
[12:42:33] 10serviceops, 10MW-on-K8s, 10SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10hnowlan)
[12:43:38] 10serviceops, 10MW-on-K8s, 10SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10hnowlan)
[14:13:49] <_joe_> nemo-yiannis, akosiaris: IIRC calls to restbase URLs with the cache-control: no-cache header do invalidate the cache in restbase
[14:44:54] yeah, what I don't know is if we have a way to send a request with `cache-control: no-cache` for every pregeneration event that changeprop sent, in case there was something wrong with the change and we stored corrupted content
[14:48:06] Petr definitely had some tool for this but I never used it myself :/
[14:59:16] 10serviceops, 10MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2427.codfw.wmnet with OS bullseye
[14:59:19] 10serviceops, 10MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2430.codfw.wmnet with OS bullseye
[15:00:29] 10serviceops, 10MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2446.codfw.wmnet with OS bullseye
[15:29:04] 10serviceops, 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[15:40:10] 10serviceops, 10MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2427.codfw.wmnet with OS bullseye completed: - mw2427 (**PASS**) - Downtimed on...
[15:42:21] 10serviceops, 10MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2430.codfw.wmnet with OS bullseye completed: - mw2430 (**PASS**) - Downtimed on...
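(On the 11:39 idea of querying cassandra by tid and dropping rows directly: a hypothetical sketch using the Python cassandra-driver. The host, credentials, keyspace, table, and key columns below are placeholders for illustration, not the real RESTBase storage layout, which would need to be checked with data persistence first:)

```python
import uuid
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

# All names below (host, credentials, keyspace, table, key columns) are
# assumptions; verify against the actual RESTBase schema before running.
auth = PlainTextAuthProvider(username='restbase', password='...')
cluster = Cluster(['cassandra-host.example.wmnet'], auth_provider=auth)
session = cluster.connect('parsoid_html')  # assumed keyspace

bad_renders = [
    # (domain, title, tid) triples identified from the suspect window
    ('en.wikipedia.org', 'Example', '00000000-0000-1000-8000-000000000000'),
]

for domain, title, tid in bad_renders:
    # Assumes tid is a clustering column, so a single row can be dropped.
    session.execute(
        'DELETE FROM data WHERE "_domain" = %s AND title = %s AND tid = %s',
        (domain, title, uuid.UUID(tid)),
    )
cluster.shutdown()
```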
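(And on the 14:44 question, a rough sketch of what such a replay tool could look like: consume the pregeneration events for the suspect window and re-request each URI with `cache-control: no-cache`, which per _joe_'s note should make RESTBase re-render and overwrite the stored copy. The topic name, broker, and event fields are assumptions, not verified against the current event schema:)

```python
import json
import requests
from kafka import KafkaConsumer  # kafka-python

consumer = KafkaConsumer(
    'eqiad.resource_change',  # assumed topic carrying pregeneration events
    bootstrap_servers='kafka-main1001.eqiad.wmnet:9092',  # placeholder broker
    auto_offset_reset='earliest',
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)

# Suspect window during which corrupted content may have been stored.
WINDOW_START = '2024-01-24T10:00:00Z'
WINDOW_END = '2024-01-24T12:00:00Z'

for msg in consumer:
    meta = msg.value.get('meta', {})
    uri, dt = meta.get('uri'), meta.get('dt', '')
    if uri and WINDOW_START <= dt <= WINDOW_END:
        # no-cache should force a re-render that replaces the stored copy
        requests.get(uri, headers={'Cache-Control': 'no-cache'}, timeout=30)
```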
[15:45:09] 10serviceops, 10MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2446.codfw.wmnet with OS bullseye completed: - mw2446 (**PASS**) - Downtimed on...
[16:18:25] 10serviceops, 10Commons, 10MediaWiki-File-management, 10SRE, and 4 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (10Bawolff) Just trying to think up solutions - if thumbor gives a 429, could varnish instead send an (u...
[16:31:54] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, 10Release-Engineering-Team (Seen): Move 40% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T355532 (10Clement_Goubert)
[17:14:24] 10serviceops, 10MW-on-K8s, 10SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10hnowlan)