[01:43:33] Does someone here know if T325232 https://gerrit.wikimedia.org/r/c/operations/puppet/+/908995 is still relevant? If not, feel free to close the patch. It's still showing on our team dashboard every day :)
[01:43:34] T325232: Migrate Dumpsdata and Htmldumper Hosts From Buster to Bullseye - https://phabricator.wikimedia.org/T325232
[01:43:45] Gerrit team dashboard*
[02:28:33] Krinkle: er, I can probably look into that. Data Platform SRE is taking over responsibility for the dumps infrastructure.
[02:29:57] I'll check tomorrow to see if I can take over the patch.
[09:29:33] GitLab needs a short maintenance break in one hour
[10:33:33] GitLab upgrade done
[13:03:45] !incidents
[13:03:45] 4553 (RESOLVED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway codfw)
[13:03:46] 4552 (RESOLVED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway codfw)
[13:03:46] 4551 (RESOLVED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway codfw)
[13:06:51] Quick check for my sanity: how are CDN purges from MW relayed to the CDN layer? Do they go through EventGate? And through ChangeProp?
[13:08:40] We don't use HTCP anymore, right?
[13:09:12] duesen: I think that this is the best reference at the moment: https://wikitech.wikimedia.org/wiki/Kafka_HTTP_purging They use kafka-main, but not eventgate.
[13:10:04] Oh, maybe they do use eventgate-main.
[13:11:14] Thank you! That page says:
[13:11:19] EventBus extension provides an implementation of the EventRelayer, CdnPurgeEventRelayer that creates purge events and sends them to Kafka using normal EventBus flow - via eventgate service.
[13:11:28] yep IIRC it is used, the purge events end up in kafka-main and on every cdn node there is a daemon called "purged" that reads events from kafka and issues purge actions to varnish
[13:12:19] Ok, so it's MW -> eventgate -> purged. But no changeprop.
[13:12:53] MW -> eventgate -> kafka -> purged
[13:13:33] No changeprop as far as I am aware. But don't quote me :-)
[13:35:56] HTCP/multicast is long gone IIRC
[13:36:06] I'm not even sure our network would globally multicast it anymore
[13:36:24] but I can do a quick sniff check and see if any is flowing on an edge node
[13:37:32] actually I don't even have to sniff to check: I can see that "purged" doesn't even have the correct configuration/flags anymore to consume HTCP multicast events.
[13:37:37] so even if they were being sent, nobody's listening
[13:45:18] Cross-posting (sorry for the noise): Hi folks, I sent an email requesting volunteer coverage for an Americas shift today and/or tomorrow. Please take a look and reach out to me if you can assist. Thanks in advance for your consideration; details are on the sre@ mailing list.
[14:45:43] HTCP was replaced by purged events during the move of changeprop to k8s, but changeprop *does* generate purges
[14:48:14] see purge_stream references in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/changeprop/templates/_config.yaml
[14:49:04] duesen: ^
[15:48:59] Oh, thanks hnowlan. I did not know that.
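(Editor's note: the consumer end of the flow described above — MW -> eventgate -> kafka-main -> purged -> varnish — can be illustrated roughly as below. This is only a sketch, not the actual purged daemon; the topic name, broker address, and event field names are assumptions.)

```python
# Illustrative sketch of the consumer end of the purge pipeline described
# above (MW -> eventgate -> kafka-main -> purged -> varnish).
# NOT the real "purged" daemon; topic name, broker, and event fields are
# assumptions made for the example.
import json

import requests
from kafka import KafkaConsumer  # kafka-python

consumer = KafkaConsumer(
    "resource-purge",                                # hypothetical topic name
    bootstrap_servers=["kafka-main.example:9092"],   # placeholder broker
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    url = event.get("meta", {}).get("uri")  # purge target; field name assumed
    if not url:
        continue
    # Ask the local cache to drop the object. "PURGE" is the conventional
    # Varnish method for this, subject to the frontend's VCL/ACLs.
    requests.request("PURGE", url, timeout=5)
```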
[15:49:42] <_joe_> the way purges work from mediawiki is
[15:50:37] <_joe_> mw -> htmlCacheUpdateJob to eventgate-main -> kafka-main -> cp-jobqueue -> mw jobrunner -> eventgate-main -> kafka -> purged
[15:51:06] <_joe_> actually I think htmlCacheUpdate spawns a CdnUpdate job that generates the purge, but don't quote me on that :)
[15:51:56] <_joe_> for the other stuff, they have changeprop (non-jq) listening to mediawiki events and generating either direct purges (that was the case for ORES) or calling services that will generate the purge messages to kafka, like restbase
[15:52:22] an incomplete summary here https://phabricator.wikimedia.org/P48229
[15:52:31] inside changeprop/restbase
[15:52:36] <_joe_> duesen: I hope I made your understanding more fuzzy :)
[16:04:35] Hi all! I encountered an unexpected diff while decommissioning a host - a large number of elastic servers were marked for removal during the sre.dns.netbox cookbook step. I aborted out of it to be safe. Any hints on how to proceed?
[16:05:00] inflatador: ^^
[16:06:30] cwhite: that's related to my blowing up one of the small ES clusters yesterday. It's fixed but I need to clean up some hosts that are in a half-decommissioned state. volans: is it possible to skip over these changes for cwhite?
[16:07:44] I am not volan.s (obviously!) but either you can roll back your changes or cwhite will need to proceed with their removal for his own change
[16:14:02] inflatador: no, Netbox generates the whole set of auto-generated DNS records
[16:14:40] it has no concept of partial updates, also because those could be potentially risky when there are cross-dependencies between things
[16:16:28] so if you don't want some records to be deleted they need to be restored in Netbox
[16:16:58] or this will block the propagation of any future changes to DNS records in Netbox
[16:18:09] the diff of sudo cookbook -d sre.dns.netbox "test" seems fairly substantial (this will run the cookbook in dry-run mode) so worth double-checking IMO
[16:29:25] inflatador: is it safe to remove these hosts from netbox and dns?
[16:32:10] yeah, if we don't fix this it's only going to get worse. so if we can't remove these records for any reason, we should make a note of them by running -d to capture the diff, and add them back later
[16:32:19] since there are really only two options here, sadly
[16:33:15] we can re-create IPs and DNS names in netbox if needed, but I don't know what the expected status is right now
[16:33:38] looks like it will affect elastic2038-2048 and 2050-2054
[16:34:03] cwhite: can you dump the diff somewhere if you have it handy? otherwise I will run -d again
[16:34:14] sure
[16:34:16] https://phabricator.wikimedia.org/P58999
[16:34:17] done
[16:34:21] thanks volans
[16:34:36] inflatador: please check the diff
[16:44:02] volans sukhe cwhite LGTM, feel free to move fwd
[16:44:11] inflatador: thanks
[16:44:12] cwhite: ^
[16:44:32] cool, thanks!
[17:11:28] Hello everyone, while running our alertreview analysis tooling today I noticed that reprepro emails constitute a significant portion (63.76% in the last 100 days, 66.84% in the last year) of all emails sent to the 'root' alias, creating excessive noise.
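(Editor's note: a subject-based breakdown like the one quoted above could be reproduced roughly as follows. This is only a sketch, not the actual alertreview tooling; the archive path and the matched subject string are assumptions.)

```python
# Rough sketch of computing a subject-based breakdown of root@ mail from an
# mbox archive. NOT the actual alertreview tooling; the archive path and the
# matched subject string are assumptions.
import mailbox

REPREPRO_SUBJECT = "reprepro changes in public_apt_repository"  # assumed subject

box = mailbox.mbox("/path/to/root-alias.mbox")  # hypothetical archive location

total = 0
reprepro = 0
for msg in box:
    total += 1
    if REPREPRO_SUBJECT in (msg.get("Subject") or ""):
        reprepro += 1

if total:
    print(f"{reprepro}/{total} mails ({100 * reprepro / total:.2f}%) were reprepro notifications")
```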
[17:12:19] I created a task to track this situation and added a proposed solution for the issue. Please take a look at it if you can; I'd appreciate contributions to reduce the noise reprepro emails generate: https://phabricator.wikimedia.org/T361262
[17:12:29] T361262
[17:12:30] T361262: Reduce 'root' Email Noise by Migrating Reprepro Emails to Google Group - https://phabricator.wikimedia.org/T361262
[17:16:45] I think it was supposed to be like a sanity check / security alert once, like to tell us if someone messes with the repo, afair.
[17:19:24] if the problem is that those emails are spammy, how will routing them to all SREs via a google group improve the situation compared to an exim alias?
[17:20:53] mutante: Yeah, but if the volume of alerts is too high without actionable input it just becomes noise. For example, in the last 100 days alone we got 542 emails to the 'root@' alias. None of those emails served as a sanity check / security alert; they're just noise.
[17:21:28] hard to tell if people read them
[17:21:37] taavi: it's in the task - you'd need to subscribe to it to receive the emails
[17:21:56] if I had to guess then most people have some rule set up so that they don't hit their inbox
[17:22:10] taavi: If those messages go to a Google Group, those who are interested in receiving the emails can opt in to do so, while those of us who don't want them can stop receiving them. If you have other ideas on how to solve this issue please add them to the task. :)
[17:22:29] and then I'd argue it's better to delete the rules and discuss whether we need the mails or not
[17:23:14] ah, I somehow read that as being the other way around
[17:24:15] mutante: Creating those rules so they don't fall into our inboxes is detrimental; we wouldn't need to filter noisy alerts if they weren't sent in the first place.
[17:25:26] Regarding engineers reading them, I guess many of us do, but receiving 542 (63.76% of total) emails with subject "reprepro changes in public_apt_repository" notifying us of changes to reprepro may not be the best use of our time.
[17:26:42] I assume the alerts were put in place for a reason and I'm not sure if disabling them entirely is the right approach. I think that sending everything to an opt-in Google Group is a better solution, so those who are interested can still receive those emails and we'd also have the backlog, etc.
[17:27:33] i wonder if we could feed them to a wikitech page, or some other place that's open to people without a wikimedia email address
[17:27:34] I feel like we need to decide if they are important or not. if they are, we need to decide who is supposed to receive them; if they are not, we can disable them. Certainly there is little point in generating them only for everyone to filter them. ack
[17:28:34] sending them off to an opt-in group feels like we'd be in an undefined status
[17:30:46] I expect infra-foundations will have a stronger opinion on it.
[17:32:02] or maybe they can become "digest" versions that are sent out less frequently
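(Editor's note: the "digest" idea floated in the last message could look roughly like the sketch below - queue the individual notifications somewhere and send one summary mail per day. This is only an illustration of the idea; the spool directory, addresses, and local SMTP relay are all hypothetical.)

```python
# Sketch of the "digest" idea: batch queued reprepro notifications into one
# daily mail instead of sending each change individually. The spool path,
# addresses, and local SMTP relay are hypothetical.
import smtplib
from email.message import EmailMessage
from pathlib import Path

SPOOL = Path("/var/spool/reprepro-digest")  # assumed queue of notification texts

entries = sorted(SPOOL.glob("*.txt"))
if entries:
    msg = EmailMessage()
    msg["Subject"] = f"reprepro digest: {len(entries)} changes since the last run"
    msg["From"] = "reprepro@apt.example.org"
    msg["To"] = "root@example.org"
    msg.set_content("\n\n".join(p.read_text() for p in entries))
    with smtplib.SMTP("localhost") as smtp:  # assumed local relay
        smtp.send_message(msg)
    for p in entries:
        p.unlink()  # drain the spool once the digest has been sent
```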