[00:15:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:18:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:45:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:48:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:45:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:48:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:45:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:48:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:45:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:48:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:15:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:18:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:15:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
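For a unit flapping like this, the usual first look on the affected host is the unit's status and recent journal (standard systemctl/journalctl usage; the unit name is taken from the alert itself):

    systemctl status netbox_report_accounting_run.service
    journalctl -u netbox_report_accounting_run.service -n 50 --no-pager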
[05:18:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:45:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:48:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:08:45] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:15:46] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:18:45] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:45:46] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:48:45] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:03:27] hi effie, is there a way for us to verify that the files for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1037528 have synced everywhere?
[07:05:46] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:05:20] XioNoX, topranks: there are pending changes for "et-1-0-2-103.cr2-codfw" in the netbox/dns cookbook
[08:26:29] 10netbox, 06Infrastructure-Foundations, 13Patch-For-Review: Netbox: accounting report failure - https://phabricator.wikimedia.org/T366874#9874419 (10Volans) 05Open→03Resolved a:03Volans The problem was that a device had been inserted with a numeric serial that the spreadsheet saved as integer. I've...
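To illustrate the failure mode described above (a purely numeric serial that a spreadsheet round-trip coerces to an integer), a hypothetical check against the Netbox REST API could look like the following; NETBOX_URL, NETBOX_TOKEN, and the jq filter are illustrative assumptions, not the actual report code:

    # Flag devices whose serial is purely numeric and therefore at risk of
    # silent string->integer conversion in spreadsheet tooling.
    curl -s -H "Authorization: Token $NETBOX_TOKEN" \
      "$NETBOX_URL/api/dcim/devices/?limit=0" |
      jq -r '.results[] | select((.serial // "") | test("^[0-9]+$")) | [.name, .serial] | @tsv'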
[08:45:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:47:34] volans: sorry bout that, my bad
[09:09:33] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, and 2 others: Netbox errors caused by system board replacement - https://phabricator.wikimedia.org/T358542#9874533 (10Volans) I've also manually fixed a bunch of warnings due to a clearly mistyped phabricator task number in the spreadsheet....
[09:18:16] on netbox-next validators/dcim/interface.py has local modifications; I'm not touching it, but please revert it once done with the testing
[09:18:29] I'm also re-enabling puppet (it was left disabled)
[09:20:25] I'll probably stash it and re-apply it to deploy my changes
[09:30:02] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, and 2 others: Netbox errors caused by system board replacement - https://phabricator.wikimedia.org/T358542#9874557 (10Volans) 05Open→03Resolved This is now completed. The new runs are not alerting for these hosts with replaced mother...
[09:59:12] 10netops, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Netbox network report failing - timeout errors - https://phabricator.wikimedia.org/T321704#9874622 (10Volans) I've taken a look today and, trying to manually run all the tests, there isn't any that takes long enough to trigger the 300s timeout,...
[10:48:15] volans: those changes were left over from my testing, I've reverted them now
[10:48:27] (in the interface.py validator)
[10:49:31] k
[10:49:33] thx
[10:49:43] I've already deployed, stashing them and re-applying them
[10:49:45] just in case
[11:29:43] 10netops, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Netbox network report failing - timeout errors - https://phabricator.wikimedia.org/T321704#9874878 (10cmooney) >>! In T321704#9874622, @Volans wrote: > I've taken a look today and, trying to manually run all the tests, there isn't any that...
[11:41:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (91.07%) on ganeti1017:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[12:26:10] RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (93.65%) on ganeti1017:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[12:33:29] 10netops, 06Infrastructure-Foundations, 06SRE: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9875143 (10cmooney)
[12:40:14] hello folks!
[12:40:24] I am updating eventrouter and kube-state-metrics in aux-eqiad
[12:40:28] cc: cdanis
[13:36:45] thank you elukey !!
[13:37:14] cdanis: np! Jaeger is already rolled out, right?
[13:37:24] yes, I did that Friday
[13:37:32] <3
[13:43:05] kostajh: is there something specific you needed confirmed about that patch?
[13:43:18] just that the GeoLite2 files were available?
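One way to answer that question is to compare checksums across the fleet; a sketch using cumin, with the class query and file path that appear later in this log (any of the GeoLite2 files alongside it would work the same way):

    # If the sync worked, every host reports the same hash; a missing file or
    # a divergent hash points at a host that has not picked up the new data.
    sudo cumin 'C:geoip::data::puppet%fetch_ipinfo_dbs=true' \
      'sha256sum /usr/share/GeoIPInfo/GeoIP2-Enterprise.mmdb'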
[14:05:33] 10Mail, 10fundraising-tech-ops, 06Infrastructure-Foundations: Update fundraising mail settings to use new production mx hosts - https://phabricator.wikimedia.org/T366740#9875478 (10jhathaway) p:05Triage→03Medium
[14:09:11] 10SRE-tools, 06Infrastructure-Foundations: Add option to exclude nodes from reboot by uptime or last reboot date - https://phabricator.wikimedia.org/T366797#9875489 (10MoritzMuehlenhoff) I think we should rather base this on a given kernel version? Seems more robust than a given date.
[14:25:37] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9875546 (10cmooney) p:05Triage→03Medium
[14:29:21] 10SRE-tools, 06Infrastructure-Foundations: Add option to exclude nodes from reboot by uptime or last reboot date - https://phabricator.wikimedia.org/T366797#9875574 (10Volans) p:05Triage→03Medium
[14:58:09] cdanis: yeah, just that the files are available. Giuseppe helped confirm that they are. I need to check later in the week that they updated successfully
[14:59:50] 10netops, 06Infrastructure-Foundations, 06SRE, 10SRE-swift-storage, 06Traffic: Rise in ms-fe2* TCP retransmits since 11:40 UTC today - https://phabricator.wikimedia.org/T367056#9875713 (10MatthewVernon)
[15:01:47] kostajh: do those files also get baked into the image used for k8s?
[15:06:00] cdanis: I’m not sure. Longer term we may try to move the data into an OpenSearch instance https://phabricator.wikimedia.org/T357753#9874390
[15:08:00] ah, I hope an OpenSearch instance with an SLA though :)
[15:09:08] heh 😈
[15:13:45] 10netops, 06Infrastructure-Foundations, 06SRE: Juniper QFX5120 error logs on lsw1-e1 and lsw1-f1: Failed to get ifl for ifl index - https://phabricator.wikimedia.org/T325801#9875780 (10cmooney) 05Open→03Resolved We seem to have no such errors being logged any more, either from these switches or the d...
[15:20:10] cdanis: kostajh, these files are deployed by puppet to the k8s nodes and mounted as hostPath
[15:20:16] thanks claime
[15:20:29] makes sense
[15:39:07] kostajh: claime: it looks like the new files are missing from most of codfw?
[15:39:26] hmmm
[15:40:08] https://phabricator.wikimedia.org/P64540
[15:42:03] claime: the new files aren't in the volatile dir on the secondary puppetmasters, honestly not sure if that matters or not
[15:42:41] cdanis: I don’t know why that would be the case. Cc effie who worked on this with me
[15:42:54] or on the new puppetservers
[15:43:32] I guess the problem is it's not on all puppetservers?
[15:43:36] I'm not sure
[15:43:56] It's pulling from volatile
[15:43:58] > files should be added to the puppet ca host. they are then rsynced from the other puppetmasters via a systemd::timer::job (sync-puppet-volatile.timer).
[15:44:49] that must be out of date?
[15:45:04] possibly yes
[15:45:19] that timer/service only exists on puppetmaster2001
[15:45:43] on the puppetservers it has been ... starting up? for a month and a half?
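A quick way to confirm that on any one host, with standard systemd tooling (unit names as above):

    # Show the timer's schedule and the state of the service it triggers.
    systemctl list-timers 'sync-puppet-volatile*'
    systemctl status sync-puppet-volatile.service

    # A healthy oneshot unit passes through "activating" in seconds; anything
    # that has sat in that state for weeks is stuck.
    systemctl list-units --state=activating --no-legend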
[15:49:37] _joe_: what did you check with kosta this morning? just to figure out where things are
[15:50:24] <_joe_> effie: that the maxmind files were correctly loaded
[15:50:36] <_joe_> in k8s
[15:50:46] <_joe_> they were, he thought not IIRC
[15:51:00] _joe_: the geolite files are missing from most of codfw.
[15:51:04] https://phabricator.wikimedia.org/P64540
[15:51:38] alright cool, we have either missed something in the code, or something in how this mess generally works :p
[15:51:58] it also looks like there's something quite wrong with the puppetmaster rsync jobs for volatile
[15:52:04] but I haven't figured that out yet
[15:52:22] <_joe_> cdanis: ah sigh, ofc I checked an eqiad node and pod
[15:52:33] ● sync-puppet-volatile.service - rsync puppet volatile data from primary server
[15:52:35] Loaded: loaded (/lib/systemd/system/sync-puppet-volatile.service; static)
[15:52:37] Active: activating (start) since Sat 2024-04-27 00:12:03 UTC; 1 month 14 days ago
[15:52:39] like, *what*
[15:52:40] <_joe_> sigh
[15:52:55] <_joe_> ok that kind of explains it I guess :P
[15:53:08] <_joe_> there's more data that is out of sync in codfw than just geoip :/
[15:53:57] the timer doesn't even exist on some puppetmasters _joe_
[15:54:49] <_joe_> cdanis: it should be only on the frontends
[15:54:59] <_joe_> because that's where volatile is served from
[15:55:21] <_joe_> so 1001, 2001, and whatever puppetserver acts as a frontend
[15:55:37] hm
[15:56:06] ok well maybe it's okay then
[15:56:16] so then we have to figure out why they're missing from codfw nodes
[15:57:07] they are on puppetmaster2001 but *not* on any codfw puppetserver
[15:58:54] <_joe_> ok so I guess the volatile sync got borked in the transition to puppetserver?
[15:59:03] <_joe_> but i never followed it
[15:59:06] something dumber than that happened
[16:00:24] https://phabricator.wikimedia.org/P64541
[16:00:37] the rsync is literally just hung waiting on a socket that I really doubt still actually exists
[16:01:30] :(
[16:02:31] I restarted it
[16:02:35] it has synced
[16:03:10] the GeoLite2 files are there
[16:04:35] also I can verify now that new puppet runs are fixing it in codfw
[16:05:12] I'm going to kick off puppet runs on all codfw hosts that should have the file (per puppetdb) and don't
[16:05:49] +1
[16:06:33] we uh
[16:06:45] we should probably put a timeout
[16:09:58] on the job overall
[16:10:02] possibly, by default on all timers
[16:10:55] yeah, probably better with RuntimeMaxSec or similar on the systemd side than finding each job's own timeout and hoping it will be respected
[16:11:02] yeah
[16:11:29] ideally with an alert when it does time out
[16:11:58] and probably RuntimeMaxSec will automatically trigger the failed systemd unit alert, so we should be good on that side
[16:16:43] yeah
[16:16:47] that's what I was thinking
[16:16:51] and in this case, a simple restart fixes it
[16:17:09] makes sense to me
[16:19:44] volans: the main thing I'm wondering right now is if we need to do any more work to look for other impacts, or, to write an incident report
[16:20:13] from the puppet runs what else did you notice in terms of diffs?
[16:20:25] I haven't looked
[16:20:28] and
[16:20:36] * volans doesn't recall if volatile files are shown in the diff
[16:20:40] I'm only running puppet by hand on `C:geoip::data::puppet%fetch_ipinfo_dbs=true`
[16:20:49] by now it has run everywhere
[16:20:56] not quite yet :)
[16:21:04] it's only been about 15 minutes
[16:21:14] my cumin run is 62%
[16:21:24] how small was your batch size? :D
[16:21:29] 8
[16:21:34] too small :D
[16:21:40] ok I wasn't sure if that was still true or not
[16:21:41] for puppet
[16:23:29] I *think* we're in a much better state now; I also seem to remember that someone ran a large batch without a batch size, so technically capped at 64 by cumin's config
[16:23:35] and puppetservers/masters didn't die
[16:23:43] prior work was done in T280622
[16:23:43] T280622: Determine safe concurrent puppet run batches via cumin - https://phabricator.wikimedia.org/T280622
[16:23:44] cool
[16:23:46] I restarted with 24
[16:24:05] it is only rerunning puppet if the file doesn't exist, so
[16:24:10] so I'd say 30~40 should be totally fine, and probably we can live even without batch nowadays between the split of masters/servers
[16:24:15] ack
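For reference, a sketch of the kind of batched run being discussed; the query string is quoted above, while -b (batch size) and the run-puppet-agent wrapper are assumptions about the exact invocation:

    # Re-run the puppet agent on matching hosts, 24 at a time.
    sudo cumin -b 24 'C:geoip::data::puppet%fetch_ipinfo_dbs=true' 'run-puppet-agent'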
[16:35:18] volans: Notice: /Stage[main]/Geoip::Data::Puppet/File[/usr/share/GeoIPInfo/GeoIP2-Enterprise.mmdb]/content: content changed '{sha256}c00c314d96ee26c978584df4bb6072519ff76196abd45cb3b9a4e9de78d85653' to '{sha256}293719dd92dc4522dd381f0632576ae87525205c6d9e1e77c21618345ab947a3'
[16:35:23] so yeah, it does get logged
[16:36:01] uhmmm
[16:36:04] except it's still missing from some nodes
[16:36:20] nice, so potentially we could look into puppetdb for other volatile files, but there is no mention of "volatile" so it would be a bit of a mess to find them all
[16:36:43] probably easier to search by modification time for files in volatile on the puppetmaster
[16:36:59] are *all* of the puppetservers frontends?
[16:37:00] and see if there is anything that could generate user-visible issues
[16:37:02] did that change
[16:37:10] jhathaway: ^^^
[16:37:28] I'm guessing that's true and I actually need to restart the systemd service backing the timer on all of them
[16:38:06] yes, they are all frontends
[16:38:10] ack
[16:40:05] https://phabricator.wikimedia.org/P64542 for posterity
[16:41:11] (6) puppetmaster2001.codfw.wmnet,puppetserver[2001-2003].codfw.wmnet,puppetserver[1002-1003].eqiad.wmnet
[16:41:13] ----- OUTPUT of 'systemctl stop sync-puppet-volatile' -----
[16:41:15] Warning: Stopping sync-puppet-volatile.service, but it can still be activated by:
[16:41:17] sync-puppet-volatile.timer
[16:48:10] cdanis: too bad we didn't have any monitoring around that timer, probably valuable to add
[16:48:39] jhathaway: yeah, I'm ... I'm really curious what made it hang on *all* the other servers around that time
[16:49:03] does it align with anything in SAL?
[16:49:11] for sure
[16:49:13] I haven't looked, I've been focusing on meetings or fixing it
[16:49:27] I think as volans said we can just add RuntimeMaxSec to the timer (perhaps a very long default for all timers, to prevent a really weird occurrence like this again)
[16:52:46] okay so, after fixing the sync job on the puppetservers, we now have the new files on all the codfw hosts
[16:53:18] \o/
[16:53:26] thanks for digging into this
[16:53:30] indeed!
[16:53:57] I'll write up at least a Phab task about this in my afternoon
[16:55:02] cdanis: regarding RuntimeMaxSec, "Note that this setting does not have any effect on Type=oneshot services, as they terminate immediately after activation completed."; instead you want TimeoutStartSec
[16:55:10] ah, thanks
[16:55:26] there is quite a bit of misuse of RuntimeMaxSec in our puppetry, need to cut a phab task to address it
[16:55:58] doh, sorry, I did say "RuntimeMaxSec or similar" to cover my back :D
[16:56:14] thanks for the clarification
[16:56:56] :)
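To make the distinction concrete, a minimal sketch of such a bound as a drop-in for this unit (the one-hour value is an arbitrary example, not a value chosen here):

    sudo mkdir -p /etc/systemd/system/sync-puppet-volatile.service.d
    sudo tee /etc/systemd/system/sync-puppet-volatile.service.d/timeout.conf <<'EOF'
    [Service]
    # RuntimeMaxSec is ignored for Type=oneshot units; TimeoutStartSec bounds
    # how long the unit may stay in "activating" before systemd fails it,
    # which in turn fires the SystemdUnitFailed alert.
    TimeoutStartSec=1h
    EOF
    sudo systemctl daemon-reload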
[17:02:54] cdanis & _joe_ thank you for figuring out the mess there, my impression last week was "everything will be in place on monday"
[17:03:04] I'm glad kostajh kept asking about it
[17:03:24] I went to write a simple one-liner in cumin, thinking, oh, they'll all have the same checksum, it will be easy to verify that it's working
[17:03:40] and then after briefly going down a rabbit hole of the Puppet grammar I got very concerned
[17:03:55] hehe
[17:05:59] <_joe_> I only literally gave out the wrong info :D
[18:06:24] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, and 2 others: Netbox errors caused by system board replacement - https://phabricator.wikimedia.org/T358542#9876833 (10wiki_willy) Thanks @Volans, will do on the remaining Netbox errors. >>! In T358542#9874557, @Volans wrote: > This is...
[18:40:06] ok so
[18:40:08] uh
[18:40:10] Active: activating (start) since Sat 2024-04-27 00:12:03 UTC; 1 month 14 days ago
[18:40:20] Apr 27 00:12:03 puppetmaster1001 /puppet-merge.py: (private) Starting merge for: /var/lib/git/labs/private
[18:40:22] Apr 27 00:12:04 puppetmaster1001 /puppet-merge.py: (puppet) Starting merge for: /var/lib/git/operations/puppet
[18:40:50] this can't be a coincidence, right?
[18:42:19] jhathaway: it's hard to say this is a smoking gun, because 1) how could this affect rsync'ing files from volatile, and 2) it's been a month and a half
[18:42:25] but that timing is really weird.
[18:44:56] hmm
[18:46:04] possibly just coincidence
[18:46:10] all the other rsync jobs got stuck later
[18:46:17] Active: activating (start) since Sat 2024-04-27 00:08:19 UTC; 1 month 14 days ago
[18:46:21] Active: activating (start) since Sat 2024-04-27 00:10:38 UTC; 1 month 14 days ago
[18:46:24] Active: activating (start) since Sat 2024-04-27 00:12:03 UTC; 1 month 14 days ago
[18:48:51] nod, very strange. this is regular rsync, not rsync over ssh?
[18:49:15] yeah, via our own rsync module
[18:50:28] right, seems surprising it would just hang like that
[18:51:28] yeah, I'm not sure what happened, aside from the fact that the one client side was blocked on select() waiting for the socket to be readable
[18:51:39] the one client side I straced before just killing it, I mean
[18:53:02] --timeout=SECONDS     set I/O timeout in seconds
[18:53:04] --contimeout=SECONDS  set daemon connection timeout in seconds
[18:53:11] these look interesting, if we don't already set them
[18:53:14] yeah, seems like we should be setting those
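A sketch of those flags applied to this job's pull; the daemon host and the puppet_volatile module name appear in the log above, while the destination path is an assumption:

    # Give up after 60s if the daemon connection cannot be established, and
    # after 300s without I/O, instead of blocking forever on a dead socket.
    rsync -a --contimeout=60 --timeout=300 \
      rsync://puppetmaster1001.eqiad.wmnet/puppet_volatile/ \
      /var/lib/puppet/volatile/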
[18:56:03] here's something else interesting: puppetserver1003, which didn't get hard stuck ... also didn't work for a ~1.5hr window around the time of the issue
[18:56:07] puppetserver1001/syslog.log-20240427.gz:Apr 26 23:55:21 puppetserver1001 rsyncd[1557293]: rsync on puppet_volatile/ from puppetserver1003.eqiad.wmnet (10.64.0.23)
[18:56:09] puppetserver1001/syslog.log-20240428.gz:Apr 27 01:10:14 puppetserver1001 rsyncd[21358]: rsync on puppet_volatile/ from puppetserver1003.eqiad.wmnet (10.64.0.23)
[18:59:38] interesting
[19:04:02] hm
[19:05:08] this is also interesting, I assume just an artifact of puppetization as it is https://phabricator.wikimedia.org/P64549
[19:07:39] oh, I guess the other thing that I realized -- the 'trigger' could be either that puppet-merge, or, it could have been the top of the hour / the day transition
[19:07:51] can't falsify either from the data we have
[19:08:07] I am probably giving up on diagnosing this
[19:08:29] or at least on actively trying to; I can tell it's gonna sit in the back of my head for a while 🙃
[19:25:33] 07Puppet, 06Infrastructure-Foundations: Puppetmaster volatile data not synced to all puppet frontends for a month and a half (2024-04-27 to 2024-06-10) - https://phabricator.wikimedia.org/T367113 (10CDanis) 03NEW
[20:17:45] 07Puppet, 06Infrastructure-Foundations: Puppetmaster volatile data not synced to all puppet frontends for a month and a half (2024-04-27 to 2024-06-10) - https://phabricator.wikimedia.org/T367113#9877330 (10CDanis)
[20:56:24] 07Puppet, 06Infrastructure-Foundations: Install a default timeout for systemd::timer::jobs - https://phabricator.wikimedia.org/T367119 (10CDanis) 03NEW
[20:57:28] would appreciate any input on T367119 :)
[20:57:29] T367119: Install a default timeout for systemd::timer::jobs - https://phabricator.wikimedia.org/T367119
[21:11:13] will do cdanis, thanks
[22:07:18] 10Mail, 10fundraising-tech-ops, 06Infrastructure-Foundations: Update fundraising mail settings to use new production mx hosts - https://phabricator.wikimedia.org/T366740#9877740 (10Dwisehaupt) @jhathaway Thanks. I have shifted our codfw hosts to use the new mx-out hosts. That is the secondary datacenter. We'...
[22:12:24] 10Mail, 10fundraising-tech-ops, 06Infrastructure-Foundations: Update fundraising mail settings to use new production mx hosts - https://phabricator.wikimedia.org/T366740#9877747 (10jhathaway) >>! In T366740#9877740, @Dwisehaupt wrote: > @jhathaway Thanks. I have shifted our codfw hosts to use the new mx-out...
[23:23:45] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed