[00:15:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:18:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:45:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:48:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:45:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:48:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:45:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:48:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:45:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:48:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:15:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:18:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:15:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
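For a unit flapping like this, the usual first look on the affected host is the unit's status and recent journal (standard systemctl/journalctl usage; the unit name is taken from the alert itself):

    systemctl status netbox_report_accounting_run.service
    journalctl -u netbox_report_accounting_run.service -n 50 --no-pager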
[05:18:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:45:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:48:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:08:45] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:15:46] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:18:45] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:45:46] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:48:45] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:03:27] hi effie, is there a way for us to verify that the files for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1037528 have synced everywhere?
[07:05:46] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:05:20] XioNoX, topranks: there are pending changes for "et-1-0-2-103.cr2-codfw" in the netbox/dns cookbook
[08:26:29] 10netbox, 06Infrastructure-Foundations, 13Patch-For-Review: Netbox: accounting report failure - https://phabricator.wikimedia.org/T366874#9874419 (10Volans) 05Open→03Resolved a:03Volans The problem was that a device had been inserted with a numeric serial that the spreadsheet saved as integer. I've...
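To illustrate the failure mode described above (a purely numeric serial that a spreadsheet round-trip coerces to an integer), a hypothetical check against the Netbox REST API could look like the following; NETBOX_URL, NETBOX_TOKEN, and the jq filter are illustrative assumptions, not the actual report code:

    # Flag devices whose serial is purely numeric and therefore at risk of
    # silent string->integer conversion in spreadsheet tooling.
    curl -s -H "Authorization: Token $NETBOX_TOKEN" \
      "$NETBOX_URL/api/dcim/devices/?limit=0" |
      jq -r '.results[] | select((.serial // "") | test("^[0-9]+$")) | [.name, .serial] | @tsv'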
[08:45:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:47:34] volans: sorry bout that, my bad
[09:09:33] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, and 2 others: Netbox errors caused by system board replacement - https://phabricator.wikimedia.org/T358542#9874533 (10Volans) I've also manually fixed a bunch of warnings due to a clearly mistyped phabricator task number in the spreadsheet....
[09:18:16] on netbox-next validators/dcim/interface.py has local modifications; I'm not touching it, but please revert it once done with the testing
[09:18:29] I'm also re-enabling puppet (it was left disabled)
[09:20:25] I'll probably stash it and re-apply it to deploy my changes
[09:30:02] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, and 2 others: Netbox errors caused by system board replacement - https://phabricator.wikimedia.org/T358542#9874557 (10Volans) 05Open→03Resolved This is now completed. The new runs are not alerting for these hosts with replaced mother...
[09:59:12] 10netops, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Netbox network report failing - timeout errors - https://phabricator.wikimedia.org/T321704#9874622 (10Volans) I've taken a look today and, trying to manually run all the tests, there isn't any that takes long enough to trigger the 300s timeout,...
[10:48:15] volans: those changes were left over from my testing, I've reverted them now
[10:48:27] (in the interface.py validator)
[10:49:31] k
[10:49:33] thx
[10:49:43] I've already deployed, stashing them and re-applying them
[10:49:45] just in case
[11:29:43] 10netops, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Netbox network report failing - timeout errors - https://phabricator.wikimedia.org/T321704#9874878 (10cmooney) >>! In T321704#9874622, @Volans wrote: > I've taken a look today and, trying to manually run all the tests, there isn't any that...
[11:41:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (91.07%) on ganeti1017:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[12:26:10] RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (93.65%) on ganeti1017:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[12:33:29] 10netops, 06Infrastructure-Foundations, 06SRE: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9875143 (10cmooney)
[12:40:14] hello folks!
[12:40:24] I am updating eventrouter and kube-state-metrics in aux-eqiad
[12:40:28] cc: cdanis
[13:36:45] thank you elukey !!
[13:37:14] cdanis: np! Jaeger is already rolled out, right?
[13:37:24] yes, I did that Friday
[13:37:32] <3
[13:43:05] kostajh: is there something specific you needed confirmed about that patch?
[13:43:18] just that the GeoLite2 files were available?
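One way to answer that question is to compare checksums across the fleet; a sketch using cumin, with the class query and file path that appear later in this log (any of the GeoLite2 files alongside it would work the same way):

    # If the sync worked, every host reports the same hash; a missing file or
    # a divergent hash points at a host that has not picked up the new data.
    sudo cumin 'C:geoip::data::puppet%fetch_ipinfo_dbs=true' \
      'sha256sum /usr/share/GeoIPInfo/GeoIP2-Enterprise.mmdb'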
[14:05:33] 10Mail, 10fundraising-tech-ops, 06Infrastructure-Foundations: Update fundraising mail settings to use new production mx hosts - https://phabricator.wikimedia.org/T366740#9875478 (10jhathaway) p:05Triage→03Medium
[14:09:11] 10SRE-tools, 06Infrastructure-Foundations: Add option to exclude nodes from reboot by uptime or last reboot date - https://phabricator.wikimedia.org/T366797#9875489 (10MoritzMuehlenhoff) I think we should rather base this on a given kernel version? Seems more robust than a given date.
[14:25:37] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9875546 (10cmooney) p:05Triage→03Medium
[14:29:21] 10SRE-tools, 06Infrastructure-Foundations: Add option to exclude nodes from reboot by uptime or last reboot date - https://phabricator.wikimedia.org/T366797#9875574 (10Volans) p:05Triage→03Medium
[14:58:09] cdanis: yeah, just that the files are available. Giuseppe helped confirm that they are. I need to check later in the week that they updated successfully
[14:59:50] 10netops, 06Infrastructure-Foundations, 06SRE, 10SRE-swift-storage, 06Traffic: Rise in ms-fe2* TCP retransmits since 11:40 UTC today - https://phabricator.wikimedia.org/T367056#9875713 (10MatthewVernon)
[15:01:47] kostajh: do those files also get baked into the image used for k8s?
[15:06:00] cdanis: I’m not sure. Longer term we may try to move the data into an OpenSearch instance https://phabricator.wikimedia.org/T357753#9874390
[15:08:00] ah, I hope an OpenSearch instance with an SLA though :)
[15:09:08] heh 😈
[15:13:45] 10netops, 06Infrastructure-Foundations, 06SRE: Juniper QFX5120 error logs on lsw1-e1 and lsw1-f1: Failed to get ifl for ifl index - https://phabricator.wikimedia.org/T325801#9875780 (10cmooney) 05Open→03Resolved We seem to have no such errors being logged any more, either from these switches or the d...
[15:20:10] cdanis: kostajh, these files are deployed by puppet to the k8s nodes and mounted as hostPath
[15:20:16] thanks claime
[15:20:29] makes sense
[15:39:07] kostajh: claime: it looks like the new files are missing from most of codfw?
[15:39:26] hmmm
[15:40:08] https://phabricator.wikimedia.org/P64540
[15:42:03] claime: the new files aren't in the volatile dir on the secondary puppetmasters, honestly not sure if that matters or not
[15:42:41] cdanis: I don’t know why that would be the case. Cc effie who worked on this with me
[15:42:54] or on the new puppetservers
[15:43:32] I guess the problem is it's not on all puppetservers?
[15:43:36] I'm not sure
[15:43:56] It's pulling from volatile
[15:43:58] > files should be added to the puppet ca host. they are then rsynced from the other puppetmasters via a systemd::timer::job (sync-puppet-volatile.timer).
[15:44:49] that must be out of date?
[15:45:04] possibly yes
[15:45:19] that timer/service only exists on puppetmaster2001
[15:45:43] on the puppetservers it has been ... starting up? for a month and a half?
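A quick way to confirm that on any one host, with standard systemd tooling (unit names as above):

    # Show the timer's schedule and the state of the service it triggers.
    systemctl list-timers 'sync-puppet-volatile*'
    systemctl status sync-puppet-volatile.service

    # A healthy oneshot unit passes through "activating" in seconds; anything
    # that has sat in that state for weeks is stuck.
    systemctl list-units --state=activating --no-legend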
[15:49:37] _joe_: what did you check with kosta this morning? just to figure out where things are
[15:50:24] <_joe_> effie: that the maxmind files were correctly loaded
[15:50:36] <_joe_> in k8s
[15:50:46] <_joe_> they were, he thought not IIRC
[15:51:00] _joe_: the geolite files are missing from most of codfw.
[15:51:04] https://phabricator.wikimedia.org/P64540
[15:51:38] alright cool, we have either missed something in the code, or something in how this mess generally works :p
[15:51:58] it also looks like there's something quite wrong with the puppetmaster rsync jobs for volatile
[15:52:04] but I haven't figured that out yet
[15:52:22] <_joe_> cdanis: ah sigh, ofc I checked an eqiad node and pod
[15:52:33] ● sync-puppet-volatile.service - rsync puppet volatile data from primary server
[15:52:35] Loaded: loaded (/lib/systemd/system/sync-puppet-volatile.service; static)
[15:52:37] Active: activating (start) since Sat 2024-04-27 00:12:03 UTC; 1 month 14 days ago
[15:52:39] like, *what*
[15:52:40] <_joe_> sigh
[15:52:55] <_joe_> ok that kind of explains it I guess :P
[15:53:08] <_joe_> there's more data that is out of sync in codfw than just geoip :/
[15:53:57] the timer doesn't even exist on some puppetmasters _joe_
[15:54:49] <_joe_> cdanis: it should be only on the frontends
[15:54:59] <_joe_> because that's where volatile is served from
[15:55:21] <_joe_> so 1001, 2001, and whatever puppetserver acts as a frontend
[15:55:37] hm
[15:56:06] ok well maybe it's okay then
[15:56:16] so then we have to figure out why they're missing from codfw nodes
[15:57:07] they are on puppetmaster2001 but *not* on any codfw puppetserver
[15:58:54] <_joe_> ok so I guess the volatile sync got borked in the transition to puppetserver?
[15:59:03] <_joe_> but i never followed it
[15:59:06] something dumber than that happened
[16:00:24] https://phabricator.wikimedia.org/P64541
[16:00:37] the rsync is literally just hung waiting on a socket that I really doubt still actually exists
[16:01:30] :(
[16:02:31] I restarted it
[16:02:35] it has synced
[16:03:10] the GeoLite2 files are there
[16:04:35] also I can verify now that new puppet runs are fixing it in codfw
[16:05:12] I'm going to kick off puppet runs on all codfw hosts that should have the file (per puppetdb) and don't
[16:05:49] +1
[16:06:33] we uh
[16:06:45] we should probably put a timeout
[16:09:58] on the job overall
[16:10:02] possibly, by default on all timers
[16:10:55] yeah, probably better with RuntimeMaxSec or similar on the systemd side than finding each job's own timeout and hoping it will be respected
[16:11:02] yeah
[16:11:29] ideally with an alert when it does time out
[16:11:58] and probably RuntimeMaxSec will automatically trigger the failed systemd unit alert, so we should be good on that side
[16:16:43] yeah
[16:16:47] that's what I was thinking
[16:16:51] and in this case, a simple restart fixes it
[16:17:09] makes sense to me
[16:19:44] volans: the main thing I'm wondering right now is if we need to do any more work to look for other impacts, or, to write an incident report
[16:20:13] from the puppet runs what else did you notice in terms of diffs?
[16:20:25] I haven't looked
[16:20:28] and
[16:20:36] * volans doesn't recall if volatile files are shown in the diff
[16:20:40] I'm only running puppet by hand on `C:geoip::data::puppet%fetch_ipinfo_dbs=true`
[16:20:49] by now it has run everywhere
[16:20:56] not quite yet :)
[16:21:04] it's only been about 15 minutes
[16:21:14] my cumin run is 62%
[16:21:24] how small was your batch size? :D
[16:21:29] 8
[16:21:34] too small :D
[16:21:40] ok I wasn't sure if that was still true or not
[16:21:41] for puppet
[16:23:29] I *think* we're in a much better state now; I also seem to remember that someone ran a large batch without a batch size, so technically capped at 64 by cumin's config
[16:23:35] and puppetservers/masters didn't die
[16:23:43] prior work was done in T280622
[16:23:43] T280622: Determine safe concurrent puppet run batches via cumin - https://phabricator.wikimedia.org/T280622
[16:23:44] cool
[16:23:46] I restarted with 24
[16:24:05] it is only rerunning puppet if the file doesn't exist, so
[16:24:10] so I'd say 30~40 should be totally fine, and probably we can live even without batch nowadays between the split of masters/servers
[16:24:15] ack
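For reference, a sketch of the kind of batched run being discussed; the query string is quoted above, while -b (batch size) and the run-puppet-agent wrapper are assumptions about the exact invocation:

    # Re-run the puppet agent on matching hosts, 24 at a time.
    sudo cumin -b 24 'C:geoip::data::puppet%fetch_ipinfo_dbs=true' 'run-puppet-agent'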
[16:35:18] volans: Notice: /Stage[main]/Geoip::Data::Puppet/File[/usr/share/GeoIPInfo/GeoIP2-Enterprise.mmdb]/content: content changed '{sha256}c00c314d96ee26c978584df4bb6072519ff76196abd45cb3b9a4e9de78d85653' to '{sha256}293719dd92dc4522dd381f0632576ae87525205c6d9e1e77c21618345ab947a3'
[16:35:23] so yeah, it does get logged
[16:36:01] uhmmm
[16:36:04] except it's still missing from some nodes
[16:36:20] nice, so potentially we could look into puppetdb for other volatile files, but there is no mention of "volatile" so it would be a bit of a mess to find them all
[16:36:43] probably easier to search by modification time for files in volatile on the puppetmaster
[16:36:59] are *all* of the puppetservers frontends?
[16:37:00] and see if there is anything that could generate user-visible issues
[16:37:02] did that change
[16:37:10] jhathaway: ^^^
[16:37:28] I'm guessing that's true and I actually need to restart the systemd service backing the timer on all of them
[16:38:06] yes, they are all frontends
[16:38:10] ack
[16:40:05] https://phabricator.wikimedia.org/P64542 for posterity
[16:41:11] (6) puppetmaster2001.codfw.wmnet,puppetserver[2001-2003].codfw.wmnet,puppetserver[1002-1003].eqiad.wmnet
[16:41:13] ----- OUTPUT of 'systemctl stop sync-puppet-volatile' -----
[16:41:15] Warning: Stopping sync-puppet-volatile.service, but it can still be activated by:
[16:41:17] sync-puppet-volatile.timer
[16:48:10] cdanis: too bad we didn't have any monitoring around that timer, probably valuable to add
[16:48:39] jhathaway: yeah, I'm ... I'm really curious what made it hang on *all* the other servers around that time
[16:49:03] does it align with anything in SAL?
[16:49:11] for sure
[16:49:13] I haven't looked, I've been focusing on meetings or fixing it
[16:49:27] I think as volans said we can just add RuntimeMaxSec to the timer (perhaps a very long default for all timers, to prevent a really weird occurrence like this again)
[16:52:46] okay so, after fixing the sync job on the puppetservers, we now have the new files on all the codfw hosts
[16:53:18] \o/
[16:53:26] thanks for digging into this
[16:53:30] indeed!
[16:53:57] I'll write up at least a Phab task about this in my afternoon
[16:55:02] cdanis: regarding RuntimeMaxSec, "Note that this setting does not have any effect on Type=oneshot services, as they terminate immediately after activation completed."; instead you want TimeoutStartSec
[16:55:10] ah, thanks
[16:55:26] there is quite a bit of misuse of RuntimeMaxSec in our puppetry, need to cut a phab task to address it
[16:55:58] doh, sorry, I did say "RuntimeMaxSec or similar" to cover my back :D
[16:56:14] thanks for the clarification
[16:56:56] :)
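To make the distinction concrete, a minimal sketch of such a bound as a drop-in for this unit (the one-hour value is an arbitrary example, not a value chosen here):

    sudo mkdir -p /etc/systemd/system/sync-puppet-volatile.service.d
    sudo tee /etc/systemd/system/sync-puppet-volatile.service.d/timeout.conf <<'EOF'
    [Service]
    # RuntimeMaxSec is ignored for Type=oneshot units; TimeoutStartSec bounds
    # how long the unit may stay in "activating" before systemd fails it,
    # which in turn fires the SystemdUnitFailed alert.
    TimeoutStartSec=1h
    EOF
    sudo systemctl daemon-reload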
[17:02:54] cdanis & _joe_ thank you for figuring out the mess there, my impression last week was "everything will be in place on monday"
[17:03:04] I'm glad kostajh kept asking about it
[17:03:24] I went to write a simple one-liner in cumin, thinking, oh, they'll all have the same checksum, it will be easy to verify that it's working
[17:03:40] and then after briefly going down a rabbit hole of the Puppet grammar I got very concerned
[17:03:55] hehe
[17:05:59] <_joe_> I only literally gave out the wrong info :D
[18:06:24] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, and 2 others: Netbox errors caused by system board replacement - https://phabricator.wikimedia.org/T358542#9876833 (10wiki_willy) Thanks @Volans, will do on the remaining Netbox errors. >>! In T358542#9874557, @Volans wrote: > This is...
[18:40:06] ok so
[18:40:08] uh
[18:40:10] Active: activating (start) since Sat 2024-04-27 00:12:03 UTC; 1 month 14 days ago
[18:40:20] Apr 27 00:12:03 puppetmaster1001 /puppet-merge.py: (private) Starting merge for: /var/lib/git/labs/private
[18:40:22] Apr 27 00:12:04 puppetmaster1001 /puppet-merge.py: (puppet) Starting merge for: /var/lib/git/operations/puppet
[18:40:50] this can't be a coincidence, right?
[18:42:19] jhathaway: it's hard to say this is a smoking gun, because 1) how could this affect rsync'ing files from volatile, and 2) it's been a month and a half
[18:42:25] but that timing is really weird.
[18:44:56] hmm
[18:46:04] possibly just coincidence
[18:46:10] all the other rsync jobs got stuck later
[18:46:17] Active: activating (start) since Sat 2024-04-27 00:08:19 UTC; 1 month 14 days ago
[18:46:21] Active: activating (start) since Sat 2024-04-27 00:10:38 UTC; 1 month 14 days ago
[18:46:24] Active: activating (start) since Sat 2024-04-27 00:12:03 UTC; 1 month 14 days ago
[18:48:51] nod, very strange. this is regular rsync, not rsync over ssh?
[18:49:15] yeah, via our own rsync module
[18:50:28] right, seems surprising it would just hang like that
[18:51:28] yeah, I'm not sure what happened, aside from the fact that the one client side was blocked on select() waiting for the socket to be readable
[18:51:39] the one client side I straced before just killing it, I mean
[18:53:02] --timeout=SECONDS     set I/O timeout in seconds
[18:53:04] --contimeout=SECONDS  set daemon connection timeout in seconds
[18:53:11] these look interesting, if we don't already set them
[18:53:14] yeah, seems like we should be setting those
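A sketch of those flags applied to this job's pull; the daemon host and the puppet_volatile module name appear in the log above, while the destination path is an assumption:

    # Give up after 60s if the daemon connection cannot be established, and
    # after 300s without I/O, instead of blocking forever on a dead socket.
    rsync -a --contimeout=60 --timeout=300 \
      rsync://puppetmaster1001.eqiad.wmnet/puppet_volatile/ \
      /var/lib/puppet/volatile/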
[18:56:03] here's something else interesting: puppetserver1003, which didn't get hard stuck ... also didn't work for a ~1.5hr window around the time of the issue
[18:56:07] puppetserver1001/syslog.log-20240427.gz:Apr 26 23:55:21 puppetserver1001 rsyncd[1557293]: rsync on puppet_volatile/ from puppetserver1003.eqiad.wmnet (10.64.0.23)
[18:56:09] puppetserver1001/syslog.log-20240428.gz:Apr 27 01:10:14 puppetserver1001 rsyncd[21358]: rsync on puppet_volatile/ from puppetserver1003.eqiad.wmnet (10.64.0.23)
[18:59:38] interesting
[19:04:02] hm
[19:05:08] this is also interesting, I assume just an artifact of puppetization as it is https://phabricator.wikimedia.org/P64549
[19:07:39] oh, I guess the other thing that I realized -- the 'trigger' could be either that puppet-merge, or, it could have been the top of the hour / the day transition
[19:07:51] can't falsify either from the data we have
[19:08:07] I am probably giving up on diagnosing this
[19:08:29] or at least on actively trying to; I can tell it's gonna sit in the back of my head for a while 🙃
[19:25:33] 07Puppet, 06Infrastructure-Foundations: Puppetmaster volatile data not synced to all puppet frontends for a month and a half (2024-04-27 to 2024-06-10) - https://phabricator.wikimedia.org/T367113 (10CDanis) 03NEW
[20:17:45] 07Puppet, 06Infrastructure-Foundations: Puppetmaster volatile data not synced to all puppet frontends for a month and a half (2024-04-27 to 2024-06-10) - https://phabricator.wikimedia.org/T367113#9877330 (10CDanis)
[20:56:24] 07Puppet, 06Infrastructure-Foundations: Install a default timeout for systemd::timer::jobs - https://phabricator.wikimedia.org/T367119 (10CDanis) 03NEW
[20:57:28] would appreciate any input on T367119 :)
[20:57:29] T367119: Install a default timeout for systemd::timer::jobs - https://phabricator.wikimedia.org/T367119
[21:11:13] will do cdanis, thanks
[22:07:18] 10Mail, 10fundraising-tech-ops, 06Infrastructure-Foundations: Update fundraising mail settings to use new production mx hosts - https://phabricator.wikimedia.org/T366740#9877740 (10Dwisehaupt) @jhathaway Thanks. I have shifted our codfw hosts to use the new mx-out hosts. That is the secondary datacenter. We'...
[22:12:24] 10Mail, 10fundraising-tech-ops, 06Infrastructure-Foundations: Update fundraising mail settings to use new production mx hosts - https://phabricator.wikimedia.org/T366740#9877747 (10jhathaway) >>! In T366740#9877740, @Dwisehaupt wrote: > @jhathaway Thanks. I have shifted our codfw hosts to use the new mx-out...
[23:23:45] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed