[12:01:06] kind reminder that the pad for today's meeting is up :-P
[12:04:39] netops, Continuous-Integration-Infrastructure, DC-Ops, Infrastructure-Foundations, serviceops: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ) - https://phabricator.wikimedia.org/T283582 (hashar) That is still happening from time to time. Any person or team I can raise th...
[12:48:05] moritzm: FYI I'm looking at the logs for testvm2002 and so far it's weirdness all the way down
[12:57:46] "good" I guess?
[12:59:24] not really :) nothing adds up so far
[13:03:18] I'll join as soon as google meet lets me log in
[15:25:08] SRE-tools, Infrastructure-Foundations: Cookbooks: convert wmf-auto-reimage scripts to Cookbooks - https://phabricator.wikimedia.org/T205885 (ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by volans@cumin1001 for host mw1414.eqiad.wmnet
[15:29:58] I've merged the patch to filter mx2001, but 'homer "cr*" diff' on cumin2002 is giving me a traceback, is that cosmetic/safe to proceed or could I proceed with the "merge" nonetheless?
[15:29:59] https://phabricator.wikimedia.org/P17267
[15:31:54] moritzm: looking
[15:31:58] thx
[15:32:27] no, it's not normal
[15:32:58] I'm triggering all the edge cases today :-)
[15:33:10] XioNoX: anything changed recently? fails to get 'inventory'...
[15:34:25] unrelated, you can run homer as your user ;)
[15:34:51] I'm checking
[15:35:28] ah, i didn't know
[15:37:01] moritzm: I can't repro
[15:37:06] Homer run completed successfully on 13 devices
[15:37:11] Changes for 6 devices: ['cr2-esams.wikimedia.org', 'cr3-eqsin.wikimedia.org', 'cr3-esams.wikimedia.org', 'cr3-knams.wikimedia.org', 'cr3-ulsfo.wikimedia.org', 'cr4-ulsfo.wikimedia.org']
[15:37:57] could you retry? also you can directly run commit, it will show you the diff and ask for confirmation anyway
[15:38:33] maybe a brief network issue connecting to netbox?
retrying now
[15:38:45] thx
[15:42:35] also worked on a second attempt, merging now
[15:42:54] I'll quickly check the netbox logs
[15:43:05] ack, thx
[15:44:24] moritzm: do you have a timestamp of the above exception?
[15:45:24] I only pasted what was printed on screen, but let me check the homer logs on cumin2002
[15:45:40] ahem...
[15:46:39] not sure we've ever added them in the end... or I'm misremembering
[15:46:46] nor can I recall why
[15:47:22] that explains why I can't find anything on cumin2002 :-)
[15:51:36] nothing obvious in the netbox logs, there was a restart but after AFAICT
[15:52:05] actually not a full restart, just workers kill+respawn
[15:52:10] and that should be transparent
[15:58:54] moritzm: when you have a minute, I have a couple of questions for the other weird issue of today :)
[15:59:38] shoot :-)
[16:00:00] so the reason you got nothing to commit is that the VM was not deleted from netbox until a few minutes later
[16:00:17] now the cookbook starts the systemd unit that should sync ganeti to netbox and delete it
[16:00:30] but in the logs of the unit (journalctl) there is no trace of such deletion
[16:00:45] but in syslog there is a run at the correct time, that deleted it from netbox
[16:00:51] how's that possible?
[16:00:58] if the systemd timer didn't start it
[16:01:03] and the cookbook neither
[16:01:14] why did it run at that time? and why outside of journalctl?
[16:01:27] good question!
[16:01:52] this is just the ganeti test cluster, I'll create the VM tomorrow and we can try if we can repro?
[16:02:50] sure, but I doubt it will repro... just a feeling :)
[16:03:30] hehe, we'll see :-)
[16:03:51] ack
[16:06:36] SRE-tools, Infrastructure-Foundations: Cookbooks: convert wmf-auto-reimage scripts to Cookbooks - https://phabricator.wikimedia.org/T205885 (ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage completed: - mw1414 (**PASS**) - Downtimed on Icinga - Depooled the following services from conf...
[16:09:17] Mail, Infrastructure-Foundations, Patch-For-Review: Upgrade MXes to Bullseye - https://phabricator.wikimedia.org/T286911 (MoritzMuehlenhoff) mx2001 is now filtered on the routers, in case there are any issues, this can be reverted by merging https://gerrit.wikimedia.org/r/720783 and running 'homer "c...
[16:56:50] FYI I'll reimage sretest1002