[08:56:33] SRE-tools, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 2 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (JMeybohm)
[08:56:39] SRE-tools, Infrastructure-Foundations, SRE, Spicerack, and 2 others: Spicerack cookbooks TODO list - https://phabricator.wikimedia.org/T203943 (JMeybohm)
[08:56:49] SRE-tools, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 3 others: Create a cookbook for depooling one or all services from one kubernetes cluster - https://phabricator.wikimedia.org/T260663 (JMeybohm) Open→Resolved Merged as `sre.k8s.pool-depool-cluster`
[09:03:02] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=927fadc1-f5b2-478f-95ce-98bfc47881a9) set by cmooney@cumin1001 for 2:00:00 on 3 host(s) and th...
[10:31:18] topranks: FYI I'm preparing a homer release with the latest things pending there. But I'll wait for your all clear before deploying it to make sure it doesn't affect the current router upgrade
[10:31:32] so no worries if you see patches flying by
[10:31:33] ;)
[10:31:48] volans: ok thanks
[10:32:11] I'm not sure if you need to hold off tbh, the upgrade is taking forever (there is a firmware upgrade requiring 3 reboots of each RE involved)
[10:32:22] I'm almost done cr1 but cr2 yet to go
[10:32:40] I can run homer from my laptop if somehow it breaks on cumin hosts, so I'd say you go ahead
[10:33:17] thanks, but I'm sure you'll be quicker than jenkins :D
[10:33:31] finally a race I might actually win!
[10:33:43] lol
[11:37:36] Mail, Infrastructure-Foundations, SRE: Wikipedia.org DMARC "rua" and "ruf" email addresses need verification - https://phabricator.wikimedia.org/T211401 (Jgreen) Open→Resolved a:Jgreen ;; ANSWER SECTION: w.wiki._report._dmarc.wikimedia.org.
3600 IN TXT "v=DMARC1;" ;; ANSWER SECTION: wiki...
[12:34:10] Mail, Infrastructure-Foundations, fundraising-tech-ops: Investigate in-house DMARC analysis tool options - https://phabricator.wikimedia.org/T317443 (Jgreen)
[14:20:37] volans: sorry that took so long, process is tedious in the extreme on those ones.
[14:21:04] I got sidetracked to various other things, did the release on the homer repo, not yet started on the deploy repo... :D
[14:21:10] so you won!
[14:21:14] haha :)
[14:25:21] volans: if you had a second could you +1 the re-pool of codfw: https://gerrit.wikimedia.org/r/c/operations/dns/+/831889
[14:25:29] (sorry to pick on you but given you're active)
[14:26:11] sure
[14:26:31] btw for those it's quicker to hit the 'revert' button in gerrit
[14:26:50] and also leave a trace of the depool + revert in both places
[14:26:58] cool thanks. and yeah true never sure what's best
[14:27:09] +1ed
[14:27:33] hmm ok - I was incorrectly thinking two patches would leave a better audit trail but good to know revert is actually better for that
[14:27:35] nice one :)
[14:27:54] the revert button on gerrit does create a new patch
[14:28:11] just links it to the previous and there is a revert: .... title
[14:29:02] like this one: https://gerrit.wikimedia.org/r/c/operations/dns/+/830575
[14:29:13] look who created it :-P
[14:40:35] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (cmooney) cr1-codfw and cr2-codfw successfully upgraded today. Took a while with the firmware upgrades too, I've added some notes [[https://wikitech.wikimedia.o...
[15:08:29] topranks: is now a good time to deploy homer?
[15:11:49] volans: yep now is good I won’t need it for the next while
[15:11:51] Thanks
[15:11:59] great, proceeding, thx
[15:22:17] topranks: have you ever seen this error?
[15:22:17] Failed to commit check on cr1-drmrs.wikimedia.org: 'utf-8' codec can't decode byte 0xe0 in position 10: invalid continuation byte
[15:22:39] is junos returning not-utf8 stuff?
[15:44:23] volans: i think XioNoX has hit this error before but it seemed transient
[15:58:34] jbond: ack, thx, it doesn't seem transient though
[15:58:44] and the CRs have been upgraded
[15:58:55] Sorry was afk
[15:59:17] checking any other upgraded cr
[15:59:20] I’ve not seen it before, no
[15:59:42] cr1-esams works fine
[15:59:53] Ulsfo, knams, eqsin MX204’s are all upgraded
[16:00:06] is drmrs different in some way?
[16:00:17] cr[1-2]-drmrs are both failing
[16:00:28] testing the other pops
[16:00:56] hmm
[16:01:05] usually a homer push is one of the last upgrade steps
[16:01:28] Commit check error on cr2-eqsin.wikimedia.org:
[16:01:29] Error accessing interface container object, may not be defined
[16:01:45] and the diff is huge
[16:01:54] deleting a whole bunch of stuff
[16:02:11] same for cr3-eqsin
[16:02:46] sorry I have to go AFK for ~30 minutes...
[16:02:55] 99% sure I would have run against cr3-eqsin after the upgrade to reset OSPF metrics
[16:03:01] no probs I'll do a few checks see if I can find anything
[16:03:25] ack, thanks, i'll catch up in a bit
[16:06:09] It worked ok from my laptop on cr1-drmrs
[16:06:10] https://phabricator.wikimedia.org/P34627
[16:37:24] topranks: ok I'm back
[16:38:52] ok... definitely have at least 2 issues I think, or maybe 1 issue manifesting itself in 2 odd ways
[16:38:59] tell me
[16:39:11] A bit of info in the paste above I've added a few replies
[16:39:50] netbox.device_plugin.junos_interfaces seems to be empty, generated config misses all the interface config
[16:40:03] I'm assuming that's different to the utf-8 / transport issue
[16:41:59] mmmh
[16:42:40] I think a possible approach might be to focus on the missing interface data - get it producing the correct config - then see if the transport issue persists.
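Editor's note on the utf-8 / transport error quoted at 15:22:17: the wording of that message is what Python produces when a byte such as 0xe0 (the lead byte of a three-byte UTF-8 sequence) is followed by a byte that is not a continuation byte. The sketch below only reproduces the message with a made-up payload; the byte string is hypothetical and nothing here shows what the router actually sent.

```python
# Hypothetical payload (not real router output): ten ASCII bytes, then 0xe0
# followed by plain ASCII, so the three-byte sequence that 0xe0 announces
# is never completed.
payload = b"0123456789" + b"\xe0" + b"bc"

try:
    payload.decode("utf-8")
    message = ""
except UnicodeDecodeError as exc:
    # Strict decoding raises at the offending byte offset.
    message = str(exc)

print(message)

# Decoding leniently substitutes U+FFFD for the bad byte, which helps locate
# it in a larger response without raising.
lenient = payload.decode("utf-8", errors="replace")
print(lenient)
```

Using errors="replace" is only a diagnostic aid in this sketch; it hides the underlying encoding mismatch rather than fixing it.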
[16:43:04] Transport one definitely looks odd, given it pulls the diff when running "homer commit", but throws an error when you do "homer diff" ??
[16:43:50] the sequence of commands is different between diff and commit
[16:44:49] we load the config and do a commit_check in the diff case and a commit in the commit case
[16:45:22] Do we not do a "commit_check" first in the commit case to produce the interactive diff for the user?
[16:45:39] (I'm sure you're right btw so ignore this red herring)
[16:45:54] we do self._device.cu.diff()
[16:45:59] to get the diff and group them
[16:46:13] *to be able to group them
[16:46:30] Ok. So no "commit check" in the "commit" case prior to it showing the user the diff.
[16:46:54] My guess would be something in the generated config is throwing the box off, it's erroring on "commit check"
[16:46:58] no, it does commit + commit_check though
[16:47:39] Ok... so prior to the user being shown the diff when you run "homer commit" it will have run "commit check" on the device?
[16:48:17] no, AFAICT not before showing the diff
[16:48:27] that makes some sense
[16:48:38] * volans trying to test it locally too
[16:48:43] why?
[16:48:52] what I suspect is happening is *slightly* invalid config is being loaded.
[16:48:59] Not enough to not load, but one that fails commit check
[16:49:05] due to the missing ifaces/
[16:49:10] ?
[16:49:13] lol
[16:49:23] actually yeah. You can't remove the built in mgmt interface I think.
[16:49:24] the right amount of invalid
[16:49:31] :-P
[16:49:41] But you can try to load that config, and it'll accept it at first cos it passes the syntax check
[16:49:50] will fail "commit check" cos of something like that
[16:50:01] with very clear and useful message
[16:50:12] So I think we're best to concentrate on getting the generated config correct, after which it hopefully will go away
[16:50:28] agree, thanks for the insight
[16:51:26] let me look at the repo see what changed that might have messed up the generation
[16:51:28] topranks: does https://gerrit.wikimedia.org/r/c/operations/software/homer/deploy/+/826559 need to be included?
[16:51:46] the only other one in gerrit is the old one from you: https://gerrit.wikimedia.org/r/c/operations/software/homer/deploy/+/769729
[16:53:21] there were all the replacements of old funcs with get_junos_interfaces
[16:56:26] The FHRP one doesn't need to be included no
[16:57:08] Neither does the port block one (reminder to self to merge that when all this is done for next release)
[16:57:49] ok
[16:58:02] then I guess something happened with the new get_junos_interfaces and the upgraded dependencies
[16:58:48] topranks: can I ask you to test something?
[16:59:04] yeah of course
[16:59:13] in your local env, do you run homer from the .tox/py3... environments?
[17:00:50] or install it via pip
[17:00:54] or other things :D
[17:01:16] "other things"
[17:01:19] one sec
[17:01:42] basically I'd like you to save your virtualenv, create a new one that has updated deps and see if you get the same errors as prod
[17:02:23] as in not generating the interfaces config
[17:02:45] but keep your existing one to be able to run homer if needed :D
[17:03:13] Yeah so when I first started I installed homer with pip.
[17:03:20] and never used any virtualenv
[17:04:24] Some time later (this is embarrassing) I removed the /usr/local/lib/python3.8/dist-packages/homer dir and symlinked it to a local copy of the repo
[17:04:40] ugh...
:D
[17:05:01] yeah, I've been meaning to sort it out. I learnt a lot about computers and how they work in the process :)
[17:06:29] * volans doing a live test on cumin1001
[17:08:03] ok so self.fetch_device_interfaces() is empty
[17:11:09] /srv/deployment/homer/venv-1663081981/lib/python3.9/site-packages/homer_plugins/wmf-netbox.py, on cumin, seems to be the same as I have locally on my laptop
[17:12:28] ok, the same API call works fine in isolation, debugging
[17:16:40] topranks: I might have a hunch
[17:16:42] testing it
[17:17:59] topranks: ok found it!
[17:18:05] woot!
[17:18:08] I'll send a patch for the wmf-plugin
[17:18:14] so what is it?
[17:18:45] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/homer/deploy/+/refs/heads/master/plugins/wmf-netbox.py#28
[17:18:55] we're storing the result of a pynetbox query
[17:19:14] but that query is executed only when called, and returns a generator
[17:19:19] once you've consumed it once... it's done
[17:19:22] it's empty
[17:19:45] so if we use fetch_device_interfaces() more than once, only the first time it returns the actual data
[17:20:01] that all makes sense
[17:20:12] I missed it in code review
[17:20:46] but my local copy has that same function same way, and we get all the interfaces
[17:21:15] maybe old pynetbox was returning already a list?
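Editor's note: the failure mode described between 17:18:55 and 17:19:45 (caching a lazily evaluated, one-shot result and then iterating it more than once) can be shown in isolation. This is a minimal sketch under stated assumptions, not the plugin's code or the actual patch: fake_query, CachesIterator and CachesList are hypothetical names, and a plain Python generator stands in for the pynetbox query result. Only the method name fetch_device_interfaces comes from the conversation.

```python
from typing import Dict, Iterator, List


def fake_query() -> Iterator[Dict[str, str]]:
    """Hypothetical stand-in for a lazy API query: rows are produced only when iterated."""
    yield {"name": "ge-0/0/0"}
    yield {"name": "ge-0/0/1"}


class CachesIterator:
    """Problem pattern: the one-shot iterator itself is cached and handed out again."""

    def __init__(self) -> None:
        self._interfaces = None

    def fetch_device_interfaces(self) -> Iterator[Dict[str, str]]:
        if self._interfaces is None:
            self._interfaces = fake_query()
        return self._interfaces


class CachesList:
    """One possible remedy: materialize the rows into a list before caching them."""

    def __init__(self) -> None:
        self._interfaces = None

    def fetch_device_interfaces(self) -> List[Dict[str, str]]:
        if self._interfaces is None:
            self._interfaces = [dict(row) for row in fake_query()]
        return self._interfaces


def names(plugin) -> List[str]:
    """Consume whatever fetch_device_interfaces() returns and collect the names."""
    return [row["name"] for row in plugin.fetch_device_interfaces()]


buggy = CachesIterator()
first_buggy = names(buggy)    # the first pass sees both interfaces
second_buggy = names(buggy)   # the cached iterator is already exhausted

fixed = CachesList()
first_fixed = names(fixed)
second_fixed = names(fixed)   # a list can be iterated any number of times

print(first_buggy, second_buggy)
print(first_fixed, second_fixed)
```

Materializing into a list costs memory proportional to the result size, which is usually acceptable for a per-device interface list; an alternative would be to re-run the query on every call instead of caching it.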
[17:21:28] ah perhaps yes
[17:23:25] Some ugly debugging locally here: TYPE IS:
[17:23:37] spot on that must be chat changed
[17:23:43] *what
[17:25:58] testing my local change quickly on cumin1001 to see if it fixes everything
[17:26:34] topranks: I get https://etherpad.wikimedia.org/p/volans-tmp2
[17:26:40] for cr2-drmrs diff
[17:27:09] FWIW I just upgraded Pynetbox locally and it's now doing the same thing (producing config with empty interfaces{})
[17:27:55] volans: ok that's working, OSPF interface costed out that I expect is an oversight from one of the router upgrades
[17:28:07] Don't push that to the router - try cr1-drmrs should be no diff I think
[17:28:24] I'll work out what's up with the diff for cr2-drmrs, probably safe to change
[17:29:22] https://gerrit.wikimedia.org/r/c/operations/software/homer/deploy/+/831936 for the fix
[17:29:43] I'm just doing diffs, I'll leave the commits to you :D
[17:29:45] testing cr1
[17:30:02] no diff
[17:30:32] You can push that commit to cr2-drmrs if you want to test a real commit?
[17:30:41] Otherwise leave it to me I'll do it shortly
[17:30:42] is that ok to push?
[17:30:45] yep
[17:30:50] ack doing
[17:31:49] committed successfully
[17:31:56] woot :)
[17:32:02] * volans checking homer's code if we do the same mistake
[17:33:15] no, all good there we either convert results on the fly (return [dict(i) for i in self._api....])
[17:33:26] or loop them inline: for foo in self._api.....
[17:33:44] I'll merge and deploy the fix for the plugin then
[17:34:18] and re-test a * diff
[17:35:00] topranks: I think you can consider yourself free to leave now ;) thanks a lot for the help! quickly finding the missing ifaces
[17:35:35] nice work!
[17:36:09] I pulled your updated wmf-plugin.py locally and homer is now working with the updated pynetbox :)
[17:36:16] nice
[17:45:04] plugin deployed, testing homer '*' diff
[17:48:21] * topranks holds his breath
[17:56:55] it's now at the CRs...
so far so good
[17:57:10] nice :)
[17:57:16] if you can hold your breath for a homer '*' diff that's a guinness world record :D
[17:57:57] hahaha
[18:02:23] mr1-codfw timed out, and that's what I actually wanted to test/debug...
[18:02:27] I guess that's for tomorrow now
[18:02:44] ah ok
[18:03:02] no diffs on the CRs?
[18:15:36] topranks: to close the loop, no diffs on the CRs, I got a diff for lsw1-e1-eqiad, lsw1-e2-eqiad, lsw1-f2-eqiad and then diffs for all mr1-* (that also timed out)
[18:28:59] Hmm ok
[18:29:22] I’ll have a look a little later thanks
[18:30:35] thanks
[18:30:41] I'll look at the mr* tomorrow
[18:59:13] LSW's diffs pushed. Not sure why but some interfaces weren't in the sflow{} section, even though they were enabled at the port level.
[19:00:17] Likely some quirk to do with when the templates were updated vs. when homer was run on them. Not worth losing time on I think.
[19:13:52] ack, thx