[09:06:00] volans: re: that CI failure on my CR from yesterday, have you any idea what would be creating a top-level 'build/' dir in CI? that hasn't been an issue up till now
[09:06:59] ohh. hum. it seems pip creates that dir. sometimes?
[09:07:38] kormat: as I commented, I think that having mypy . is not useful, as the only thing you want to check is wmfmariadbpy/ AFAICT
[09:07:55] i did read your comment, promise :)
[09:08:00] :D
[09:08:06] I couldn't know for sure
[09:08:52] as for build, yes, it's usually managed by pip
[09:08:54] behaviour changed in pip 20.1
[09:12:37] kinda wish i had nightly builds so i could detect when the CI env has updated in a way that breaks things
[09:15:20] I've wanted those since day 1 fwiw
[09:15:41] CI can break with any dependency upgrade randomly
[09:16:20] i'd also like the CI env not to be a moving target
[09:18:31] you have too many wishes at once ;)
[09:23:18] clearly. :)
[09:42:04] may I interest someone in switching doc.wikimedia.org from its stretch backend to one running a newer OS? https://wikitech.wikimedia.org/wiki/Doc.wikimedia.org#Switch_primary_host, I have patches prepared: https://gerrit.wikimedia.org/r/c/operations/dns/+/744762/ https://gerrit.wikimedia.org/r/c/operations/puppet/+/744763/
[10:25:11] why is python packaging _so_ bad? https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=955414
[10:31:32] kormat: why is debian packaging _so_ bad?
[10:31:49] the badness of python packaging is not debian-specific
[10:35:17] I think they build on top of each other xd
[10:37:16] dcaro: i think you have a point there, yeah :)
[10:37:47] at least something does build *g*
[10:40:42] anyway, I'm honestly impressed that such an old workflow (20 years?) has survived to today. Will it survive another 20 years? who knows
[10:41:55] the Lindy effect says it probably will
[10:42:29] * arturo searches for that in a certain encyclopedia
[10:43:20] https://en.wikipedia.org/wiki/Lindy_effect
[12:32:58] so I have a really quick cookbook to run, but I need to run it a lot, and it spams the tickets a lot. Is there a way to tell cumin to not log both start and end (just one), or to put the summary in the second one?
[12:33:36] https://phabricator.wikimedia.org/T277354#7543679 here is an example
[12:33:49] it needs to run on around 100ish hosts
[12:34:18] <_joe_> it needs to run sequentially or you need to confirm before proceeding?
[12:34:46] <_joe_> because the natural solution would seem to be to allow the cookbook to operate on many hosts.
[12:34:58] it has to be done sequentially
[12:35:12] we can't depool many replicas at the same time
[12:38:45] after talking to Manuel, we'll just not add the ticket to the comment; it immediately depools them and that has the ticket, so you can easily make the connection in SAL
[12:39:22] if you log once before starting them all and put the ticket in that log entry, surely that is good enough (and maybe once at the end)
[12:55:28] Amir1: use the hidden flag --dont-tell-volans to suppress that ;)
[12:56:00] :P :P
[12:57:23] Amir1: the right solution is to convert your script to a cookbook that will do all the work and SAL only once or when you need :D
[12:57:44] the downtime cookbook can operate on as many hosts as you want and also downtime only some services and not the whole host if that helps
[12:57:47] it's not a script, it's a framework :P
[12:58:10] *icinga services
[12:58:43] that's not the issue. It has to be done one by one and might take a day or two to finish
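As a sketch of what volans suggests at 12:57:23 (one cookbook doing all the work, so SAL gets a single start/end entry for the whole batch), assuming the class-based Spicerack cookbook API; the echo placeholders for depool/ALTER/repool are illustrative only and not the real wmfmariadbpy/dbctl tooling:

```python
"""Sketch: run a schema change one replica at a time, with a single SAL entry per batch."""
import argparse

from spicerack.cookbook import CookbookBase, CookbookRunnerBase


class SequentialSchemaChange(CookbookBase):
    """Depool, alter and repool replicas one by one."""

    def argument_parser(self):
        parser = argparse.ArgumentParser(description=self.__doc__)
        parser.add_argument('query', help='Cumin query selecting the replicas to alter.')
        parser.add_argument('task_id', help='Phabricator task, e.g. T277354.')
        return parser

    def get_runner(self, args):
        return SequentialSchemaChangeRunner(args, self.spicerack)


class SequentialSchemaChangeRunner(CookbookRunnerBase):
    """Runner; the cookbook framework logs START/END around run() once."""

    def __init__(self, args, spicerack):
        self.remote = spicerack.remote()
        self.replicas = self.remote.query(args.query)
        self.task_id = args.task_id

    @property
    def runtime_description(self):
        # This string ends up in the single start/end SAL entries, so the
        # task ID shows up once per batch instead of once per host.
        return 'schema change on {} replicas ({})'.format(len(self.replicas.hosts), self.task_id)

    def run(self):
        for hostname in self.replicas.hosts:
            host = self.remote.query(hostname)
            # Placeholder commands: the real depool/alter/repool would go
            # through the proper dbctl / wmfmariadbpy tooling instead.
            host.run_sync('echo depool')
            host.run_sync('echo run the ALTER TABLE here')
            host.run_sync('echo repool')
```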
[12:58:52] if the alter tables are heavy, possibly a week or more
[13:00:13] depending on how hacky you want to be there are various options
[13:01:58] Amir1: it's 100 times but 1 per day?
[13:02:31] one per hour if the alter table is fast, more if it's not
[13:03:07] if it takes two hours for the alter table, then it'll take 1 per three hours, but 100 times (300 hours runtime)
[13:03:56] the SAL won't be too noisy, but it turns the ticket into this: https://phabricator.wikimedia.org/T277354
[13:05:34] for now, I think it's fine to just remove the ticket from the downtime cookbook
[13:06:36] ack, that's just based on what args you pass to it; if you want to chat later on better integration I'm always available
[13:09:21] Thanks. Sure.
[14:41:50] topranks: thank you for your persistence with cloudvirt1028 -- i had no idea it was going to be such a mystery. My offer still stands to provide you with a second test host if/when you start to suspect that this is host-specific rather than a networking thing.
[15:06:19] andrewbogott: Was lucky I did, as it turns out to be a network thing
[15:06:30] I worked it out, in the process of documenting it and a fix
[15:06:37] awesome!
[15:06:41] I think it's probably a race condition we haven't hit before.
[15:07:25] Basically it's the issue from this comment, and the one below in the thread:
[15:07:25] https://phabricator.wikimedia.org/T296906#7546702
[15:07:54] My knowledge of the DHCP message flow was a little rusty, but after reviewing I could see the DHCP ACKs were not getting back to the host.
[15:08:07] They're being dropped by cr2-eqiad.
[15:08:31] that would do it
[15:08:48] And I think what is different here is that, due to a race condition, the requests that router forwards are always hitting install1003 first
[15:09:21] Anyway the fix is to update our filters so those replies aren't blocked, should get it done today
[15:09:40] thanks!
[15:09:52] yes, thank you!
[15:10:50] volans: I was thrown off as DHCPd response to every DHCP DISCOVER (relayed via either cr1 or cr2), but if duplicate DHCP REQUESTs hit it, it will only respond to the first.
[15:11:02] *responds
[15:11:33] So the DHCP OFFER from install1003 reaches the host, but not the DHCP ACK.
[15:11:50] so basically depending on the row of the reimaged host it would always work or never work?
[15:12:29] or based on some other routing preferences that would pass through cr2 instead of cr1
[15:12:36] I have to check, but I think our filters may be different for the regular row subnets, so this issue may not exist there.
[15:12:47] ack
[15:12:50] In this case for cloudsw I believe it may be due to which cloudsw the host is connected to
[15:13:08] If it's connected to one which has a direct connection to CR1, probably the broadcast hits CR1 first and it works.
[15:14:36] In this case cloudvirt1028 is connected to cloudsw1-d5-eqiad, which is directly connected to CR2.
[15:15:00] which explains why the relay from CR2 always hits install1003 slightly before the relay from CR1 (they both relay as they both receive the broadcast)
[15:17:05] "similar ETA"™ :-P
[15:17:52] lol... now cloudvirt1028 is stuck in traffic for 3 days getting home :)
[15:18:47] ahahha
[15:55:47] XioNoX: sorry, let's move here
[15:56:07] reposting my message
[15:56:10] XioNoX: ok, since it's all done: what happened was that I renamed an existing vip_fqdn and it created the new one but didn't remove the old one, which in hindsight I should have expected (?)
[15:56:16] sukhe: I'm wondering if puppet has a built-in mechanism for that
[15:56:36] not really
[15:56:42] ensure that nothing other than X exists
[15:56:47] you need to ensure=>absent the old thing as you create the new thing
[15:56:53] there are some mechanisms for it for some kinds of resources
[15:57:29] for instance https://puppet.com/docs/puppet/5.5/types/file.html#file-attribute-purge
[15:58:12] sukhe: did it break anycast-hc? or is there a failsafe that keeps it running?
[15:58:24] XioNoX: yep, it broke it
[15:58:27] ah
[15:58:28] that's why durum went down :D
[15:58:38] haha
[15:58:41] then I manually removed the file (the older one) and it resolved
[15:59:02] but now that I think about it, it may end up being a common oversight, so at least I will document it till we can figure out a way to resolve it
[15:59:03] it could maybe be an upstream feature request, check the config before restart
[15:59:19] or load 1 service and don't quit if there is a 2nd service with the same IP
[15:59:34] yeah definitely, a hard-fail isn't ideal
[15:59:55] sukhe: a big warning will help, but there must be something we can do in the code to prevent this kind of error
[16:01:00] XioNoX: yeah, I think the easiest is probably on the anycast-hc side since I am not sure if we can leverage Puppet for it, at least based on what cdanis said
[16:01:14] maybe jbond knows of a better way but I think even then, doing this in anycast-hc feels cleaner
[16:02:31] or implement our own failsafe, like before restarting the process, do a grep for each resource in the config folder and don't restart it if there is more than 1 result
[16:03:38] sukhe: https://github.com/unixsurfer/anycast_healthchecker#starting-anycast-healthchecker there is a "--check" option
[16:03:59] oh interesting, we can call this from validate_cmd
[16:04:02] (Puppet)
[16:04:06] yeah exactly!
[16:05:29] ah good
[16:05:36] that will make the puppet run fail instead of making the service fail
[16:06:06] yep (in a meeting, I will submit a CR for this after!)
[16:49:57] hm, so I tried simulating if --check would pick up the duplicate /32s, and it seems like it doesn't
[16:50:04] the code confirms that it doesn't check for that
[16:50:26] so while we should add validate_cmd, this will need to be addressed separately
[16:50:29] that's fine
[17:23:39] [#mediawiki_security] 17:21 hey graphite1004 is down, see -operations
[20:23:47] topranks: want to hand 1028 back over to me or are you feeling attached by now?
[20:24:07] you'll need to wrestle it from my cold dead hands :D
[20:24:27] nah, happy to hand it back, I just updated the task there though, I have the console open in front of me (screenshot in task)
[20:24:42] If you want me to hit "yes" for that to allow it to proceed I can do that no probs
[20:26:31] yep, just hit yes and if it completes we'll save partman for when we have another couple days to burn
[20:27:22] ok I've done that, it's giving another similar message about LVM volumes etc., I'll do the same on that
[20:27:28] ok
[20:27:31] I forget, but did we have cookbooks for setting the mgmt password?
[20:27:45] or was it still the bash scripts I was once involved in writing
[20:28:17] mutante: not 100%, the re-image cookbook asks for the mgmt (iDRAC) password when you run it
[20:28:22] presumably to request the PXE boot
[20:28:33] mutante: what do you need to do?
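Going back to the anycast-healthchecker gap noted above (--check does not flag duplicate /32s), a minimal sketch of the grep-style failsafe XioNoX describes at 16:02:31, usable from a Puppet validate_cmd or an ExecStartPre. The config directory path and the 'ip_prefix' option name are assumptions based on upstream's example ini-style service files, not the WMF layout:

```python
"""Sketch: refuse to restart anycast-healthchecker if two service configs claim the same prefix."""
import configparser
import glob
import sys
from collections import defaultdict

# Assumption: service checks live as ini-style files in this directory,
# each section carrying an 'ip_prefix' option (as in upstream's examples).
CONFIG_DIR = '/etc/anycast-healthchecker.d'


def find_duplicate_prefixes(config_dir=CONFIG_DIR):
    """Return a mapping of prefix -> [service sections] for prefixes used more than once."""
    seen = defaultdict(list)
    for path in glob.glob('{}/*.conf'.format(config_dir)):
        parser = configparser.ConfigParser()
        parser.read(path)
        for section in parser.sections():
            prefix = parser.get(section, 'ip_prefix', fallback=None)
            if prefix:
                seen[prefix].append('{} ({})'.format(section, path))
    return {prefix: users for prefix, users in seen.items() if len(users) > 1}


if __name__ == '__main__':
    duplicates = find_duplicate_prefixes()
    for prefix, users in duplicates.items():
        print('duplicate ip_prefix {}: {}'.format(prefix, ', '.join(users)), file=sys.stderr)
    # A non-zero exit makes the puppet run (or the restart) fail instead of the daemon.
    sys.exit(1 if duplicates else 0)
```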
[20:28:49] volans: know if I should delete the scripts I once made for that or not yet :)
[20:28:52] just that
[20:29:00] it's just about the "unused classes" ticket
[20:29:21] once when the mgmt password was leaked
[20:29:28] there is sre.hosts.ipmi-password-reset
[20:29:31] we made those bash scripts to change it on everything at once
[20:29:41] it took all the mgmt names from the DNS repo to begin with
[20:29:47] then detected the vendor from the SSH banner
[20:29:53] can be trashed AFAICT
[20:29:58] then used the change password command for the right vendor
[20:30:12] can we run that on all at once if we had to for some reason? like if it was leaked again?
[20:30:30] accepts a cumin query as hosts selection
[20:30:35] ok, cool!
[20:30:49] well, then I will trash this entire module. ok, thanks
[20:31:06] if you want, double-check with John, but that will have to wait till Thu.
[20:31:28] oh totally, will upload a bunch of changes but not merge today
[20:31:31] sounds good
[20:33:42] did I indirectly ping you because you have a highlight for the string "cookbook"? :)
[20:34:13] * volans can't confirm nor deny
[20:34:19] lol, ok :)
[20:35:49] andrewbogott: so the install completed successfully :)
[20:35:58] topranks: great! I'll have a look.
[20:36:17] Thank you for your very deep dive into the dhcp issue
[20:36:25] *but* the re-image cookbook failed, so not sure if it's in the right state :(
[20:36:33] https://www.irccloud.com/pastebin/snFFHATe/
[20:36:52] I suspect based on the error it's due to how long I left it on the partitioning screen.
[20:37:11] But it should work the next time I guess, if you need to give it another shot
[20:37:30] was the partitioning config fixed?
[20:38:19] from the reimage cookbook point of view the host never rebooted after the debian installer into the new OS
[20:38:46] it times out after ~20m IIRC
[20:38:52] topranks: ^
[20:39:28] yeah the partman config wasn't fixed, detail in the task.
[20:39:48] I manually hit enter to proceed, but it had been left on that screen for ages, so yep, timeout for the cookbook.
[20:40:12] ack, the partman config needs to be fixed in puppet to get it to succeed next time
[20:54:27] topranks: I can't ssh and also the console is occupied -- are you still connected?
[20:54:54] Actually yes, sorry
[20:55:05] Do you know off hand the keystrokes to detach?
[20:59:34] + \ it seems.
[20:59:42] I'm out now, fire away.
[21:12:37] thx
[21:14:18] seems like the root password also isn't set. going to reimage yet again and see what I can see
[21:15:29] I reckon it'll work ok on a re-image, the problem is the normal steps it does after install didn't happen cos of the timeout.
[21:53:39] does anyone here know what the plan and progress are for onhost memcached?
[21:53:59] (for hot values basically)
[21:56:57] Amir1: I know a little about it but I'm out of date -- for the current state I'd talk to e.ffie (who's out until April) and Krinkle
[21:57:38] rzl: thanks. my question is: is it deployed and used, and if not, any ETA?
[21:59:24] Amir1: It is enabled for ParserCache only (not WAN cache), and the plan for adopting it in WAN has been withdrawn for now. There are no plans to resume it at this point.
[22:00:14] noted, thanks.
[22:00:25] https://phabricator.wikimedia.org/T264604
[22:01:21] thanks
[22:01:37] let me read the last comments, those are interesting
[22:02:41] In a nutshell: blind TTL, as on-host does, is really hard to reconcile with Memcached and the way web apps like MW use/require consistency, atomicity, and reflection and self-change, as well as things like pre-emptive updates based on random chance ahead of expiry time. It became very complicated to make work and would have made it slower than before. The main motivation for all this was not speed, it was to avoid network congestion. That
[22:02:41] has since been resolved by taking the largest values (ParserCache), where the caller validates it at run-time anyway, since those have unique keys. And moreover there are 10G links now/soon making it mostly obsolete for a long time to come.
[22:03:51] (because until recently, app servers went into CPU overload during high load due to memc misses as a result of unreachable memc servers, as a result of too much traffic on those physical links in the specific way WMF has those connected)
[22:05:39] I see. I'm not sure the pre-emptive generation is really needed, or so many other complex parts of WANcache, but that's a different story
[22:06:20] Krinkle: some context: T297147. I want to add a cache, I feel I probably need to wrap an APCu around a WAN (with a longer TTL)
[22:06:20] T297147: RevisionStore::newRevisionSlots() needs a cache - https://phabricator.wikimedia.org/T297147
[22:06:40] And this ticket is also one reason behind edits being slow btw
[22:07:02] so I thought maybe it can be simply handled by onhost if deployed
[22:07:49] Amir1: right. ack on WAN, something I've been focussing on as well (reduce complexity), although pre-emptive is not on my list of things to cut; there are lots of things we could re-evaluate against current needs indeed.
[22:08:46] Amir1: I'll reply on task as well, but I vaguely recall a cache being there and being removed. It's worth checking how this code looked before 2020 to see if maybe it worked differently and see why it is how it is now. Maybe there's a better place to put caching, and/or something that already exists but isn't working correctly.
[22:08:52] let me know if you need any help on dropping code and simplifying logic, that's slowly becoming my specialty it seems
[22:09:32] yeah, I'm planning to ask Daniel but one thing is that it has a todo "we probably need cache" :D
[22:09:44] * Krinkle opens toolbox and presents two open hands to Amir1: one showing a butter knife, the other an axe.
[22:10:05] which pill will you take :D
[22:10:18] of course the axe :P
[22:10:39] I need to get some rest now but I'll be working on it more tomorrow
[22:10:46] k
[23:07:03] hm, well, now cloudvirt1028 is stuck forever on 'Attempt to run 'cookbooks.sre.hosts.reimage.ReimageRunner._populate_puppetdb..poll_puppetdb' raised: Nagios_host resource with title cloudvirt1028 not found yet'
[23:07:10] I am definitely cursed
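For context on that last error: the reimage cookbook is polling PuppetDB until the host's Nagios_host resource appears after the first puppet run. A rough, standalone way to check the same thing by hand, assuming PuppetDB's v4 query API; the URL/port is a placeholder and the cookbook's actual internals may differ:

```python
"""Sketch: check whether PuppetDB has the Nagios_host resource the reimage cookbook waits for."""
import json
import sys

import requests

# Assumption: PuppetDB's v4 query API is reachable here; adjust host/port as needed.
PUPPETDB_URL = 'http://localhost:8080/pdb/query/v4/resources'


def nagios_host_exists(hostname, url=PUPPETDB_URL):
    """Return True if a Nagios_host resource with the given title is in PuppetDB."""
    query = ['and', ['=', 'type', 'Nagios_host'], ['=', 'title', hostname]]
    response = requests.get(url, params={'query': json.dumps(query)}, timeout=10)
    response.raise_for_status()
    return len(response.json()) > 0


if __name__ == '__main__':
    host = sys.argv[1] if len(sys.argv) > 1 else 'cloudvirt1028'
    print('found' if nagios_host_exists(host) else 'not found yet')
```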