[00:02:57] left comments on your puppet change, the -1 is just for the binary
[00:26:52] ryankemper: how's it going?
[00:27:46] legoktm: there's some context in #wikimedia-search, but it looks like we would need to also package the dependency `pthread_do` first, then package the script itself
[00:27:51] this binary will only be run on `elastic*` hosts owned by the search team; given that we're putting this in place on a temporary basis, do you think it's acceptable to ship the binary as-is?
[00:28:45] For context, the historical approach with this binary was opening a tmux session on each elastic* host to run it every 30 minutes, which would be a bit gross, so the idea was to get it actually in puppet (on a temporary basis) for transparency's sake (and to avoid any issues with a tmux session getting killed by some automated cleanup job or something... not sure if that happens tho)
[00:31:57] what / where is pthread_do?
[00:33:00] committing binaries to puppet is just sketchy in that it's generally hard to reproduce and externally verify that what's being shipped is what's intended and that the stuff you're linking against is installed on the host, and (less important) it bloats the repo size for everyone.
[00:33:27] I'll remove my -1 if you'd like, but I don't think it's a good idea
[00:34:02] found the pthread_do source, it was in the paste
[00:37:45] sorry, was making your other suggested changes
[00:38:11] legoktm: I believe that `pthread_do` is statically linked; are the concerns about linking centered around dynamic linking?
[00:43:46] totally agreed re the concerns about reproducibility / external verification btw. I'm hoping that, given this binary will be around for ~2 weeks tops, the tradeoff is acceptable. we've run this exact binary on a few hosts this last week so we know it does what we want (obv it could be doing "other bad things" in addition, but that's more of a theoretical concern imo)
[00:44:29] and again this is being run on hosts that do search stuff and search stuff only
[00:49:23] I think I have a Debian package working
[00:50:33] at the risk of being premature: \o/
[00:51:27] root@a1d6a49f868d:/src# dpkg -c ../elasticsearch-madvise_0.1_amd64.deb
[00:51:27] -rwxr-xr-x root/root 22888 2021-07-02 00:41 ./usr/bin/madvise
[00:52:46] ryankemper: this is all buster right?
[00:53:02] legoktm: yes
[00:58:17] I pushed it to https://gitlab.com/legoktm/es-thingy
[00:58:35] it's kind of terrible but it appears to build and work
[00:58:49] and by work I mean it spits out a usage message when you run it
[00:59:21] ryankemper: the name "elasticsearch-madvise" won't conflict with anything right?
[01:00:04] spits out usage message sounds right :P
[01:00:12] legoktm: yeah I can't imagine it conflicting with anything
[01:01:44] legoktm: okay so now that we have this repo w/ the makefile, I'll want to use `dh_make` to turn this into an actual deb yeah?
[01:02:41] ryankemper: oh, I already took care of that. building it on deneb right now
[01:02:51] ack, ty
[01:05:30] ryankemper: there's a deb at deneb.codfw.wmnet:/var/cache/pbuilder/result/buster-amd64/elasticsearch-madvise_0.1_amd64.deb, do you want to try installing that on a search host and make sure it works before I upload it to apt.wm.o?
[01:05:53] legoktm: yes, will report back in a few mins
[01:07:07] the executable should be installed at /usr/bin/elasticsearch-madvise
[01:14:11] legoktm: looks good. will take a couple minutes to see if the effect on io I expect happens, but the output looks correct; you're clear to proceed
[01:14:22] sweet
[01:17:22] ryankemper: ok, package "elasticsearch-madvise" is uploaded
[01:20:17] legoktm: thanks. stupid question time: in the `package` resource am I going to have to give it an `apt.wm.o` URL to grab the package from, or will it basically just look like:
[01:20:19] https://www.irccloud.com/pastebin/Usqk5b8L/
[01:20:36] nope, that's it
[01:21:40] you can also use: ensure_packages(['elasticsearch-madvise']), which works better when multiple manifests want to install the same package
[01:22:05] (but I don't think that's relevant here)
[01:26:41] ryankemper: I'm going afk to make dinner, ping me if you need help with anything else and I'll see it on my phone
[01:27:23] legoktm: thanks for all the help. just finished switching to the apt package / moved the location of the wrapper file so it's not in `files/cirrus` anymore
[01:28:08] legoktm: could use one last review if you get the chance: https://gerrit.wikimedia.org/r/c/operations/puppet/+/702754 but go eat first!
[01:36:04] ryankemper: lgtm!
[01:36:30] ty!
[02:23:32] Just to circle back, looks like puppet was having trouble finding the package https://www.irccloud.com/pastebin/MtRIeuVs/
[02:28:36] legoktm: Ah, these hosts are on stretch, not buster /facepalm... totally my mistake
[02:31:10] Gonna take a swing at building for `stretch` on `deneb`
[02:36:45] ryankemper: were you able to build it?
[02:37:37] legoktm: almost. rusty on debian changelogs. Do I need to change `elasticsearch-madvise (0.1) buster-wikimedia; urgency=medium` -> `elasticsearch-madvise (0.1) stretch-wikimedia; urgency=medium`?
[02:37:43] yeah
[02:37:54] then
[02:37:57] gbp buildpackage -sa -uc -us --git-dist=stretch-wikimedia
[02:37:58] then `gbp buildpackage -sa -uc -us --git-dist=stretch-wikimedia` yeah?
[02:38:02] great
[02:38:42] https://wikitech.wikimedia.org/wiki/Reprepro has the rsync command (or peek in my apt1001 bash_history)
[02:40:55] legoktm: `lintian` is complaining about an incorrect version number. Do I need to stack a second change on the changelog, or should I just bump the version up?
[02:41:10] https://www.irccloud.com/pastebin/wI7vFUCZ/
[02:41:54] This is what it looks like currently after my change. `/var/cache/pbuilder/result/stretch-amd64/` didn't change at all
[02:43:54] here's the full output of the gbp run https://www.irccloud.com/pastebin/cneR7t9M/gbp%20buildpackage%20-sa%20-uc%20-us%20--git-dist%3Dstretch-wikimedia
[02:53:36] Give me 5 min
[03:00:59] ryankemper: ok, I built elasticsearch-madvise_0.1~deb9u1_amd64.deb
[03:01:02] uploading it now
[03:01:23] legoktm: many thanks
[03:01:29] the main problem was that the debhelper version was >= 11, and stretch only has 10
[03:02:03] ah
[03:02:16] uploaded
[03:14:39] legoktm: thanks again for helping me out this late. package is working perfectly :)
[03:15:01] :)) awesome
[03:15:06] Happy to help!
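The irccloud paste with the proposed `package` resource isn't preserved in the log; as a minimal sketch of what that manifest snippet likely looked like (the class name and surrounding structure are assumptions, only the package name comes from the discussion):

```puppet
# Minimal sketch only, not the actual operations/puppet change (702754).
# The package is installed from the already-configured apt.wikimedia.org
# repository, so no explicit URL is needed in the resource.
class profile::elasticsearch::madvise {
    package { 'elasticsearch-madvise':
        ensure => present,
    }

    # Alternative from puppetlabs-stdlib, handy when several manifests
    # want the same package installed:
    # ensure_packages(['elasticsearch-madvise'])
}
```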
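The changelog paste isn't preserved either, but from the final artifact name (elasticsearch-madvise_0.1~deb9u1_amd64.deb) the stretch entry presumably ended up looking roughly like the following; the maintainer line, date, and bullet text are placeholders, not the real entry:

```
elasticsearch-madvise (0.1~deb9u1) stretch-wikimedia; urgency=medium

  * Rebuild for stretch-wikimedia.
  * Relax the debhelper requirement (stretch ships debhelper 10, not >= 11).

 -- Placeholder Maintainer <placeholder@example.org>  Fri, 02 Jul 2021 03:00:00 +0000
```

The package was then rebuilt with the `gbp buildpackage -sa -uc -us --git-dist=stretch-wikimedia` invocation quoted above.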
[08:49:20] hi, can we get permission to emergency backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/702808 ?
[08:50:00] (low-grade emergency: there is a bug in some opt-in logic so all new users get opted out of the feature, but the next normal backport window is in ten days...)
[08:55:42] tgr: hi! https://phabricator.wikimedia.org/T285996#7192814 looks like a reasonable motivation to me, and the fix seems very limited in scope, but let's seek some other option just in case
[08:56:22] hashar: o/ thoughts about --^? (I'll try to ping service ops too)
[08:57:27] (pinged)
[08:58:18] other than not deploying, I'm not sure there are other options
[08:59:28] <_joe_> +1 let's deploy it
[08:59:48] perfect
[09:00:09] <_joe_> as long as we do so now and not later in the day
[09:00:51] tgr: yep, let's do it now if you have time
[09:01:39] thx, doing
[09:03:56] thanks all
[09:04:14] tgr: I can help with verification w/ mwdebug
[09:09:55] elukey: my rule of thumb is that developers know best whether it is sane to deploy on a Friday or not, cause they know how broken something is and the potential impact a fix might have :]
[09:10:29] so yeah +1 on deploying :]
[09:40:37] deployment done, thanks!
[09:48:12] FWIW there is another emergency backport request in -operations.
[09:50:25] yep, just answered it
[13:53:57] marostegui: We're planning to upgrade all the eqiad rows over the last two weeks of July (as per discussion on the task).
[13:54:24] If it helps we could maybe switch row A from the first day (Tues 20th) to the last (Thurs 29th)?
[13:54:29] Just a thought.
[13:56:37] topranks: we had no idea that was coming, and we did ask about any network maintenance plans to accommodate our already existing maintenance
[13:57:04] topranks: maybe leaving row A for the last one can be useful, I will come back to you on Monday
[13:57:44] For our misc services (the only thing that isn't running in codfw), row A is the most impacted
[13:57:56] ok thanks. Our apologies it came up only recently, on the back of problem reports about the backup speeds.
[13:58:18] I think the 20th for row A will definitely be hard for us to get ready for
[13:58:19] We probably should have scheduled earlier, but we were trying to get the other kit installed to have better guidance on the impact before doing so.
[13:58:54] Noted; in future we will make a better effort to make people aware in advance.
[13:59:14] I will get back to you on Monday so we can discuss row order if you like
[13:59:31] but leaving A for the last one is definitely something that we'd appreciate
[14:01:21] I don't think it makes a difference to us, so I will swap the planned times for row D and row A.
[14:01:32] let me get back to you on Monday
[14:01:55] so I can double check
[14:02:54] cool
[14:03:00] * marostegui will be working next week and off the following
[14:05:14] I'll be on here and there next week so I can work through it with you no probs.
[14:06:50] https://twitter.com/jhanikhil/status/1410713976695144450
[14:09:10] marostegui, topranks, I'm sure we can push the switchback to as late as we need to do our maintenances without rushing nor stepping on each other's toes
[14:11:39] big 👍 to giving us enough time to comfortably do all the needed maintenance before we switch back
[14:13:51] topranks XioNoX I would like to start with rows that have no services impacted so we can also measure how long the downtime is
[14:15:18] marostegui: no probs, I'll have the tasks for the other rows with the lists for those done today and we can assess which are the lowest risk.
[14:20:59] sounds good
[14:47:17] i'm running decom for a host (dbstore1004), and got to the "Generating the DNS records from Netbox data" step. it's got a bunch of unrelated changes in the diff...
[14:48:34] some of it appears to be T286044
[14:48:35] T286044: remove payments100[1-4] from service and prep for decom - https://phabricator.wikimedia.org/T286044
[14:50:25] ok, it's all related to that. there are multiple hostnames for those hosts, apparently.
[14:56:29] topranks: when you say all the eqiad rows... how does this impact services that only exist in eqiad?
[14:57:09] It will or won't affect them to the extent they are resilient across rows, if that makes sense.
[14:57:36] So we are doing the rows on different days. Servers in different rows will not be affected simultaneously.
[14:58:47] So if a service runs on boxes in row A and row B, and is set up so that it can fail over or clients retry the alternate or whatever, then it shouldn't be affected.
[14:58:55] ok, well for example, if it impacts a row with a snapshot or dumpsdata host, it needs to be scheduled, can you coordinate with me about that?
[14:59:14] I suppose the same is true for the labstore1006 and 1007 hosts, but wmcs folks will need to discuss that
[14:59:49] If a service only runs on servers in a single row it will definitely be affected. It would also affect services spread across rows that don't have some mechanism to fail over.
[15:00:26] that's why we need coordination and scheduling for some things
[15:01:24] is there a task I should be watching?
[15:01:27] apergos: no problem, yep, I'm just completing the other tasks and indeed the aim is to coordinate with everyone. As Arzhel said, if folks need more time to make plans it should be no problem to push the changes out a bit.
[15:01:43] ok that would be awesome
[15:01:45] I've just got the one for Row A up now (note date subject to change)
[15:01:50] gotcha
[15:02:00] is there an umbrella one for all these?
[15:02:17] otherwise if there's a place to look to see new ones I can just stalk that
[15:02:25] Row A: https://phabricator.wikimedia.org/T286032
[15:02:36] Umbrella: https://phabricator.wikimedia.org/T284592
[15:03:08] I'm adding the VMs on the Ganeti hosts to the row A task now, someone suggested that I'd left them out.
[15:03:25] ah perfect, I'll stalk the second one
[15:06:03] XioNoX: hii. i'm trying to rename a host, and the decom step to update homer didn't, uh, happen. is there some way to do this manually but also in a way i can't fuck it up?
[15:07:08] kormat: are you following https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging ?
[15:07:12] volans: yep
[15:07:51] and what's failing / didn't happen?
[15:07:54] volans: when i ran the decom cookbook the first time, i must have sent some invalid input at the homer prompt while trying to copy a dns diff for the task i mentioned above, so the cookbook aborted. re-running it doesn't do the homer step
[15:08:27] kormat: what's the server?
[15:08:32] https://phabricator.wikimedia.org/P16761
[15:08:35] XioNoX: dbstore1004
[15:09:04] I guess that running homer for the related switch should be enough, but I'll leave you in the hands of XioNoX ;)
[15:09:10] as I should be... off
[15:09:33] yep exactly
[15:10:23] ok cool. running diff against `"asw2-b-eqiad"` currently
[15:10:30] let's see if the result looks anything like the original diff :)
[15:10:45] eh, I'm running it too
[15:11:11] double the efficiency?
[15:11:16] you will need the asterisk btw: "asw2-b-eqiad*"
[15:11:43] ah, i had it, just failed to reproduce it for irc above. it failed anyway with the conf db locked
[15:11:44] kormat https://www.irccloud.com/pastebin/sTeRcSrx/
[15:12:05] eh, I won the race condition
[15:12:09] :)
[15:12:17] you're all set
[15:12:19] XioNoX: that matches the expected result
[15:12:30] as in you're running commit, or i should?
[15:12:37] kormat: I did
[15:12:50] ah hah. thanks! 💜
[15:15:36] * kormat facepalms
[15:15:47] i just deleted the host from netbox. not the interfaces. fuuu.
[15:16:25] i don't suppose there's an undo button somewhere?
[15:18:04] https://netbox.wikimedia.org/extras/changelog/61717/
[15:32:37] * volans|off here
[15:32:42] kormat: let me read the backlog
[15:33:00] there's just 4 lines of relevant stuff
[15:33:21] so here's the changelog https://netbox.wikimedia.org/extras/changelog/
[15:33:58] unfortunately netbox doesn't have an undo button, but we can do various things
[15:34:20] either re-create from the data there, or pick the data from the backups; in both cases it's not a great experience, let me help you
[15:35:00] we also have netbox-next with stale data fwiw ;)
[15:35:30] kormat: should I re-create it as dbstore1004 or with the new name?
[15:35:48] volans|off: the new name probably makes sense (db1183)
[15:35:52] T284622
[15:35:53] T284622: Rename dbstore1004 to db1183 and place it on m5 - https://phabricator.wikimedia.org/T284622
[15:40:22] * volans|off doing
[15:43:30] ok, we have a slight problem
[15:43:50] dragonfly-supernode1001 was provisioned in between, so db1183's previous IP was assigned to it
[15:44:09] kormat: does db1183 need to have 10.64.16.26/22?
[15:44:14] nope, don't care
[15:44:24] it's not in dbctl and such?
[15:44:32] or puppet or elsewhere
[15:44:41] i'll fix it where needed,
[15:44:46] ok
[15:44:52] but it won't be in dbctl, and i doubt it's in puppet
[15:45:11] yeah, not in puppet either
[15:45:28] ack
[15:46:19] kormat: should AAAA DNS records be created or not?
[15:46:25] I guess usually not for dbs, but double checking
[15:46:34] no, correct
[15:47:58] ok, https://netbox.wikimedia.org/dcim/devices/3451/ should be good to go I guess
[15:48:37] I did the "Run the interface_automation.ProvisionServerNetwork" part of the procedure too
[15:48:48] kormat: so you should be able to resume from "Run the sre.dns.netbox cookbook"
[15:49:06] volans|off: awesome! thank you so very very much 💜
[15:49:24] i _almost_ regret calling you back to work to help
[15:50:09] no worries, errors happen, and I wish Netbox had an easier way to revert changes; the request has been raised upstream too
[17:01:25] +1000 on that volan.s... made some real f--- ups in my time via the netbox API. But don't worry, I'm using a RO key these days in WMF ;)
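For completeness, the manual homer run that stood in for the skipped decom step would look roughly like this; the exact invocation and commit message aren't in the log, so treat this as an assumption rather than a transcript:

```
# Diff, then commit, the generated config for the relevant top-of-rack
# switch; the trailing asterisk is required, as noted above.
homer 'asw2-b-eqiad*' diff
homer 'asw2-b-eqiad*' commit 'Decommission dbstore1004 ahead of rename - T284622'
```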