[08:54:29] inflatador: for your eyes, when you get a chance https://gerrit.wikimedia.org/r/c/operations/puppet/+/1128793
[09:37:24] ryankemper: ^ for your eyes too
[10:59:58] hi there, infra window is about to start but we have a train blocker fix we'd like to backport
[11:00:05] is any infra work scheduled for today?
[13:01:10] godog ACK, I added a few members of DPE SRE to take a look as well
[13:01:33] inflatador: sweet, thank you!
[13:01:49] np...can you explain the implications on Search, I'm not sure I get it at first glance
[13:03:05] inflatador: that's fair, tl;dr we are renaming mw logging kafka topics from udp_localhost.* to k8s-mw-.*, I'll clarify the commit message too
[13:04:53] {{done}}
[13:18:29] ACK, thanks
[13:19:17] unrelated, we are starting the prod Elastic->Opensearch migration in a few min. We'll begin in CODFW, so impact is not expected
[13:22:17] ref https://phabricator.wikimedia.org/T388610
[15:05:21] storcli package was updated and is causing widespread puppet failure. I can look shortly but in case you uploaded it, please take a look
[15:14:37] sukhe: I think moritzm is looking at that per T388628
[15:15:51] yes, sorry, should have followed up here
[16:28:30] hi! is anyone handling the puppet request window today? I have a simple patch that would be great to get out before the weekend
[16:46:25] tgr_: the person on clinic duty should be the best person to ping about non-emergency stuff and they may be able to send you to the right person (they are the "face" of SRE for the week)
[16:47:22] tgr_: I can help, also the clinic duty person this week so that works out. what's the patch?
[16:51:22] sukhe: thanks!
the patch is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1129349
[16:53:05] tgr_: that one is not a small/simple patch and it wasn't +1ed by the Traffic team :)
[16:55:35] it's literally a six character change
[16:57:12] in text-frontend's vcl :)
[16:57:26] note that I also added "simple"
[16:57:42] ok
[16:57:48] it changes two regular expressions to match multiple times
[16:57:52] tgr_: I have a meeting in 3 mins, and I can take it then
[16:57:53] does that work?
[16:57:57] (after that I meant)
[16:58:10] I will need to review it too
[16:58:41] tgr_: I agree, but it is nonetheless a VCL change that will be applied on all caching nodes
[16:58:46] sukhe: works for me any time in the EU day if it doesn't need to be in the puppet request time window
[16:59:09] tgr_: ok, so today is fine or not? because if not then that means Monday. running now but let me know here
[16:59:13] I am not against it, I am just suggesting to follow up with traffic beforehand for a rollout, not using the puppet windows
[17:02:13] sukhe: yeah I mean today any time is good (as long as it's not too much past UTC midnight)
[17:04:14] the lucky bit in this case is that sukhe is in Traffic and also willing to roll it out, but please be mindful the next time
[17:05:57] brett OK if I merge `Brett Cornwall: upgrade cp3068 to Varnish 7.1 (cafce4dac4)`?
[17:06:11] elukey: ack. What would be the way to coordinate? I want to document this somewhere on https://wikitech.wikimedia.org/wiki/Puppet_request_window (which to be fair does say that varnish patches are unlikely to be accepted)
[17:08:11] tgr_: in general something that is rolled out to a broader set of nodes, like caching etc.., should be reviewed by the SRE team that is responsible for them beforehand (so they know that it is happening etc..).
Even if the change is small it could cause a big outage, and to be honest I wouldn't be able to say "yes/no" right now without double checking with Traffic first
[17:08:31] for other kinds of patches, with a smaller blast radius, the puppet window is totally fine
[17:08:40] let me know if it makes sense
[17:09:13] yeah I get that, I just want to convert it into something actionable
[17:09:19] inflatador: Thanks, I did that before seeing your message. My b
[17:09:23] add people as reviewers? ask on IRC? file a task?
[17:09:45] brett np, merging now
[17:10:05] ty
[17:10:15] I'd say for VCL changes, add people from traffic as reviewers, for apache changes, people from serviceops, possibly to the task as well for visibility
[17:11:42] then coordinate from there, usually apache patches would probably be rolled out during Infrastructure deployment windows
[17:12:07] exactly
[17:13:10] sorry to be late for the puppet window, I had back-to-back meetings unusually -- but yes exactly what elukey and claime say :)
[17:13:28] definitely what rzl says
[17:13:43] the puppet window is for patches that any SRE can look at, without knowing anything about the file, and determine is trivially safe to merge
[17:13:57] almost no VCL or Lua or Apache change will meet that bar
[17:14:08] and my "almost" is really only for, like, fixing a typo in a comment
[17:15:36] (note that makes it a very different process from MW backport windows, which can surprise people at first -- but also different is that we can and do merge Puppet patches outside of any window)
[17:21:14] for the on-callers: me and nemo-yiannis have been working on https://phabricator.wikimedia.org/T389462 for maps1009, the import of new Open Street Map data is broken due to a postgres issue. Puppet is disabled on maps2* (currently DC depooled) and on maps1009 (postgres master) the OSM data import is stopped as well. We'll restart tomorrow.
[17:21:20] (sigh)
[17:21:42] Thanks for the update!
[17:31:14] Thanks all, I edited the page (feel free to improve/revert).
[17:32:10] brett: are you taking care of the rollout? feel free to, just checking
[17:37:21] I am running the tests in the meantime
[17:43:41] tgr_: ready to merge?
[17:43:57] you can update the commit message a bit, it's helpful for keeping track of VCL changes
[17:44:04] but other than that, read through the task so looks good
[17:49:49] sukhe: ready, updated the patch
[17:49:59] thank you!
[17:50:13] looks good, proceeding. is there a specific text host you want me to try first?
[17:52:30] the test request I am doing ends up on cp3071
[17:52:40] ok let's do that then, makes it easy
[17:52:46] rolling it out
[17:57:25] tgr_: please test, cp3071
[18:01:32] sukhe: it's working, thanks!
[18:01:44] ok great
[18:01:48] rolling it out everywhere
[18:13:30] tgr_: all done, should be everywhere, varnish reloaded
[18:14:49] thank you!
[22:04:36] we have authdns-update unhappiness: E: staged but uncommited changes present on dns1005.wikimedia.org
[22:11:35] this is the authdns-update run: https://phabricator.wikimedia.org/P74288
[22:13:12] kamila_: looking at what the actual diff is, like so:
[22:13:18] [dns1005:/srv/authdns/git] $ git diff --staged
[22:13:25] -idp 5M IN CNAME idp2004.wikimedia.org.
[22:13:25] +idp 5M IN CNAME idp1004.wikimedia.org.
[22:13:35] now let's see who was working on idp.. hmm
[22:14:46] back in January.. but something like this https://phabricator.wikimedia.org/rODNSf00bb1e1caf4b0c8efecfa96a4dd663d44f6022c
[22:15:15] sukhe: how about editing directly in /srv/authdns/git to revert a staged but not committed change
[22:15:26] thanks a lot mutante <3
[22:15:59] brett: do you know about the idp diff by any chance?
[22:17:01] what about it?
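A minimal sketch, in Python, of how one might check a checkout such as /srv/authdns/git for staged-but-uncommitted changes before running an update. This is not how authdns-update itself performs the check; the function names and the sample path in the usage note are hypothetical. It assumes only the documented `git status --porcelain` (v1) format, in which the first column of each line is the index (staged) status.

```python
import subprocess


def staged_paths(porcelain: str) -> list[str]:
    """Return paths that have staged (index) changes.

    `porcelain` is the text printed by `git status --porcelain` (v1 format):
    column 1 is the index status, column 2 the worktree status, then a space
    and the path.
    """
    paths = []
    for line in porcelain.splitlines():
        if len(line) < 4:
            continue  # too short to hold "XY path"
        index_status = line[0]
        # ' ' = nothing staged, '?' = untracked, '!' = ignored
        if index_status not in " ?!":
            paths.append(line[3:])
    return paths


def check_repo(repo_dir: str) -> list[str]:
    """Hypothetical helper: run git in repo_dir and list staged paths."""
    out = subprocess.run(
        ["git", "-C", repo_dir, "status", "--porcelain"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return staged_paths(out)
```

Calling `check_repo("/srv/authdns/git")` on the host would list anything left in the index; an empty list means nothing is staged. The parsing is kept separate from the `git` call so it can be exercised without a repository.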
[22:18:12] brett: hasn't been committed and is breaking authdns-update
[22:18:17] on dns1005 there is a change in /srv/authdns/git that is not committed and synced to other DNS servers
[22:18:30] hmmmm why is that I wonder
[22:18:33] it's about flipping idp1004 and 2004
[22:18:39] moritzm: You know about this?
[22:20:17] brett: given it's 23:19 here and there, I don't expect him to answer right now :D
[22:20:38] * kamila_ is going to stash it unless somebody stops them
[22:22:02] hold on
[22:22:17] Any chance it'll break something?
[22:22:57] bblack: was this you?
[22:23:20] I know you and sukhe were discussing authdns stuff yesterday
[22:35:30] catching up
[22:36:46] sukhe: root@dns1005:/srv/authdns/git# git diff --staged
[22:36:59] Local zone files are NOT in sync with operations/dns.git (SHA: local is 6cd7aaeec73aecd5c403d6fb1332588352b17d4d, dns.git is e3b357f13a60cfef9cb0c2d4c223ddd8ab524924)
[22:37:35] interesting, so this has been broken for a while
[22:37:44] sukhe: we know the diff.. we just don't know if we should touch that repo at all
[22:37:45] yaaaay
[22:37:46] this should be paging
[22:37:52] oooh.wow
[22:37:59] ok going to depool dns1005
[22:38:02] to unblock you
[22:38:19] thanks sukhe <3
[22:38:31] kamila_: mutante: please try now, it should be depooled for authdns-update
[22:38:34] the deployment server switch is not super urgent btw... but this probably is :D
[22:38:36] and I will take care of it later, finishing up dinner
[22:38:44] ok, will do, thank you <3
[22:38:47] thank you!
[22:40:16] so we have this alert for this reason but it fired a while ago and got lost in the usual IRC noise
[22:40:29] will think about a fix that is not paging but perhaps a separate email to traffic or something
[22:40:55] that seems like a good idea :D thanks a lot sukhe <3
[22:41:38] (and confirming, authdns-update is happy now)
[22:41:38] no worries! sorry about this.
we will fix it to alert better
[22:43:07] I think email notifications or automatic tickets are the fix for most alerts. just IRC/alerts web UI doesn't alert enough
[22:43:50] mutante: yeah. I am of the opinion that this should be paging but I won't go there :P
[22:43:54] * kamila_ thinks it'd alert enough if it weren't alerting too much, but we're not solving that anytime soon
[22:44:17] anyway we will take care of dns1005 later. it should be all good, it's depooled for everything so it doesn't matter
[22:44:19] kamila_: the nice part about your change is that "deployment.eqiad.wmnet" is not a lie anymore and actually eqiad again :)
[22:45:38] mutante: happy to be the person who isn't lying, unlike whoever runs the next one :D
[22:47:03] We have far too many critical alerts, so the stuff that is actually critical is swamped out
[22:48:18] I don't need to know that apt's daily upgrade failed on a hadoop cluster
[22:49:17] kamila_: I forgot the $reason but there was one because I think we suggested before to name that deployment.svc.wmnet or something like that that is neutral
[22:49:38] brett: fully agree. just like on Icinga. replacing the tool did not change that part
[22:50:25] mutante: I was suggesting the same more than a year ago and I didn't find a reason not to, just didn't find time either '^^
[22:55:09] kamila_: cool! I thought I had an old abandoned change about this but can't find it now.
[22:56:02] I think just changing the name would be a lot easier than also trying to fix the $reason why it's hard-coded in hieradata
[22:56:11] which is what I was trying to do at the same time and shouldn't have :D
[22:56:18] once suggested to have generic names for bastions but concern there was "We shouldn't implicitly train users to just blindly respond "yes" and TOFU, nor for them to ignore host key validation errors when e.g. we rotate systems."
[22:56:22] * kamila_ makes a note and adds it to the notes for next time they have free time :D
[22:56:23] which is what we are doing here I guess
[22:56:29] true
[22:56:37] oh, yes, that was why!
[22:56:40] this was from 2019 on https://gerrit.wikimedia.org/r/c/operations/dns/+/489103
[22:56:54] host keys were why $something was unhappy and then I didn't think it was worth it
[22:57:08] right
[22:57:16] you probably need the actual host names in hiera for rsync and firewalling
[22:57:47] yep
[22:58:37] alright, it's late there. let's see :)
[22:59:00] hey folks, are MW deployments allowed at the moment?
[22:59:13] TimStarling: not right now, I'm switching the deployment server
[22:59:23] ok, no problem
[22:59:25] but I would be happy to offer you a guinea-pig-shaped slot in a bit!
[23:00:02] I have https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1129957 ready for scap backport whenever you're ready
[23:00:27] I'm not entirely sure what "in a bit" means right now '^^ so I'll ping you once it's looking good and/or deploy it myself
[23:00:32] TimStarling: ^
[23:01:48] technically we should have downtimed releases* servers before switching deployment servers because they pull from them
[23:02:03] it's resolved now but does create tickets https://phabricator.wikimedia.org/T389570
[23:02:14] which I am closing now
[23:02:18] mutante: that appears to be missing in our docs, somebody should do something
[23:02:32] sorry about that, I'll do that tomorrow
[23:02:51] kamila_: :) I got some of those stickers to LA :)
[23:03:00] excellent :D
[23:03:16] thanks, it's definitely minor
[23:10:06] I'll merge that patch, CI takes about 12 minutes so I may as well get that out of the way
[23:11:17] sounds good TimStarling
[23:11:57] meanwhile I'm running a test deployment, so I'll be done soon assuming it works :D
[23:13:07] The dns repo issue has been fixed, I'll repool
[23:13:28] nice
[23:14:44] thanks brett!
[23:41:23] deployment server switch done :-)
[23:43:50] nice!
thank you :)
[23:45:10] thanks to everyone who worked to make it a thing that didn't blow in my face at my midnight :D and cheers to jasmine_ for shadowing+moral support :-)
[23:45:21] *blow up too, and with that, good night :D