[08:19:49] cdanis: thanks for making a gcal thing for the oncall shifts :) How often does it update? I swapped a couple of shifts round just now, and wondering how long I should expect it to take for them to be reflected in your ics
[08:22:14] Emperor: IIRC a few hours up to ~24h, because some headers/config to refresh it more often are not respected. But wait for Chris for the authoritative answer ;)
[08:26:20] thanks :)
[13:32:00] Emperor: the ICS file in my homedir is updated every hour; however, Google scrapes it approx 2x/day at random times and I seem to have no control over that
[13:33:04] cdanis: thanks, useful to know
[13:34:04] I was close enough :D
[13:37:13] there is even a "requested refresh interval" metadata field in the ICS spec, which I set to 4h, but it doesn't seem to have an effect
[13:37:36] (and similarly I set the cache-control header, also not respected)
[13:48:21] moritzm: could you please update .users for pwstore? I've added my PGP key to keys/
[13:55:40] dhinus: moritz is out, you should ask mutante when he's online (most likely later due to his TZ)
[13:55:53] thanks, will do!
[14:24:13] _joe_: want me to merge 'add cgoubert to ldap_only_users'?
[14:24:30] <_joe_> andrewbogott: thanks, I was explaining to claime how that works :P
[14:24:47] I can back out if you want to use it as a lesson
[14:24:54] Thanks andrewbogott :D
[14:24:55] <_joe_> claime: so now andrew is on a puppet master server, and is running the "puppet-merge" command
[14:25:07] <_joe_> andrewbogott: he still doesn't have access to production servers
[14:25:09] <_joe_> don't ask
[14:25:16] ah, ok, I'll merge then
[14:25:20] <_joe_> yep
[14:25:24] done
[14:25:42] I suppose it's a custom script/alias that merges into your puppetmaster code ?
[14:29:33] claime: yes, across multiple puppetmasters. We don't automatically pull the latest from gerrit because we want one more security layer between gerrit and actual prod deployment
[14:29:50] (that's largely not the case for other gerrit repos but puppet is extra scary)
[14:30:23] andrewbogott: Yeah, makes sense because of the splash zone and gerrit's attack surface
[14:30:33] yeah
[14:34:19] claime: it's actually a .sh script wrapping a .py script for no good reason :)
[14:34:42] if you are morbidly curious, https://gerrit.wikimedia.org/g/operations/puppet/+/production/modules/puppetmaster/files/puppet-merge.sh and https://gerrit.wikimedia.org/g/operations/puppet/+/production/modules/puppetmaster/files/puppet-merge.py
[14:36:56] cdanis: No good reason or no known good reason? :D
[14:37:13] important but not urgent :)
[14:37:28] tech debt
[14:37:58] A pattern emerges :p
[14:40:27] (a normal one to be fair, if it ain't broken...)
[15:07:43] * jbond remembers that Gerrit:544943 is bitrotting
[16:10:07] <_joe_> claime: https://wikitech.wikimedia.org/wiki/Puppet#Making_changes
[16:10:14] <_joe_> this is how stuff is distributed
[16:10:33] <_joe_> except the server names are all from a distant past
[16:10:37] <_joe_> hello, palladium
[16:12:22] I'm guessing the late palladium is the deployment server, and then spread over to other puppetmasters for LB?
[16:17:16] oh the puppetmaster arch is described later in the page
[16:17:18] cool
[16:40:27] palladium once was the puppet master and became puppetmaster1001. the deployment server (for mediawiki and other things) was "tin" and then became deploy1001
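[Editor's note: the puppet-merge discussion above (14:25–14:34) links the real puppet-merge.sh/.py, which are authoritative. Below is only a minimal, hypothetical bash sketch of the pattern being described: fetch from gerrit, show the operator the diff, require confirmation, then merge and propagate. The repo path, remote, branch, and sync step are illustrative assumptions, not the WMF implementation.]

```bash
#!/bin/bash
# Hypothetical sketch of a "review-then-merge" gate, loosely modelled on the
# puppet-merge flow described in the chat above. NOT the real WMF script:
# the checkout path, remote, branch name, and sync step are assumptions.
set -euo pipefail

REPO=/var/lib/git/operations/puppet   # assumed local checkout on a puppetmaster
BRANCH=production

cd "$REPO"
git fetch origin "$BRANCH"

# Show exactly what would land before anything is applied.
git log --stat HEAD..FETCH_HEAD
read -rp "Merge these changes? [y/N] " answer
[[ "$answer" == "y" ]] || { echo "Aborting, nothing merged."; exit 1; }

git merge --ff-only FETCH_HEAD

# In the real setup the merged commit is then propagated to the other
# puppetmasters; here that step is only stubbed out.
echo "Would now sync the merged commit to the remaining puppetmasters."
```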
[16:57:16] Metals? Transitions? There's a chemistry joke in there somewhere
[16:59:21] inflatador: https://wikitech.wikimedia.org/wiki/SRE/Infrastructure_naming_conventions#Miscellaneous_servers
[16:59:30] mostly sunset by now ;)
[17:00:29] volans: nice. I already confused one of my co-workers by naming a local VM "osmium" ;P . I guess it's a pretty common theme for servers
[17:01:59] chemical elements just did not have enough members for us, and reusing names in prod sucks every time for someone :)
[17:02:55] well, and "palladium is the puppetmaster" is harder to remember than "puppetmaster1001 is the puppetmaster"
[17:02:58] star names in codfw did not have that limitation but almost nothing is "misc" anymore now. everything became a 'cluster'
[17:03:20] we have enough stuff to keep track of without inventing new layers of indirection for ourselves
[17:03:33] "Hello, Large Hadron Collider? We're running out of names for our servers, can you bombard some stuff and make new elements for us? Kthx bye"
[17:04:04] :) reminds me when I picked "ununpentium"
[17:04:14] element 115
[17:05:40] Nice
[17:08:25] https://wikitech.wikimedia.org/wiki/Category:Servers - these all use Template:Server to create a page for the host. ex: https://wikitech.wikimedia.org/w/index.php?title=Bast4003&action=edit it's not as common anymore but can be nice for hosts with a lot of shell users outside root
[17:52:18] puppet fails on alert* servers and the alert servers are alerting about that. the reason is "Unknown resource type: 'monitoring::alerts::traffic_drop'"
[17:52:31] brett: ^
[17:52:39] (smells like https://gerrit.wikimedia.org/r/814894, sorry if that's a misdirect)
[17:53:12] /modules/profile/manifests/prometheus/alerts.pp, line: 195,
[17:53:59] noooooo
[17:54:11] But jenkins was happy :(
[17:55:50] fwiw another good tool is PCC, https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler which simulates actually running puppet on some host -- it doesn't automatically run on every change, because it's not clear what hosts it should run on
[17:56:16] that would probably have saved you here, but it definitely isn't your fault that jenkins lulled you into a false sense of security
[17:56:34] It seems like an intermediary step of setting ensure to absent was needed
[17:57:02] Given that it's already pushed, what's the best way to remove the alerts?
[17:57:34] I haven't dug into the change -- is it something you can revert, deploy that, and try again?
[17:58:05] so first of all, don't worry, it's not like icinga is down
[17:58:20] because puppet won't restart it and the config has no errors
[17:58:45] it's just the puppet runs failing, so it only becomes a problem after some time
[17:58:47] oh, yes :) I definitely should have said that first! sorry for worrying you extra
[17:58:57] I appreciate it :)
[17:59:11] now what you can do is ACK the alerts
[17:59:28] this one: +jinxer-wm> (PuppetFailure) firing: (2) Puppet has failed on alerting hosts
[17:59:34] (I'm stepping back, mutante's got you from here, but happy to help if you need anything)
[17:59:40] thanks a lot, rzl
[18:00:02] https://gerrit.wikimedia.org/r/817844 should fix the failure
[18:00:03] but that's not icinga-wm, it's jinxer-wm, so that's the newer type of monitoring
[18:00:27] zabe: this is as much a training opportunity as it is anything else ;) but thank you for being on top of it!
[18:00:39] ah, sorry
[18:00:49] not at all <3
[18:00:56] removing all the occurrences of that check across the code is good
[18:01:44] Okay, so what is the best way to ack the alert? jinxer, not icinga, right?
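[Editor's note: the canonical answer to the "how do I ack the alert" question above is given a few messages below (the alerts.wikimedia.org UI and the wikitech Alertmanager page). As a hedged aside, the same silence can usually be created from a shell with Alertmanager's amtool, assuming it is installed and can reach the Alertmanager API; the URL, matcher, comment, and duration here are illustrative.]

```bash
# Hedged example: silencing a firing Alertmanager alert from the CLI instead of
# the web UI. The Alertmanager URL, alert name, and duration are assumptions.
amtool silence add \
  --alertmanager.url=http://localhost:9093 \
  --comment="puppet failure on alert hosts, fix in review" \
  --duration=2h \
  alertname=PuppetFailure

# List active silences to confirm it took effect:
amtool silence query --alertmanager.url=http://localhost:9093
```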
[18:02:02] for icinga changes specifically I would recommend: merge in puppetmaster, run puppet on alert1001, run 'sudo icinga -v /etc/icinga/icinga.cfg' if the icinga config was changed
[18:02:19] that last one is just checking the config for syntax errors/warnings
[18:02:43] More generally, it sounds like final checks of configs are advisable, e.g. if nginx changes I'd run nginx -t on the affected hosts
[18:02:57] brett: yea, so I think this means go to alerts.wikimedia.org and "silence this alert"
[18:03:15] or the downtime cookbook can do it for both icinga and alertmanager I think
[18:04:21] though the terminology isn't the same in Icinga and alertmanager. before we had separate "scheduled downtime", ACK of an ongoing alert, and "disable notifications". now we have a simpler model of "either it's silent or not". but observability might correct me there
[18:06:05] merged in the changes and just ran puppet-merge, should hopefully be good now but I'll verify ;(
[18:06:22] brett: https://wikitech.wikimedia.org/wiki/Alertmanager#Silences_&_acknowledgements
[18:06:31] brett: :)
[18:06:41] * brett opens the tab for later reading
[18:07:34] yea, use the 'run-puppet-agent' wrapper on alert1001
[18:08:09] ugh, more to remove
[18:08:13] yikes, well
[18:08:24] I'm learning, that's for sure
[18:10:31] broken puppet runs that are fixed within hours are not a big deal. it only becomes one after a couple of days
[18:10:53] Yeah, glad I didn't take icinga down or something
[18:11:11] the good news is, you'd probably have to be trying
[18:11:16] https://gerrit.wikimedia.org/r/c/operations/puppet/+/817866
[18:12:05] in a distant past puppet would always restart the icinga service, and one syntax error (usually a new user getting onboarded as an icinga contact with a typo or something) would take down Icinga. but then we fixed that part and it won't do it if that config check fails. also in this case it did not break the actual icinga config
[18:12:16] nice!
[18:12:27] I guess I should use the puppet compiler rather than submit one change then another
[18:12:59] yea, puppet compiler is a good habit. there are some niche cases even that won't catch, but almost all
[18:14:23] or you could also do something like "disable puppet on alert* via cumin", "merge patch", "enable puppet on just one host (an inactive one for example)", check, re-enable puppet on all. if it affects many hosts
[18:17:26] Okay. Failing pcc, so I'm going to work on getting that running successfully and will then submit a more cohesive change
[18:18:34] sounds good!
[18:21:13] Thanks so much for the help, mutante
[18:23:22] you're welcome
[18:35:36] When I'm adding an existing admin group to a new host that is just replacing an old host, that is not an access request at all and does not need any reviews. Would you agree? So if you are not changing who is in a group and also not what a group can do, only on which machines the group is applied.
[18:36:59] mutante: that sounds correct to me
[18:38:08] thanks, ok
[19:29:46] topranks: A question for you when you have a moment: https://phabricator.wikimedia.org/T313977
[19:47:44] andrewbogott: I’ve no objection in principle, replied to the task there.
[19:48:39] Certainly the /29 that’s already reserved for WMCS should not be an issue I think.
[19:49:29] thank you! And yeah, we hope that that /29 is enough
[19:49:48] (honestly one more might be enough)
[19:58:00] Ok cool. Well we can set that up for now I guess, can you respond on the task and let me know where it needs to be routed?
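[Editor's note: the staged "disable puppet, merge, test on one host, re-enable" rollout suggested at 18:14 might look roughly like the sketch below from a cumin host. The disable-puppet/enable-puppet/run-puppet-agent wrapper names follow the ones mentioned in the conversation, but the host selectors, the choice of test host, and the reason string are illustrative assumptions.]

```bash
# Sketch of the staged rollout described at 18:14: pause puppet everywhere the
# change applies, merge, verify on a single host, then re-enable the rest.
# Host selectors, the chosen test host, and the reason text are assumptions.

# 1. Pause puppet on all affected hosts, with a reason other operators can see.
sudo cumin 'alert*' 'disable-puppet "rolling out alerts change - <your name>"'

# 2. Merge the patch on the puppetmaster (puppet-merge, as discussed earlier).

# 3. Re-enable and run puppet on one (ideally inactive) host and check the result.
sudo cumin 'alert2001*' 'enable-puppet "rolling out alerts change - <your name>"'
sudo cumin 'alert2001*' 'run-puppet-agent'

# 4. If that looks good, re-enable and run puppet on the remaining hosts.
sudo cumin 'alert*' 'enable-puppet "rolling out alerts change - <your name>"'
sudo cumin 'alert*' 'run-puppet-agent'
```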
[20:00:26] topranks: yes, or at least I can make a suggestion by analogy :)
[20:43:11] I have ACKed all alerts for now
[20:43:16] same here
[20:43:22] this is the middle of the deployment
[20:43:25] don't think it was the deploy fwiw
[20:43:28] but have they finished it?
[20:43:38] (or at least, the patches listed..)
[20:43:46] I am on one of the servers, randomly picked
[20:43:50] mutante: I believe 2 patches were on debug
[20:43:57] and I can get on it and php-fpm and apache are running
[20:43:58] But none pushed further
[20:44:06] cjming: can you confirm status?
[20:44:11] also wiki works for me
[20:44:13] cjming: ^ please confirm?
[20:44:13] Had anything gone past debug?
[20:44:20] what status?
[20:44:38] cjming: of your deployment
[20:44:43] was this in the middle of the deployment?
[20:45:00] yes - i was merging several patches
[20:45:05] only sync'd one
[20:45:33] cjming: which got synced?
[20:46:06] actually i sync'd 2 - 1st was https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/817373
[20:46:17] 2nd was https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/817893
[20:46:40] so we need to revert this then, given the timing matches?
[20:47:22] sukhe: I'm not sure. That's a good 15 minutes before the deploy. I can't see a slow drop. It's a very quick change in workers available.
[20:48:36] sukhe: does php's slow log give any indication
[20:48:42] checking
[20:48:56] Very little I can say because I have no real access
[20:49:20] sukhe: those don't seem particularly risky to cause this kind of breakage, so I'd hold off reverting for now
[20:49:36] yeah thanks, I am not doing anything without making sure that there is consensus
[20:50:43] marostegui: could this be you?
[20:50:47] I don't think it was the deployment
[20:50:49] Given your phab comment
[20:51:00] RhinosF1: Not me no, mariadb
[20:52:15] marostegui: thanks
[20:52:21] sukhe: https://phabricator.wikimedia.org/T313986 got created
[20:52:40] thanks RhinosF1 <3
[20:52:44] RhinosF1: you are mixing lots of things here
[20:52:51] and it is too late for me to discuss anyways
[20:53:14] marostegui: appreciate you coming out and helping, thanks <3
[20:53:19] we will take it from here
[20:53:28] (clearly, what you suggested worked since we didn't do anything else)
[20:53:56] sukhe: no problem, see the other channel for what I think was the issue
[20:54:02] yep
[20:55:53] we will keep investigating with mariadb why 10.6 hosts suffer that much during the cache busting attacks
[20:55:54] I'm off to sleep
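[Editor's note: for the "I am on one of the servers" and "php's slow log" checks mentioned around 20:43–20:48, here is a hedged sketch of what that quick look on an appserver might involve. The systemd unit names, the status endpoint, and the slow log path are assumptions and vary by PHP version and puppetization; they are not taken from the chat.]

```bash
# Hedged sketch of a quick appserver health check like the one described above.
# Unit names, endpoint, and log paths are assumptions; adjust to the host.

# Are php-fpm and apache actually running?
systemctl is-active php7.4-fpm.service apache2.service

# How saturated is the php-fpm worker pool right now?
# (assumes a local status endpoint is exposed at /fpm-status)
curl -s localhost/fpm-status || true

# Anything interesting in the slow request log?
sudo tail -n 50 /var/log/php7.4-fpm/slow.log
```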