[06:48:46] hi folks
[06:49:03] icinga shows ms-fe1012 as down/unpingable, but I can ssh in, and I can ping it from alert1001
[06:49:42] it was reimaged yesterday, so maybe something is up
[06:49:47] (EU time I mean)
[06:51:10] elukey@ms-fe1012:~$ ping ms-fe1007.eqiad.wmnet
[06:51:10] ping: ms-fe1007.eqiad.wmnet: Temporary failure in name resolution
[06:53:22] something is weird though, I see in site.pp
[06:53:22] node /ms-fe10(0[9]|1[0-2]).eqiad.wmnet/ { role(insetup)
[06:53:23] }
[06:53:30] but they have role swift::proxy
[06:53:52] ah node /^ms-fe1\d\d\d\.eqiad\.wmnet$/ {
[07:00:13] ok so the git blame says "Hosts are safe to have their role applied at any time", that is good
[07:09:41] cannot connect to lsw1-e1, but the host seems to be in row E
[07:10:52] the other nodes are in rows A, B and C, so maybe something is up with the config on the new switches
[07:13:46] Emperor: o/
[07:24:44] "Configured puppet masters send their facts to the puppet compiler db host (pcc-db1001.puppet-diffs.eqiad1.wikimedia.cloud) using the upload_puppet_facts systemd timer."
[07:24:47] wow
[07:27:09] brb
[07:46:09] created https://gerrit.wikimedia.org/r/c/operations/puppet/+/769382
[07:46:24] pcc https://puppet-compiler.wmflabs.org/pcc-worker1001/34151/
[08:01:08] if anybody wants to double check --^
[08:05:12] <_joe_> elukey: 5 mins
[08:06:47] moreover, the hosts are not taking traffic afaics, so it is fine
[08:55:13] elukey: hi
[08:55:18] hi :)
[08:55:58] I'm still waiting for the ms-fe1012 handover from the DC team (let me find the ticket)
[08:57:06] Emperor: yeah, the ticket was closed yesterday and now the host has the prod role
[08:57:17] right, yes, sorry, still catching up on overnight email https://phabricator.wikimedia.org/T294137
[08:57:19] but it seems that it has a networking issue, which is why I was asking
[08:57:45] elukey: sadness. From my POV, it's not yet in service
[08:58:34] * Emperor looks at your patch
[09:00:19] +1 from me, thanks - can you either reopen T294137 or make a new issue to track the networking issue, please?
[09:00:20] T294137: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137
[09:01:07] definitely yes, I was doing it
[09:01:12] going to merge and then update the task
[09:01:32] Brilliant, thanks :-)
[09:03:32] np :)
[09:05:53] Emperor: https://phabricator.wikimedia.org/T294137#7763185
[13:48:35] jbond: re https://gerrit.wikimedia.org/r/c/operations/puppet/+/769410, it is going to run on cp servers, right?
[13:49:32] depends, see the current discussion in the other channel ;)
[13:49:56] if the data is useful elsewhere we might go for a repo available everywhere
[13:50:21] right, if we go for the cp servers.. http_proxy support is mandatory :)
[13:51:07] that's needed pretty much anywhere :)
[13:51:28] vgutierrez: yes, I have added a comment to add http_proxy support, just not got to it yet
[13:52:08] if you export HTTP_PROXY/HTTPS_PROXY it just works
[13:52:13] so it can also be done at the puppet level
[13:52:17] up to you
[13:52:22] see https://2.python-requests.org/en/master/user/advanced/#proxies
[13:54:46] yep
[13:54:56] you can inject it on the systemd unit itself as env variables
[14:00:31] adding it to the systemd::timer job SGTM, will update
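A rough sketch of the env-var approach discussed above, assuming the systemd::timer::job define exposes an environment parameter that becomes Environment= lines in the generated unit; the job name, command and proxy URL below are placeholders, not the real ones, and parameter names should be checked against the actual define. python-requests honours HTTP_PROXY/HTTPS_PROXY from the environment by default, so the script itself would need no change.

```puppet
# Illustrative only: job name, command and proxy URL are invented.
systemd::timer::job { 'fetch_external_data':
    ensure      => present,
    description => 'Periodically download the external dataset',
    user        => 'root',
    command     => '/usr/local/bin/fetch-external-data',
    # Exported to the unit as Environment= entries; requests picks them
    # up automatically (trust_env is on by default).
    environment => {
        'HTTP_PROXY'  => 'http://webproxy.example.wmnet:8080',
        'HTTPS_PROXY' => 'http://webproxy.example.wmnet:8080',
    },
    interval    => {
        'start'    => 'OnCalendar',
        'interval' => 'daily',
    },
}
```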
[14:02:32] elukey: thanks (I get emailed about that ticket); I wonder if topr.nks thinks we should go ahead now or let them do more checking first (I'll ask there)
[14:03:50] Emperor: to put that host in production?
[14:05:22] well, back into the swift::proxy role first (and with a view to actually-in-production later, maybe Monday)
[14:06:42] Emperor: that's fine, please ping him or me before going actually-in-production, so we can have a closer look
[14:07:16] this kind of "something gets stuck" issue often never shows up again
[14:16:28] Emperor: +1 to put the host in production
[14:16:48] maybe we can keep it as inactive for this week
[14:16:59] before pooling real traffic
[14:17:21] (IIUC all the ms-fe nodes are behind an LVS VIP, and the new ones are still marked as inactive)
[14:17:32] elukey: correct
[14:17:54] elukey or XioNoX fancy a +1 to https://gerrit.wikimedia.org/r/c/operations/puppet/+/769443 ? :)
[14:18:02] already done :)
[14:18:15] wow, you're quick :)
[14:19:10] is the new host in row e/f?
[14:19:15] yes
[14:19:19] volans: yes, row e
[14:20:13] then I'd honestly wait
[14:20:47] the racks are still in the planned state in netbox, the switches are in staged and are skipped by the daily homer run, and last I checked their current config is not yet merged into homer
[14:21:14] topr.anks was working on that but has been sick the last few days
[14:21:46] volans: switch changes are still half manual, but that is not a blocker to start provisioning servers (and unblocking people)
[14:21:47] volans: their last comment on https://phabricator.wikimedia.org/T294137 suggests to me they think it's OK now
[14:22:03] in any case, I'm not going to push prod traffic to this node until at least Monday
[14:22:09] XioNoX: ok to provision, not sure about active alerts (paging ones) and live traffic
[14:22:31] I'd like to avoid unnecessary pages if possible ;)
[14:22:57] agreed, I asked above to ping netops before having actually-in-production servers
[14:23:00] volans: we are going to keep the ms-fe node as inactive for some days, just to double check that the ARP weirdness will not happen again
[14:23:22] so it will get the prod role but not traffic
[14:23:33] ack
[14:23:42] I'll set downtime for these new nodes until Monday
[14:23:43] it seems a good compromise for the moment, at least we have some testing
[14:23:59] +1
[14:24:23] Emperor: I'd propose to keep monitoring up; it will be easier to spot anomalies, and they don't really page
[14:24:34] OK, that's one less thing for me to do :)
[14:24:39] yes, at most downtime the ones that page
[16:57:29] hi dear SRE, could someone please puppet-merge a Gerrit email template change for me? There is a variable which no longer exists in the Gerrit version we are running https://gerrit.wikimedia.org/r/c/operations/puppet/+/768005
[16:58:28] that would remove a `null` string from plaintext notification emails
[17:03:21] hashar: done
[17:07:13] akosiaris: awesome! thank you for making the Gerrit emails slightly nicer \o/
[20:34:50] jbond: TIL replace=>false! thanks, I've learned so many things about Puppet from your code
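For context on that TIL, a minimal sketch of what replace => false does on a file resource; the path and content are invented for illustration. Puppet creates the file with the given content when it is absent, but leaves an existing file's content alone on later runs, which is handy for seeding a file that something else maintains afterwards.

```puppet
# Hypothetical path and content, purely to show the semantics.
file { '/etc/example/seed.conf':
    ensure  => file,
    owner   => 'root',
    group   => 'root',
    mode    => '0444',
    replace => false,   # content only applies if the file does not exist yet
    content => "initial-value: 42\n",
}
```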
[20:36:05] glad to see I am not the only one who greps for jbond's Puppet code :)
[20:37:10] :) thanks <3
[20:38:09] ccccccktdtgklrnjlchbcdregjeigdgrcenrttekhjvt
[20:38:48] * jbond that one was actually the yubikey, triggered by the cat
[20:40:43] jbond: that was your cat's +1 to what we said
[20:41:46] lol
[20:42:42] sorry jbond, I just thought of *another* thing with the original patch; not that I think we need to block on it, just have a TODO to fix eventually
[20:43:49] cdanis: no problem, just add whatever comments and I will get to them tomorrow
[20:44:20] not a big deal for an MVP
[20:44:31] ack
[21:07:40] btw jbond, I did the one fix that was necessary (a chmod) and the script now runs fine on sretest1002
[21:08:38] ahh, thanks cdanis :)
[21:08:55] I'll send a follow-up patch for that one myself
[21:09:30] <_joe_> jbond: btw, I think we can get the script to integrate with the etcd stuff I was doing quite easily.
[21:09:38] <_joe_> let's talk tomorrow :)
[21:10:50] _joe_: I agree, I think this patch is the best one https://gerrit.wikimedia.org/r/c/operations/puppet/+/769511/2; then we can just create rules that check that header value, but we need more feedback from traffic etc. first
[21:11:02] ah, you already did it
[21:11:24] cdanis: yes, I think I missed you on that one at first, only added you a few mins ago
[21:11:31] <_joe_> jbond: yeah, I was thinking of also centralizing downloading that file
[21:11:42] <_joe_> and storing the data on etcd, for a couple of reasons
[21:11:55] yes, that's something vola.ns also mentioned
[21:11:57] <_joe_> but again, I'd prefer talking about it in the morning
[21:12:07] yes, sure thing, let's chat then
[21:12:48] yeah, +1 on the idea of synthesizing a pseudo-header as well
[21:13:03] <_joe_> I'd love for that header to reach the backends too
[21:13:04] * jbond hopes joe realizes I rise later than he does :/
[21:13:09] <_joe_> ahah yes
[21:13:14] :)
[21:13:50] expect to be pinged around 7am :D
[21:14:52] lol
[21:15:16] <_joe_> nah, I will wait for 8 am, I have some mercy
[21:17:37] hehehe