[08:01:56] hello, fyi I'm going to depool eqsin in 1h to upgrade the switch stack
[08:23:56] <_joe_> ack
[09:07:21] how do I downtime the paging monitoring checks towards the LVS VIPs?
[09:07:47] for eqsin in my case, I used to do it in icinga, but they're not there anymore
[09:11:14] godog: ^ maybe?
[09:12:24] XioNoX: easiest is to add a silence for severity=page and site=eqsin
[09:14:28] godog: nice, thanks!
[09:14:47] I could even just add a silence for site=eqsin
[09:15:00] XioNoX: sure np, yes indeed you can
[09:15:10] godog: is there a way to see all the relevant alerts? filtering with https://alerts.wikimedia.org/?q=%40cluster%3Dwikimedia.org&q=site%3Deqsin doesn't show anything
[09:16:35] XioNoX: yeah that would show alerts that are firing, alerting rules configured in eqsin are shown at https://prometheus-eqsin.wikimedia.org/ops/classic/alerts
[09:17:16] cool, thanks!
[09:17:32] I need to write it down somewhere
[09:18:25] godog: is that expected https://alertmanager-eqiad.wikimedia.org/#/silences/60efd545-d505-4346-94f3-5a57dcb253ec ?
[09:18:37] SSL_ERROR_BAD_CERT_DOMAIN
[09:19:29] XioNoX: yes, a misleading link in the alerts.w.o UI
[09:19:40] i.e. not meant for public consumption
[09:23:11] noted, thx!
[09:47:30] fyi, I'm going to reboot eqsin's switch stack in ~5min, most of it has been downtimed, and the site is depooled, but there will be some alerts. Downtime should last about 20min
[09:49:27] XioNoX: ack, tnx
[09:55:07] alright, rebooting in a few seconds
[09:58:56] watching the boot sequence on the console
[10:05:13] alright, switches up, but components still booting
[10:06:20] alright, fully up
[10:06:29] so about 10min downtime total
[10:11:44] great!
[10:12:02] yeah that was smooth
[10:13:12] nice work!
[10:13:30] testing a few things while the site is depooled then will repool it
[10:16:43] sweet, even the mgmt_junos works on that version
[10:17:01] \o/
[10:17:34] it doesn't work on the fasw running 21.2, but works here on 21.4
[10:18:20] I'll let it sit for a bit then repool the site
[10:19:07] huh... i'm somewhat amazed it was introduced that late (assuming it was added between 21.2 and 21.4)
[10:19:46] it was supposed to work before, so maybe it was a bug that got fixed
[10:20:00] yeah that might be more likely
[10:37:25] eqsin repooled
[10:54:30] <_joe_> ack
[12:10:07] I just had this during a reimage: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='netbox.discovery.wmnet', port=443): Read timed out. (read timeout=5.0)")': /api/ipam/ip-addresses/?device=db1224&interface=mgmt&limit=0
[12:12:00] marostegui: transient or it failed?
[12:12:11] it was right after starting the reimage
[12:12:16] then it seems to have proceeded fine
[12:12:29] it is the first time I see it, hence reporting it :)
[12:13:20] that's an automatic retry of requests (via urllib3's Retry) of a timed-out API call to netbox
[12:23:58] so it should be harmless AFAICT
[15:51:57] kamila_: I think we concurrently submitted puppet changesets :)
[15:52:39] apparently
[15:52:43] urandom: mine can be merged
[15:52:53] ok, merging
[15:53:32] thanks
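A minimal sketch of the silencing approach godog describes at 09:12, creating a silence that matches severity=page and site=eqsin through the Alertmanager v2 silences API. The URL below is a placeholder (the real API endpoint is not public, per 09:19), and the same silence can equally be added through the alerts.wikimedia.org UI or other tooling; this is just one way to do it, not the prescribed one.

```python
# Sketch: add a 1-hour Alertmanager silence for paging alerts in eqsin,
# matching the severity=page / site=eqsin matchers suggested at 09:12.
# ALERTMANAGER is a placeholder URL, not the real (non-public) API host.
from datetime import datetime, timedelta, timezone

import requests

ALERTMANAGER = "https://alertmanager.example.org"  # placeholder

now = datetime.now(timezone.utc)
silence = {
    "matchers": [
        {"name": "severity", "value": "page", "isRegex": False},
        {"name": "site", "value": "eqsin", "isRegex": False},
    ],
    "startsAt": now.isoformat(),
    "endsAt": (now + timedelta(hours=1)).isoformat(),
    "createdBy": "xionox",
    "comment": "eqsin switch stack upgrade",
}

# POST to the v2 silences endpoint; the response contains the silence ID.
resp = requests.post(f"{ALERTMANAGER}/api/v2/silences", json=silence, timeout=10)
resp.raise_for_status()
print("silence ID:", resp.json()["silenceID"])
```

Dropping the severity matcher, as XioNoX notes at 09:14, widens the silence to every alert with site=eqsin rather than only the paging ones.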
[16:34:58] how is https://config-master.wikimedia.org/known_hosts populated? how does a key get from a server to that file?
[16:35:25] urandom: puppetdb i suspect
[16:35:25] urandom: the data is pulled from puppetdb
[16:35:52] taavi: yeah, I meant how does it get into puppetdb
[16:36:21] urandom: exported resources
[16:36:26] I guess there is a scheduled job somewhere in that path, a key I was waiting for just showed up...
[16:36:35] <_joe_> urandom: puppet reports facts and exported resources when the agent runs
[16:36:37] puppet needs to run
[16:36:44] no I think it's a fact and not an exported resource
[16:36:49] <_joe_> and the puppet master saves the data to puppetdb
[16:36:49] on the source host first and then on the target
[16:36:56] <_joe_> taavi: I think it's a fact now, yes
[16:36:57] it might be a puppetdb query nowadays though
[16:37:38] oh, no, it's an exported resource. see ssh::known_hosts in puppet
[16:37:54] oh, so it was waiting for puppet to run on the puppet server?
[16:38:14] on the config-master ones
[16:38:22] gotcha, ok
[16:38:59] but during a reimage the cookbook forces a puppet run there
[16:39:07] on O:config_master
[16:39:28] ah, yes, I think I've seen logged output to that effect when reimaging
[16:46:03] who owns the vrts / ticket.wikimedia.org infrastructure? I need to add a suitable tag to T354484
[16:48:39] Emperor: collaboration services does, #vrts should work
[16:50:53] thanks :)
[16:58:51] thanks, we just processed 2 other tickets from the same person
[16:59:24] tag is "Znuny"
[16:59:42] and the one you added, ack
[17:00:16] mutante: should Znuny get added to https://phabricator.wikimedia.org/project/manage/1025/ and the corresponding clinic duty query?
[17:00:51] what's the difference between the #vrts and #znuny tags?
[17:00:57] oops, thanks mutante
[17:01:28] Emperor: just the tag? sure, I made an edit right now
[17:02:02] taavi: about 2 years now ;-)
[17:02:16] rzl: nothing was wrong :) all good
[17:02:45] mutante: thanks, that should save future clinicians some hassle
[17:02:53] taavi: none, really, it should probably redirect
[19:54:19] taavi: out of curiosity, what does "make puppet re-generate" mean at https://sal.toolforge.org/log/g3CL6owBhuQtenzveVYJ?
[19:54:24] a puppet agent run?
[19:55:38] sukhe: I added a comment to one of the files in /etc/envoy/clusters.d so that the next puppet run would see a change and trigger the config generation script
[19:57:39] hmm ok https://puppetboard.wikimedia.org/report/testreduce1002.eqiad.wmnet/f2cb3bb545b3836bf5860779c436c0502e2b914e
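The exported-resource flow discussed around 16:36–16:38 can be illustrated with a query against PuppetDB's v4 resources API: each host exports its SSH host key during an agent run, PuppetDB stores the exported resource, and the config-master hosts collect the keys and render the known_hosts file on their next run. The sketch below is a generic illustration only: it assumes host keys are exported as standard sshkey resources and uses a placeholder PuppetDB URL; the actual ssh::known_hosts define in the Wikimedia puppet tree may collect and render the keys differently.

```python
# Sketch of the exported-resource pattern behind a known_hosts file:
# query PuppetDB for every exported sshkey resource and print
# known_hosts-formatted lines. Generic illustration with a placeholder
# URL, not the actual ssh::known_hosts implementation.
import json

import requests

PUPPETDB = "https://puppetdb.example.org:8081"  # placeholder

# AST query: all resources of type Sshkey that were exported by an agent.
query = ["and", ["=", "type", "Sshkey"], ["=", "exported", True]]
resp = requests.get(
    f"{PUPPETDB}/pdb/query/v4/resources",
    params={"query": json.dumps(query)},
    timeout=10,
)
resp.raise_for_status()

# known_hosts format: "host,alias1,alias2 keytype base64key"
for res in resp.json():
    params = res["parameters"]
    aliases = params.get("host_aliases") or []
    if isinstance(aliases, str):
        aliases = [aliases]
    names = ",".join([res["title"]] + aliases)
    print(f'{names} {params["type"]} {params["key"]}')
```

This also matches the timing observed in the conversation: a new key only appears after puppet has run on the source host (to export the key) and then on the config-master hosts (to collect it and regenerate the file).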