[07:31:27] <_joe_> heads up: about to enable the new requestctl stuff, which includes adding some amount of new VCL on the varnishes
[07:31:41] <_joe_> this is the only vaguely dangerous passage of the process
[07:32:01] <_joe_> but I asked valentin to review the changes on friday and they seemed ok to him too
[07:43:47] what did I miss?
[07:43:52] Oh that :)
[07:56:03] ack
[08:39:50] <_joe_> Emperor / godog - there seem to be quite a few ms-be/ms-fe hosts where puppet has changes on every run
[08:40:00] <_joe_> usually on backends this means broken disks
[08:40:17] <_joe_> I see 5 hosts though, which doesn't seem right
[08:43:17] _joe_: I believe there is maintenance/upgrade ongoing
[08:44:53] <_joe_> jynus: yeah I figured, I'd want it confirmed though and possibly addressed
[08:45:06] <_joe_> trying to mount a nonexistent disk at every puppet run doesn't seem right.
[08:45:24] in our meetings they expressed having challenges on reimage, with slow recovery
[08:45:51] e.g. on reimage the disks were detected in the wrong order/failed hw
[08:49:05] probably worth acking the alerts then
[08:52:41] +1
[09:01:44] _joe_: ack, thanks, yeah as jynus said some if not all of those hosts are new hardware, I'll defer to Emperor
[09:01:58] * godog errand, bbiab
[09:06:50] <_joe_> XioNoX: it's impossible
[09:06:54] <_joe_> it's an aggregated alert
[09:06:58] <_joe_> for puppet changes
[10:36:55] _joe_: I meant all the red ms-be alerts
[13:26:05] Is the SRE meeting doc ready for updates? Can I update the date?
[15:04:07] Anyone looking at those "stale textfile" alerts ( https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale ) ? Just wondering since it seems to be firing for several disparate services
[15:09:38] looking at the cloud* hosts, thanks
[15:14:10] I did not see them, what cloud hosts?
[15:14:25] (maybe the alerts are gone already xd)
[15:14:48] already fixed
[15:15:33] one was a toolsdb server that was upgraded in-place buster->bullseye, had a leftover file that was generated by stock prometheus-node-exporter on buster but not on bullseye
[15:15:58] and the other was cloudcontrol2001-dev, which had remains from testing the wrapper around prometheus-openstack-exporter
[15:18:02] interesting, I guess I have to do some work to figure out elastic1075 then. Was hoping someone else would do my work for me ;P
[15:23:29] thanks taavi!
[17:13:38] <_joe_> for people who're too young to have got the reference: the "demo time" slide had an image referencing the most famous demo fail of microsoft
[17:14:20] <_joe_> when Bill Gates went on stage to show the magic of plug and pray connecting a scanner to a computer with the new shiny windows 98
[17:14:27] <_joe_> and that caused a blue screen of death
[17:14:35] https://www.youtube.com/watch?v=IW7Rqwwth84
[17:14:53] <_joe_> also, I envy you :P
[17:39:57] Nice deep cut
[19:47:24] <_joe_> inflatador: it was intended to be self-deprecating also because I was mostly winging the demo
[19:48:05] <_joe_> we finished implementing the features I've presented today at the end of last week, so I had just a bunch of hours to prepare for it
[19:48:37] You're either very good, very lucky, or both ;P
[19:48:40] <_joe_> (and indeed, there were a few hiccups)
[19:48:52] (probably both)
[19:49:15] <_joe_> I am surely comfortable improvising a talk in front of people, which is a big advantage.
[19:50:42] So what you're saying is, you're better than Bill Gates ;P . I agree!
[19:51:06] <_joe_> well no, just that requestctl is better than windows 98 :P
[19:51:25] <_joe_> which, you must admit, is an incredibly low bar
[19:52:08] "requestctl: Better than Windows 98"
[19:52:35] I'd make a PR to put that in the repo somewhere, but Legal probably wouldn't like it ;(
[19:53:21] <_joe_> yeah i prefer to shitpost only on my personal projects
[19:54:03] <_joe_> or, you know, you can blur the lines, like https://github.com/cdanis/tunnelencabulator
[19:56:58] nice! reminds me of https://botwiki.org/bot/mark-v-shaney/
[20:10:19] :)
[20:32:47] volans: to get a service IP: I would "reserve" an IP in netbox first, by clicking "Add an IP address" in the relevant prefix, add a comment about "Keep manual DNS.." and then actually take it in operations/dns with a traditional procedure like before netbox. It would be just like what was recently done for new gitlab machines. still correct?
[20:36:55] mutante: tl;dr yes, it's all documented in https://wikitech.wikimedia.org/wiki/DNS/Netbox#How_to_manually_allocate_a_special_purpose_IP_address_in_Netbox
[20:37:15] you need to run the dns.netbox cookbook too anyway also if it's a noop in prod
[20:37:55] volans: ah, thank you. last time I did something manual in DNS repo and ran it it _seemed_ like the sync was never _to_ netbox. thanks for pointing that out
[21:20:32] for the record, added in netbox, including entering the DNS name in netbox, ran the cookbook.. and my new name already resolves forward and reverse. everything works and the docs don't say I need to manually make a change in operations/dns (anymore/at all?)
[21:20:36] mutante: ah but you meant a different kind of service
[21:20:39] not an svc IP
[21:20:51] I wanted what I think is "role: secondary"
[21:21:02] so I picked that
[21:21:11] not behind LVS
[21:21:22] so the netmask should not be /32 but the same as the prefix
[21:21:31] role should be none
[21:21:41] volans: do I need to make the change in operations/dns ?
[21:21:46] no, in netbox
[21:21:47] since it's like it already works
[21:21:57] you followed the steps for an svc IP
[21:22:02] but this is not an svc one
[21:22:23] also, historically, the gerrit ones were managed manually because of the TTL, the ones generated by netbox have a 1H TTL
[21:22:41] I noticed the TTL part because jenkins just downvoted me for non-matching TTLs
[21:22:51] ACK, only needs manual change for .svc. IPs
[21:23:12] the previous records have type: VIP though even though they are not VIPs
[21:23:14] let me fix netbox, then you can re-run the dns.netbox cookbook and then recheck your CR
[21:24:03] I did enter the actual DNS name in netbox (unlike the gitlab stuff done recently)
[21:25:02] ok, thanks. so you are changing: netmask and role
[21:25:20] role: none? Ok, I wasn't sure. it's not secondary either?
[21:25:53] ok, I've changed netbox: https://netbox.wikimedia.org/extras/changelog/85615/ and https://netbox.wikimedia.org/extras/changelog/85616/
[21:26:04] you can re-run the dns.netbox cookbook that will remove those generated records
[21:26:20] and then you can reply recheck on your CR on the ops/dns repo
[21:26:20] I did manually get the mapped IPv6. ok, thanks. on it
[21:26:40] to handle this manually, because of TTL and because it's a unicorn
[21:27:06] because it uses an IP from a host ip subnet as a service IP, and in theory it shouldn't but it's like that for a number of reasons
[21:27:09] I would have abandoned the manual change now because you said it's only needed for svc records.
[21:27:35] should I not even enter DNS names in netbox in that case? and match gitlab-replica ?
[21:27:46] I've already fixed netbox
[21:28:30] mutante: how have you chosen the IP?
[21:28:41] that's in the wrong subnet actually
[21:28:53] gerrit2002 is in rowB
[21:29:09] the IP you chose is in public1-d-codfw
[21:29:56] gerrit2002 is supposed to be next to gerrit2001
[21:30:09] codfw row B / B5
[21:30:12] that's how I picked it. if the host is in a different row..then..yea
[21:30:24] vs codfw row D / D5
[21:30:24] D5: Ok so I hacked up ssh.py to use mozprocess - https://phabricator.wikimedia.org/D5
[21:30:31] they are in different rows
[21:30:41] and the IP will be able to migrate between hosts in the same row
[21:30:54] so if you plan to have that migrate from one host to another... it will not work
[21:31:27] ack, this means gerrit2002 is in the wrong place
[21:32:02] depends if you want gerrit to have row-redundancy or not ;)
[21:33:56] it's supposed to replace 2001. it's not in addition to it. there will still be just one per DC
[21:34:21] the cookbook finished. I think we can just delete all of this.
[21:35:40] do I just click the delete button in netbox and then run the cookbook again then?
[21:47:17] !log ganeti4002 rebooting for firmware update via T307997
[21:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:47:23] T307997: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997
[21:49:14] I deleted them from netbox and running the cookbook again. the process I followed does not say it's only for svc records. this is also not a VIP though so it's ..special among special IPs
[21:49:42] there was nothing to commit
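
For reference, the row constraint from the gerrit2002 discussion (an IP taken from a host subnet can only move between hosts in the same row, and should carry the prefix's netmask rather than /32) can be sketched as a quick check. This is only an illustration: the row names, prefixes and addresses below are placeholders from documentation ranges, not the real codfw allocations, and the helper names are hypothetical.

```python
import ipaddress

# Hypothetical per-row public prefixes (documentation ranges, NOT real codfw data).
ROW_PREFIXES = {
    "row-b": ipaddress.ip_network("198.51.100.0/26"),
    "row-d": ipaddress.ip_network("198.51.100.128/26"),
}


def row_for_ip(ip):
    """Return the name of the row whose prefix contains ip, or None."""
    addr = ipaddress.ip_address(ip)
    for row, prefix in ROW_PREFIXES.items():
        if addr in prefix:
            return row
    return None


def can_float_to(ip, host_row):
    """An address can only be moved to a host sitting in the row that owns it."""
    return row_for_ip(ip) == host_row


def with_prefix_netmask(ip):
    """Render ip with the netmask of its containing prefix instead of /32."""
    addr = ipaddress.ip_address(ip)
    for prefix in ROW_PREFIXES.values():
        if addr in prefix:
            return f"{addr}/{prefix.prefixlen}"
    raise ValueError(f"{ip} is not inside any known row prefix")


if __name__ == "__main__":
    candidate = "198.51.100.130"  # placeholder address in the row-d prefix
    print(row_for_ip(candidate))
    print(can_float_to(candidate, "row-b"))
    print(with_prefix_netmask(candidate))
```

Running a check like this before reserving the address would have flagged that an IP from the row D prefix cannot follow a host that lives in row B.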