[06:03:02] good morning folks [06:03:25] jbond: o/ after https://gerrit.wikimedia.org/r/c/operations/puppet/+/730852 it seems that all dns* hosts don't have systemd-timesync's config anymore [06:03:42] (and its wmf-restart unit complains about it) [06:04:50] ntp on them seems up and running, maybe they just need to point their timesync's config to localhost? Or is there a reason to not have it on those nodes? [06:09:21] ah yes the role has profile::systemd::timesyncd::ensure: absent [06:09:34] (dnsbox) [06:12:14] https://gerrit.wikimedia.org/r/c/operations/puppet/+/73187 should fix it [06:14:22] I'll wait for a sanity check before merging :) [06:17:04] elukey: your link is missing a 3 at the end - https://gerrit.wikimedia.org/r/c/operations/puppet/+/731873 [06:17:58] elukey: same as https://gerrit.wikimedia.org/r/c/operations/puppet/+/731843 :) [06:19:07] ah! Will abandon mine then :) [07:16:48] elukey: thanks, the error should be benign, ill look at the fix and merge in a sec thanks [07:46:33] good morning, I have a puppet compiler host selection question. I would like to compile profile::ci::docker for both production and WMCS instance that could use it [07:47:00] as I got it P:ci::docker would only select a single host from the production pool, and none from WMCS. So I am wondering how I can select the WMCS instances ;) [07:48:15] hashar: WMCS doesn't have puppetdb so you have to use hostname-based selection for those [07:48:49] awesome! 
thank you ;) [07:48:53] AFAIK https://openstack-browser.toolforge.org/puppetclass/ is not supported by PCC and I don't see any host with profile::ci::docker there so you have to pick which one are [07:48:59] based on what includes that profile [07:49:33] I will use the fqdn of one of the instances [07:49:56] you also need that the facts for those hosts were exported to the puppet compiler hosts [07:50:10] https://wikitech.wikimedia.org/wiki/Nova_Resource:Puppet-diffs#How_to_update_the_facts_for_cloud_VMs?_(e.g._INFO:_Unable_to_find_facts_for_host_util-abogott-stretch.testlabs.eqiad.wmflabs,_skipping) [07:50:38] yup they should be there hopefully [07:51:43] and I have another question, the role `gerrit` is applied to two different machines and there is some logic that cause one to be primary and the other an hot spare. `O:gerrit` would only compile for one and not the other [07:52:02] so I am wondering whether the Hosts: syntax can be abused to get the pppc to compile for all hosts matching [07:52:08] rather than hand picking a subset [07:52:28] (or I could just use the fqdn of each hosts and get exactly what I want) [08:08:08] [ 2021-10-19T08:07:37 ] CRITICAL: Build run failed: Exceeded 30 redirects. [08:08:09] not yet ;) [08:25:51] solved [08:25:57] volans: thank you for the direct link! [08:28:14] yw :) [09:40:34] Any thoughts on why a reimage gets stuck on https://phabricator.wikimedia.org/P17518 and then after a while it obviously timeouts and then the host reboots [09:41:03] marostegui: a) blame volans. b) what happens if you hit return at that prompt? [09:41:12] kormat: a) done already b) nothing [09:42:07] that _sounds_ like a problem with the installservers [09:42:21] marostegui: a) not directly [09:42:35] let me check the logs [09:42:49] volans: on my mind yes! [09:43:02] volans: where are the logs? 
I didn't see the path on the reimage output [09:43:44] <_joe_> kormat: I'm disappointed, I expected it to be: a) blame volans b) blame volans c) blame netbox [09:44:02] _joe_: good point. we should document that. [09:44:08] like any other cookbook, /var/log/spicerack/sre/hosts/reimage* https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks#Logs [09:44:20] volans: ah thanks :* [09:44:34] marostegui: so the cookbook is not seeing the host coming back [09:44:40] have you tried to connect to the console via mgmt? [09:44:46] <_joe_> kormat: I think this needs a template [09:44:57] volans: yes, that's why I am seeing it is getting stuck on the start of the debian installer [09:44:57] volans: that's.. what the paste above is [09:45:16] it doesn't say debian installer :D [09:45:31] volans: But it says buster :) [09:45:37] and the host is stretch [09:45:52] I am thinking if it could be a similar case of https://phabricator.wikimedia.org/T216240 [09:46:03] But in a different flavour [09:46:20] marostegui: no [09:46:27] you passed buster [09:46:28] db2112.codfw.wmnet with OS buster [09:46:34] as --os [09:46:40] volans: correct, cause I want the host to become buster [09:46:50] ok then I don't get what you were saying :D [09:46:59] * kormat tries to figure out if volans is just trolling [09:47:16] marostegui: do you have a d-i menu in front of you? [09:47:18] volans: So the host is stretch and I want it to be buster, and the installation gets stuck at the start of the debian buster installer [09:47:20] I can't install_console to it [09:47:29] so maybe it doesn't have connection [09:47:38] volans: no, the host has timed out and booted up from disk again [09:47:41] ok, definitely trolling at this point.
[09:47:48] sad_trombone [09:48:02] volans: but the host doesn't go past https://phabricator.wikimedia.org/P17518 [09:48:36] And nothing happens when I try to hit enter or edit the boot menu [09:48:38] sure but if it's not there I can't debug it :D [09:48:59] and that's why I asked for ideas :) [09:49:08] if you want to get it again in that state I can have a look [09:49:20] volans: sure, I can force it to PXE again [09:49:26] volans: do you want me to get out from the mgmt? [09:49:41] if you manually force PXE will not work as there will be no DHCP [09:49:51] you need the cookbook [09:49:58] uh? [09:50:10] Ah, ok [09:50:12] I get it [09:50:14] sure, let me try [09:50:31] volans: do you want to get on the mgmt to see it booting? [09:50:41] why not give me a sec [09:50:45] db2112.codfw.wmnet correct? [09:50:46] ok, let me get out [09:50:49] yep [09:51:07] console: Serial Device 2 is currently in use [09:51:12] try again [09:51:41] I'm in [09:51:47] ok, let me issue the reimage [09:52:17] thx [09:52:20] there you go! [09:55:58] marostegui: so I can ping it and I saw DHCP was ok [09:56:02] it's loading the kernel now [09:56:17] so might be a firmware issue indeed [09:56:47] volans: yeah, DHCP and all that went well, it is the loading initrd what gets stuck most of the time [09:57:09] in that case blame moritzm :D [09:57:17] XDDDD [09:57:30] wait the screen just blanked [09:57:43] That might be the reboot loop [09:58:43] not pinging anymore [09:58:48] yep might be [10:00:51] marostegui: I'm waiting to see the new reboot, just blank screen so far [10:01:11] it takes a while yeah [10:01:24] screensaver kicked in. wiggle the mouse [10:01:38] do you want to try to reimage into stretch or bullseye to see if that works or directly go looking for firmware? 
[10:02:33] I can try to reimage to stretch yeah [10:02:53] let's wait for it to fail and then I will try stretch [10:03:04] ack I'll ping you if I see the reboot [10:03:08] still blank so far [10:03:41] yeah, it took a while for it to timeout and then get the disk boot [10:11:40] does the server have up to date firmware? do we have another server with the same model, either on stretch or buster? [10:12:00] moritzm: I don't think it has up to date firmware no [10:12:34] I have created https://phabricator.wikimedia.org/T293740 for papaul [10:12:43] (i'd be surprised if _any_ db hosts had up-to-date f/w) [10:13:41] we have had the case with a Broadcom NIC that the more recent driver in 5.10 required updated firmware, we might be running into a similar issue here (but with 4.19) [10:13:48] https://phabricator.wikimedia.org/T286722 [10:14:48] They are both R440, so it might be the case [10:16:12] for https://phabricator.wikimedia.org/T286722 it also depended on the time the server was bought: [10:16:38] thanos-fe2001 (where we first saw this) was purchased in April 2020 [10:17:14] moritzm: db2112 in april 2019 [10:18:40] but copernicium (same server model, same NIC) but bought in May 2021 didn't, so with the more recent default firmware on either the system firmware or the NIC firmware, this didn't show up [10:19:16] I've subscribed to T293740, let's see if the update helps [10:19:17] T293740: Upgrade db2112 firmware/BIOS - https://phabricator.wikimedia.org/T293740 [10:20:24] marostegui: booting up now, IPMI: Boot to PXE Boot Requested by iDRAC [10:20:37] volans: sweet [10:22:08] loading initrd [10:22:25] moment of truth... :) [10:22:29] yeah, that was the case on the buster reimage too [10:22:35] and then nada [10:22:50] marostegui: boot logs [10:22:52] it's booting up [10:22:58] \o/ [10:22:59] d-i menu [10:23:06] this should work fine [10:23:07] So it might be indeed firmware+kernel [10:23:09] detaching console [10:23:40] thanks volans I will keep you posted!
[10:24:26] sorry for not being helpfup [10:24:29] *helpful [10:28:10] you always are! [12:04:15] I was reading the kafka cookbooks https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/kafka/ and realized that both the implementation but even more so the comments are really top-notch [12:04:20] see in particular https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/kafka/__init__.py [12:04:42] kudos to elukey for the excellent work [12:08:29] <3, others also contributed! [13:52:39] sigh. really not a fan of how hard it is to spot the actual error in the CI output. https://integration.wikimedia.org/ci/job/operations-puppet-tests-buster-docker/33397/console [13:55:02] <_joe_> kormat: you mean all that stuff around the actual CI run in the logs makes it hard to see where the error was printed? [13:55:39] <_joe_> like if we removed "set -x" from the wrapper script [13:57:24] _joe_: i think that falls into the necessary-but-insufficient category [13:58:02] <_joe_> kormat: to make the rest clearer, we just need to improve the rakefile :) [13:58:50] is that said in the same tone of voice as one would say "just a simple matter of programming"? [14:01:22] <_joe_> cdanis: ofc, but she might fall for it anyways [14:01:41] things which aren't much better: https://puppet-compiler.wmflabs.org/compiler1002/31761/db1099.eqiad.wmnet/change.db1099.eqiad.wmnet.err [14:02:02] _joe_: <3 [14:02:39] <_joe_> kormat: oh in that case it's just matter of deciding we know a smart way of filtering the puppet stderr output and fix the compiler :) [14:04:45] <_joe_> jokes aside, that is much harder to fix than better highlighting of errors in the rakefile [14:10:15] we could keep the colors though for the latter [14:16:07] <_joe_> volans: or we could make a filtere version that just has errors [18:37:39] legoktm: is there a task for envoy in beta? 
[18:39:33] I'm not aware of one, but it's intentionally disabled there, https://gerrit.wikimedia.org/g/operations/puppet/+/9a6889e30e939f1e9e0e906dc7927c76cf211ad5/hieradata/cloud/eqiad1/deployment-prep/common.yaml#21 [18:41:27] I have a proof of concept envoy running on deployment-taavi-envoy-test.deployment-prep.eqiad1.wikimedia.cloud, but I haven't had any motivation to work on beta recently [19:23:47] very cool maps: https://labs.ripe.net/author/emileaben/latency-into-your-network-as-seen-from-ripe-atlas/ [19:28:14] Hi, random question about ssh. I find old connections are cluttering my tmux. I'm wondering what other ppl do about this. [19:28:14] A couple options I can think of: [19:28:14] - auto reconnect ssh sessions via mosh or other (though apparently mosh doesn't work with bastions) [19:28:14] - auto close ssh sessions so I don't waste time typing into a broken pipe [19:28:14] - continue as I am now and just close these hanging sessions with ~. once I realize they are stuck [19:30:02] razzi: re "auto close ssh sessions" do you mean something like setting ServerAliveInterval in your .ssh/config ? [19:30:15] cdanis: I've never used that but perhaps? [19:30:40] I was reading https://unix.stackexchange.com/questions/280066/how-can-i-auto-close-dropped-ssh-connections and it said it needs client and server config [19:30:59] Maybe I could dig thru our puppet code to find if our ssh config supports that [19:31:07] I would start with that. I think it's enabled by default on Debian systems? but it's just a client-side setting, supported by all servers [19:31:42] 87 TCPKeepAlive yes [19:31:55] ^ from our sshd config template [19:31:58] TCP keepalives are different :) but will serve a similar purpose [19:32:07] however they won't disconnect the client side! 
[19:32:15] ServerAliveInterval is not mentioned specifically [19:32:47] razzi: I would start with just 'ServerAliveInterval 30' in your ~/.ssh/config for all hosts [19:33:00] ah ClientAliveInterval in the server config, neither [19:34:33] (the other setting that interacts with ServerAliveInterval is ServerAliveCountMax, the number of times the alive-check can fail before the client gives up -- the former defaults to 0 (not enabled) and the latter defaults to 3) [23:33:41] late to the party, but I use `ServerAliveInterval 60` in a `Host *` section of my ~/.ssh/config. When a connection drops it will print something like `client_loop: send disconnect: Broken pipe` and `Connection to $host closed by remote host.`. [23:34:09] because of the new Phab status "in progress" some tickets were disappearing from our SRE Clinic Duty Workboard where some queries just checked for "open or stalled". fixed some, might also happen in other workboards. If you have things like this use "customized query" and "any open status", it is more reliable and I confirmed "In Progress" counts as an open status.
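Note on the ServerAliveInterval suggestion above: a minimal sketch of the client-side config being discussed, assuming OpenSSH and that it should apply to every host. The values are illustrative (30 comes from the 19:32:47 message, 3 is the ServerAliveCountMax default mentioned at 19:34:33). With them, ssh should give up after roughly 90 seconds of an unresponsive server, rather than leaving a hung session in tmux.

```
# ~/.ssh/config (illustrative values, not a site-wide recommendation)
Host *
    # Send an alive-check over the encrypted channel after 30s without data from the server
    ServerAliveInterval 30
    # Disconnect after 3 consecutive unanswered checks (3 is the default)
    ServerAliveCountMax 3
```

This is purely client-side, so it should not depend on the sshd template having ClientAliveInterval set.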