[07:01:04] moritzm: good morning. Thanks for the review on https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/724461 . Do you think there might be race conditions in which we can ssh into d-i but /target is not yet there? I can safely add a sleep; given that we need to wait for d-i to complete anyway, it will not add any waiting time to the reimage process.
[07:12:04] let me check which step mounts /target, but I'm pretty sure it's before SSH is up
[07:14:24] the netcfg d-i component already operates on /target, so that should be totally reliable and race-free
[07:14:52] great! thanks for the check :)
[09:51:30] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (cmooney) Thanks for the detail @aborrero Looking at the setup the logical thing is to allocate the public IPs for these hosts from t...
[10:05:27] moritzm: puppet disabled on cumin2002, leftover or WIP?
[10:05:39] WIP, should be done in 5m
[10:05:46] no hurry, thx
[10:05:50] take your time :)
[10:07:23] I'm done, puppet is enabled again
[10:08:50] ack, thx
[10:09:52] moritzm: I'd like to test another reimage of sretest, any preference on which one I should pick? either is ok for me
[10:09:57] in case you're working on any of those
[10:19:29] either is fine with me as well, when in doubt throw a dice :-)
[10:20:31] lol thx
[10:29:29] mmmh moritzm it seems we might have a race with [[ -d "/target" ]], it returned 1 but if I connect with install_console it returns 0 as expected
[10:33:08] hmmh, that's unexpected, can I have a look via install_console, did you use 1001 or 1002?
[10:33:20] 1001
[10:33:25] is now in d-i
[10:33:28] so be quick :)
[10:33:52] on it :-)
[10:34:28] sorry, too late, we can reimage
[10:34:35] it got up with the host now
[10:34:47] let me interrupt and restart
[10:35:28] as it currently stops for my input there
[10:35:32] so I can leave it there for you moritzm
[10:35:51] rebooting now
[10:38:35] ack, thx
[10:39:00] moritzm: all yours, and take your time, it's waiting for my input. Same error
[10:39:21] actually not, sorry, d-i keeps running in the background, just the cookbook waits for me
[10:39:26] so yeah be quick :)
[10:39:32] or stop d-i
[10:41:35] no worries, I'm looking at a d-i checkout, but it's puzzling
[10:53:54] Puppet, Infrastructure-Foundations, Patch-For-Review, User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (jbond) puppetdb was fairly stable for some time, however we had to add some of the facts back into puppetdb specifically the numa and partitions...
[11:02:30] moritzm: did you find anything? should I continue with this test reimage for now?
[11:04:42] Puppet, Infrastructure-Foundations, Patch-For-Review, User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (jbond) It's also worth noting that the increase in processing times no longer aligns with [[ https://grafana.wikimedia.org/d/C0lCOf3Mz/puppetdb-p...
[11:07:15] yeah, you can continue, but it's currently not clear to me how we're unable to SSH into d-i without /target being present, a few things to maybe test would be:
[11:08:50] - we could try to repeat the test a few times with a small interval, although I can't see why, there's clearly a race :-)
[11:09:54] - I think /target is a bind mount, maybe busybox doesn't test against it with -d, maybe -e makes a difference (but then it seems to have worked fine after you logged in via install_console, so that seems unlikely)
[11:10:42] - we could test for /proc/cmdline instead of /target
[11:11:28] I can't name the option to test for off the top of my head, but there should be flags distinct from a booted system
[11:11:55] and given that cmdline gets passed to the kernel it won't be able to change/race
[11:13:42] ack I'll look into that
[11:13:42] what did we use in the old script, did no such test/race exist there?
[11:14:04] didn't check, assumed the first host up will be d-i and the second a fresh system
[11:14:17] if it was a d-i loop, it would not catch it
[11:14:21] ah, ok
[11:15:33] I'd say: let's kick off a reimage of 1001 again and capture /proc/cmdline during d-i? that's probably the most reliable lead
[11:15:53] yes was planning to do that
[11:16:58] ack
[11:18:54] Puppet, Infrastructure-Foundations, Patch-For-Review, User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (jbond) > however there is an earlier peak at ~10:15 on the 21st worth exploring This aligned to the following puppetdb log event ` 2021-09-21T...
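(Editor's note: the ideas listed above, retrying the check with a small interval and testing the kernel command line instead of /target, could be sketched roughly as follows. This is not the cookbook's actual code. The function names `is_debian_installer`, `retry` and `read_cmdline` are hypothetical, and the default marker is the "BOOT_IMAGE=debian-installer" string that comes up a few messages later in this log. The sketch assumes the check runs somewhere that can read /proc/cmdline of the target host.)

```python
import time
from typing import Callable

# Assumption: marker taken from the /proc/cmdline captured later in this log.
DI_MARKER = "BOOT_IMAGE=debian-installer"


def is_debian_installer(cmdline: str, marker: str = DI_MARKER) -> bool:
    """Return True if any kernel argument starts with the d-i marker."""
    return any(arg.startswith(marker) for arg in cmdline.split())


def retry(check: Callable[[], bool], attempts: int = 3, interval: float = 1.0) -> bool:
    """Call check() up to `attempts` times, sleeping `interval` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(interval)
    return False


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the kernel command line of the running system."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


if __name__ == "__main__":
    # Poll the local kernel command line a few times and report the result.
    in_installer = retry(lambda: is_debian_installer(read_cmdline()), attempts=3, interval=2.0)
    print("debian-installer" if in_installer else "not debian-installer")
```

(Matching whole arguments by prefix, rather than grepping for "debian-installer" anywhere in the string, is a deliberate choice in this sketch so that an unrelated argument containing that substring would not count as a match.)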
[11:21:08] moritzm: among other things
[11:21:09] BOOT_IMAGE=debian-installer/amd64/linux initrd=debian-installer/amd64/initrd.gz
[11:21:35] I guess a grep -q 'debian-installer' might work
[11:22:01] also DEBCONF_DEBUG=5 but probably less reliable
[11:23:49] or BOOT_IMAGE, not sure
[11:24:18] I meant 'BOOT_IMAGE=debian-installer' ofc
[11:26:01] ack, "BOOT_IMAGE=debian-installer" sounds promising
[11:27:42] ack, changing it after lunch
[11:31:21] enjoy lunch :-)
[12:33:54] I was checking backup errors, and I saw a warning on netmon backups
[12:35:37] Could not stat "/var/lib/librenms": ERR=No such file or directory
[12:35:46] hopefully that is expected
[12:36:17] 15 GB are being backed up from /srv/librenms
[12:52:20] patch sent
[14:19:37] CAS-SSO, Infrastructure-Foundations, Security-Team, GitLab (Auth & Access), and 2 others: Open gitlab.wikimedia.org to all users with Wikimedia developer accounts - https://phabricator.wikimedia.org/T288162 (brennen) Open→Resolved a:brennen The instance is now open to all developer ac...
[16:11:32] CAS-SSO, Infrastructure-Foundations, GitLab (Auth & Access), Release-Engineering-Team (Radar): Attempting to login to gitlab.wikimedia.org sometimes results in CAS 500 Internal Server Error - https://phabricator.wikimedia.org/T291964 (jbond) it seems like the error page sometimes gets blocked by...
[16:14:57] CAS-SSO, Infrastructure-Foundations, GitLab (Auth & Access), Release-Engineering-Team (Radar): Attempting to login to gitlab.wikimedia.org sometimes results in CAS 500 Internal Server Error - https://phabricator.wikimedia.org/T291964 (jbond) noting here that @RhinosF1 also reported this issue via...
[16:46:42] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (cmooney) Thanks for the time in the meeting today to discuss. From our chat and a few other things I've looked at we can say: - Thes...
[17:02:03] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (aborrero) Pretty much agree with everything you commented @cmooney Just a couple of clarifications: * the servers primary hostname wo...
[17:03:05] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (RobH)
[17:03:49] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (RobH)
[17:04:11] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (RobH)
[17:28:24] netops, Infrastructure-Foundations: Netbox info missing on some WMCS elements - https://phabricator.wikimedia.org/T292097 (cmooney)
[17:28:38] netops, Infrastructure-Foundations: Netbox info missing on some WMCS elements - https://phabricator.wikimedia.org/T292097 (cmooney) p:Triage→Lowest a:cmooney
[17:50:22] netops, DC-Ops, Infrastructure-Foundations, SRE: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (cmooney) Seems like we have success :) Port is now up and MAC address learnt: ` cmooney@asw-a-codfw> show ethernet-switching table | match 1/...
[18:01:19] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (cmooney) Ok great @aborrero thanks for clarifying. That all 100% fits what I had in mind, so we are on the same page. I'll discuss w...