[05:52:19] CAS-SSO, Infrastructure-Foundations, GitLab (Auth & Access), Patch-For-Review, and 2 others: Open gitlab.wikimedia.org to all users with Wikimedia developer accounts - https://phabricator.wikimedia.org/T288162 (greg)
[05:52:57] CAS-SSO, Infrastructure-Foundations, GitLab (Auth & Access), Patch-For-Review, and 2 others: Open gitlab.wikimedia.org to all users with Wikimedia developer accounts - https://phabricator.wikimedia.org/T288162 (greg)
[06:45:03] moritzm, jbond: FYI I'm reimaging both sretest hosts to test the dhcp options (auto vs hardcoded)
[06:52:22] SRE-tools, Infrastructure-Foundations: Cookbooks: convert wmf-auto-reimage scripts to Cookbooks - https://phabricator.wikimedia.org/T205885 (ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by volans@cumin2002 for host sretest1002.eqiad.wmnet
[06:52:32] SRE-tools, Infrastructure-Foundations: Cookbooks: convert wmf-auto-reimage scripts to Cookbooks - https://phabricator.wikimedia.org/T205885 (ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage executed with errors: - sretest1002 (**FAIL**) - Downtimed on Icinga - Disabled Puppet - Remov...
[06:54:58] SRE-tools, Infrastructure-Foundations: Cookbooks: convert wmf-auto-reimage scripts to Cookbooks - https://phabricator.wikimedia.org/T205885 (ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by volans@cumin2002 for host sretest1001.eqiad.wmnet
[07:00:17] ack, thanks
[07:14:52] SRE-tools, Infrastructure-Foundations: Cookbooks: convert wmf-auto-reimage scripts to Cookbooks - https://phabricator.wikimedia.org/T205885 (ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by volans@cumin2002 for host sretest1002.eqiad.wmnet
[07:21:24] SRE-tools, Infrastructure-Foundations: Cookbooks: convert wmf-auto-reimage scripts to Cookbooks - https://phabricator.wikimedia.org/T205885 (ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage completed: - sretest1001 (**PASS**) - Downtimed on Icinga - Disabled Puppet - Removed from Pup...
[07:38:58] SRE-tools, Infrastructure-Foundations: Cookbooks: convert wmf-auto-reimage scripts to Cookbooks - https://phabricator.wikimedia.org/T205885 (ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage completed: - sretest1002 (**PASS**) - Downtimed on Icinga - Disabled Puppet - Removed from Pup...
[07:39:56] moritzm: so, all worked fine (just a typo, already fixed in spicerack I need to just make a release)
[07:40:18] it's ok for you if sretest1002 is the first (and right now only) host that requires the reimage cookbook to be reimaged?
[07:40:40] as it doesn't have the dhcp record hardcoded it requires the dhcp automation of the cookbook
[07:41:49] sure, sounds good
[07:42:26] sudo cookbook sre.experimental.reimage --os bullseye sretest1002
[07:42:42] fwiw, apart the rename that will probably move the cookbook to sre.hosts.reimage today or tomorrow
[08:31:26] moritzm: AFAYK there is still need to be able to reimage something to stretch? or can I assume that will never happen?
[08:36:58] yeah, it could surely happen, e.g. if an elastic node needs to be reimaged after hardware maintenance
[08:38:12] ok, adding stretch to the possible --os too then
[08:38:44] ack
[08:41:20] as for the migration path wmf-auto-reimage -> cookbook -> automatic dhcp
[08:41:46] I'm not sure if it would be easier for the SREs to pick up the 2 changes separately or put them together in a single process change
[08:41:58] from the reimage script directly to the reimage cookbook with automatic dhcp
[08:42:01] thoughts?
[08:53:41] I'd combine them, we could simply make the old reimage script fail if used against hosts which are not configured for automatic DHCP
[08:54:08] makes it easier to adapt to a new workflow and increased use of the cookbook will also flesh out all remaining edge cases
[08:54:53] you mean migrating all dhcp pretty much at once?
[08:55:00] in that case I'll just remove the reimage script
[09:00:34] or that. I mean:
[09:01:15] realistically if there's any issues they'd be caught by the first 1-3 reimages
[09:01:24] so it's not a long time frame where things are in flux
[09:02:12] yeah I've already reimaged both sretest hosts, got manuel to reimage a db right now that just completed
[09:02:37] yeah, I'd recommend to be bold, but your call obviously :-)
[09:03:28] * volans is now known as **volans** also known as bolans :-P
[09:07:18] moritzm: do you know anything about a E: Unable to locate package megacli while reimaging?
[09:07:29] the first puppet run completed but technically had a failure for that
[09:07:32] https://puppetboard.wikimedia.org/report/db2080.codfw.wmnet/4146bbe2f90b2fa3767e30a9cb6a55232e22f0b0
[09:07:48] might be a problem of ordering, but not sure if already known
[09:10:16] it basically required a second run to complete the installation
[09:16:03] which OS?
[09:16:24] buster
[09:16:36] megacli is added by thirdparty/hwraid, this is most definitely a puppet ordering issue
[09:17:02] i.e. it tries to install it before the thirdparty/hwraid apt source is added
[09:17:07] ep
[09:17:08] yep
[09:17:53] class raid::megaraid simply uses require_package('megacli'
[09:19:13] we can't use apt::package_from_component there, but a similar exec on apt-get update and a dependency on the component should fix it
[09:20:09] or actually I think we could even simply use package::from_component
[09:20:34] IIRC it should properly handle thirdparty/hwraid being added multiple times these days
[09:23:20] ack
[09:23:24] but then we have plenty of similar cases where a role only properly works after the second puppet run, though. it's also the case for the mw roles at least
[09:23:25] I can look at it
[11:01:54] SRE-tools, Infrastructure-Foundations: Introduce Spicerack.kafka module, along with the method to transfer offset state between consumer groups and clusters - https://phabricator.wikimedia.org/T291681 (Zbyszko) > * Does this method returns anything? No, there's no need - if there are issues, there will b...
[11:11:29] Puppet, Infrastructure-Foundations: investigate how rspec parses define paramters - https://phabricator.wikimedia.org/T291374 (jbond) Open→Resolved As pointed out on the issue linked above the issue here is that the facts variable needs to be defined in the os context e.g. it was also pointed ou...
[14:55:24] SRE-tools, Infrastructure-Foundations, SRE, serviceops-radar: SVC DNS zonefiles and source of truth - https://phabricator.wikimedia.org/T270071 (Volans)
[14:56:34] SRE-tools, Infrastructure-Foundations, SRE, serviceops-radar: SVC DNS zonefiles and source of truth - https://phabricator.wikimedia.org/T270071 (Volans) >>! In T270071#7375908, @akosiaris wrote: > I think we first need to recap a bit where we are at and what is still a problem. I think some of th...
[15:47:58] Puppet, Infrastructure-Foundations: investigate how rspec parses define parameters - https://phabricator.wikimedia.org/T291374 (Aklapper)
[16:02:21] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (aborrero) a:nskaggs→aborrero We will need 2 NICs connected on these servers: * primary NIC, with a public IPv4 address, `cloud...
[16:10:29] CAS-SSO, GitLab, Infrastructure-Foundations: Attempting to login to gitlab.wikimedia.org sometimes results in CAS 500 Internal Server Error - https://phabricator.wikimedia.org/T291964 (dancy)
[16:12:06] CAS-SSO, Infrastructure-Foundations, GitLab (Auth & Access), Release-Engineering-Team (Radar): Attempting to login to gitlab.wikimedia.org sometimes results in CAS 500 Internal Server Error - https://phabricator.wikimedia.org/T291964 (brennen)
[16:22:17] jbond: by any chance around? (only if you're feeling ok)
[16:24:12] I think we might have an issue with the puppet NOOP run in the reimage.. that might (or might not) bite us in random hosts too
[16:30:51] Puppet, Cloud-VPS, Infrastructure-Foundations, Patch-For-Review, and 2 others: Audit usages or the realm variable with a view to drop it - https://phabricator.wikimedia.org/T289661 (dcaro)
[18:18:40] netops, Continuous-Integration-Infrastructure, DC-Ops, Infrastructure-Foundations, serviceops: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ) - https://phabricator.wikimedia.org/T283582 (Dzahn) @hashar This should be between netops and dcops I think.
[22:43:23] CAS-SSO, Infrastructure-Foundations, Security-Team, GitLab (Auth & Access), and 3 others: Open gitlab.wikimedia.org to all users with Wikimedia developer accounts - https://phabricator.wikimedia.org/T288162 (brennen) Tagging Security-Team for awareness, per discussion.