[09:49:05] jbond: it looks like I broke puppet-merge again with UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 1543: invalid start byte
[09:50:14] interesting, does it try to print non-utf8 data, or maybe it is so long that it gets truncated mid-multibyte?
[09:50:35] it's trying to render the output of openssl rand 64 as utf-8
[09:50:42] or openssl rand 16 in this case
[09:51:04] and of course it isn't guaranteed to be utf-8
[09:51:49] this? https://gerrit.wikimedia.org/r/c/operations/puppet/+/824769/36/modules/varnish/files/tests/dp.daily.key
[09:51:50] vgutierrez: ack, one sec, i have a patch, let me quickly update it and merge now
[09:51:56] jbond: thx <3
[09:51:58] jynus: yep
[09:54:21] steve_munene: moritzm: ok to merge your changes as well?
[09:54:27] (errors="ignore")? or would it be dangerous for security reasons?
[09:54:39] yes please do
[09:54:50] jynus: see https://gerrit.wikimedia.org/r/c/operations/puppet/+/886006/1/modules/puppetmaster/files/merge_cli/puppet-merge.py#101
[09:55:00] i have decided to just skip for now with a warning
[09:55:03] ack, go ahead please
[09:56:18] moritzm: vgutierrez: hopefully the fix is now deployed, ping me if you still see issues
[09:56:37] jbond: cheers
[09:56:43] np
[09:56:45] jbond: great idea!
[09:57:22] as that way it is not possible to sneak weird stuff in binary
[09:58:55] lovely.. I'm an idiot :)
[09:59:06] vgutierrez: why?
[10:00:45] just broke varnish
[10:01:05] :-(
[10:06:21] luckily I disabled puppet first on cp nodes ;P
[10:26:35] slyngs o/ - I checked https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/865057 and I am wondering how close it is to prime time, I'd really like to use it in another cookbook to reimage k8s nodes
[10:28:33] There's still a bug in the handling of DHCP, and I honestly haven't tested it yet. Someone may or may not have attempted to use it and ran into the DHCP issue
[10:29:44] I can set aside some time next week and see if I can resolve the problem, and then we can test it
[10:30:34] ack thanks! I think that Janis may have already tested/used it since I saw it mentioned in https://phabricator.wikimedia.org/T326340 (with a custom cookbook config)
[10:31:09] I am testing it as well in a local checkout on cumin2002, but I keep getting NetboxHostNotFoundError
[10:34:18] I'll just add that to the Phabricator task, so I don't forget
[10:35:27] I can add more details to the task as well
[10:36:23] We're using this task: https://phabricator.wikimedia.org/T306661
[10:37:42] just posted the error :)
[10:38:00] it may be due to my local cookbook checkout config though
[10:43:03] I see a specific netbox config with tokens under /etc/spicerack/netbox on cumin nodes, this is probably the issue
[10:44:12] I'll need to check, it's my first cookbook, so I don't have the full understanding of the functionality yet.
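For context on the 09:54-09:55 exchange above, here is a minimal Python sketch of the two options discussed for puppet-merge: decoding with errors="ignore" versus skipping the non-UTF-8 content with a warning (the approach jbond went with). This is illustrative only, not the actual puppet-merge.py patch; the helper name and path handling are hypothetical.

```python
import sys


def safe_read_text(path):
    """Return the file's contents as UTF-8 text, or None if decoding fails.

    Sketch of the behaviour discussed above: rather than decoding with
    errors="ignore" (which could silently let odd bytes through), skip the
    file and warn. Hypothetical helper, not the real puppet-merge.py code.
    """
    raw = open(path, "rb").read()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        print(f"WARNING: skipping {path}: not valid UTF-8 ({exc})", file=sys.stderr)
        return None
```

A binary key generated with `openssl rand 16` (like the dp.daily.key file linked at 09:51) would take the except branch instead of surfacing the original UnicodeDecodeError.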
[10:57:03] elukey: from my reading of the spicerack code it seems that error should throw if the host the cookbook is targeting can't be found in Netbox
[10:57:14] https://github.com/wikimedia/operations-software-spicerack/blob/b7e5e009399b4e611a6b5d8149cba11f1c9de373/spicerack/netbox.py#L78
[10:58:24] topranks: o/ yes yes but it should be there, I am testing a "local" checkout of cookbooks in my home dir on cumin1002, I suspect that for some reason I am missing the config bits for netbox (so it fails to retrieve the node)
[10:58:46] otherwise I can't explain it
[10:58:47] :(
[10:59:06] I'd have thought it'd throw a different error then, like NetboxError or NetboxAPIError
[10:59:20] from the code I think it can only happen if Netbox gives it a valid "not found" message back
[10:59:40] one thing to be aware of, not sure if it's relevant, is that "hosts" (physical servers) and VMs are different in Netbox
[10:59:54] ml-staging-etcd2001 does exist, but it is a VM obviously
[11:00:40] topranks: yep yep!
[11:00:41] so spicerack needs to run _fetch_virtual_machine(), not _fetch_host() to retrieve it
[11:02:01] topranks: I am basically checking https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/865057/10/cookbooks/sre/ganeti/reimage.py#100
[11:02:31] that in theory should handle both virtual and bare metal
[11:02:42] https://github.com/wikimedia/operations-software-spicerack/blob/a8da7a1f9f444fba7fa3279328f69d3822add2c3/spicerack/__init__.py#L633
[11:04:56] hmm yep
[11:05:15] I'm clutching at straws here, a poor substitute for v.olans tbh :)
[11:05:18] Looking at this:
[11:05:19] https://github.com/wikimedia/operations-software-spicerack/blob/a8da7a1f9f444fba7fa3279328f69d3822add2c3/spicerack/__init__.py#L633
[11:06:10] It says not to use the fqdn, wonder if it's something simple like that
[11:06:54] So just ending in codfw?
[11:07:32] not even, the Netbox query would have to be on 'ml-staging-etcd2001' alone, nothing after
[11:07:42] * elukey cries in a corner
[11:07:52] topranks: yes this is the issue
[11:08:14] boo for it being this trivial, yay for finding the issue.
[11:08:33] I'll take that any day :)
[11:09:43] there definitely are worse scenarios :)
[11:10:21] thanks a lot :)
[11:17:15] topranks: now I have a new error but progress :)
[11:17:20] <#
[11:17:21] <3
[11:24:18] :D
[11:26:45] all right, I was able to run the cookbook correctly, Janis applied changes to it (basically the comments in the code review)
[11:26:53] and now my cookbook (that calls that one) works :)
[11:26:55] thanks allll
[11:35:13] woot!
[11:37:16] well done, by the rules of sysadmin jenga, you now own all cookbook bugs ;-)
[12:30:43] Hey y'all, I'm doing the rounds for T327920 / T328287 (Datacenter switchover), have there been major changes that would make sense to be communicated to the community (apart from multiDC of course)?
[12:30:44] T328287: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287
[12:30:44] T327920: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920
[12:31:54] I'm also going through the Datacenter-Switchover backlog so expect some amount of "Is this necessary?" spam.
[12:42:17] claime: I have one question - will the switchover be done by 1: stop reading on codfw, then 2: switching codfw as the primary, or switching first and then depooling eqiad?
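As an aside, the short-name-vs-FQDN lesson from the 11:06-11:07 exchange can be illustrated with a rough pynetbox sketch; the URL and token below are placeholders, and spicerack wraps the equivalent logic in its own _fetch_host()/_fetch_virtual_machine() helpers rather than doing it this way directly.

```python
import pynetbox

# Placeholder URL/token; on the cumin hosts the real token lives in the
# config under /etc/spicerack/netbox mentioned at 10:43.
nb = pynetbox.api("https://netbox.example.org", token="REDACTED")

# The lookup must use the short host name: querying the FQDN
# "ml-staging-etcd2001.codfw.wmnet" finds nothing.
name = "ml-staging-etcd2001.codfw.wmnet".split(".")[0]

# Physical servers and VMs live in different Netbox endpoints, so a generic
# lookup tries both, roughly what spicerack's _fetch_host() /
# _fetch_virtual_machine() do.
server = (nb.dcim.devices.get(name=name)
          or nb.virtualization.virtual_machines.get(name=name))
if server is None:
    # spicerack surfaces this case as NetboxHostNotFoundError
    raise RuntimeError(f"{name} not found in Netbox")
```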
[12:44:52] The switchover will be done by following https://wikitech.wikimedia.org/wiki/Switch_Datacenter#MediaWiki so read-only, pool both, flip WMFMasterDatacenter switch, depool eqiad, rw afaict
[12:45:50] interesting, so eqiad will be depooled before setting rw
[12:48:14] "Invert Redis replication for MediaWiki sessions" is that outdated?
[12:48:21] It is
[12:48:35] and other stuff, x2, but I don't know much about that
[12:48:42] in theory it is a noop
[12:49:23] (I am thinking of things that may be different since last time)
[12:50:22] I think sessions and x2 were what replaced Redis, but don't trust me 100%
[12:50:44] s/sessions/sessionstore/
[12:52:34] claime: I just realized that we moved the wikitech databases from m5 to s6 since the last switchover, but wikitech is still hosted on labweb* servers in eqiad only. The multi-dc work means that it should Just Work, but you might want to double-check that beforehand
[12:55:09] taavi: Thanks for the heads up, adding a phab to check it
[12:55:53] in theory multimedia is completely separate, but I would check with Emperor the swiftrepl status, in case something would need to change with the new status
[12:56:05] *new rclone method
[12:56:51] I have that answer already: "Swift replication (swiftrepl as-was, now rclone) runs weekly (when I'm not trying to debug it) on Monday UTC-morning; the script that's run compares confctl --object-type mwconfig select 'name=WMFMasterDatacenter' get to /etc/wikimedia-cluster when deciding to proceed or not."
[12:57:33] <3
[12:57:49] So I don't have to bother :D
[13:00:29] claime: nice!
[13:43:19] elukey: So the reimaging cookbook more or less works.... WOOT :-)
[13:45:01] slyngs: haven't tested it yet but I think that Janis used it last week, manually adding the changes from the comments, and it worked
[13:46:08] Very nice, I'll try and collect everything and get it in a reviewable state next week
[13:51:08] super
[13:57:04] really looking forward to it here as well -- will be such a welcome addition
[13:57:07] thanks for all the work!
[14:14:52] jbond: thanks a lot for the review!
[14:16:25] elukey: no problem
[16:04:03] FYI I'm leaving the removal of disc_desired_state up for discussion until next week https://gerrit.wikimedia.org/r/c/operations/puppet/+/886069 then I'm burning it.
[22:41:02] brett: any thoughts on how to implement the final actionable on https://wikitech.wikimedia.org/wiki/Incidents/2023-01-24_sessionstore_quorum_issues#Actionables? Basically, until the root cause has been identified and corrected, we should de-pool the affected datacenter any time a host needs to be rebooted (or before bringing it back up, in the event it went down unscheduled).
[22:41:42] Maybe an email to sre-at-large?
[22:55:02] That sounds good to me!
[22:55:18] and maybe a message in the SAL?
[22:55:33] ^ urandom
[22:55:48] a message in SAL?
[22:56:05] https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:56:41] I guess the depool will already announce it
[22:57:01] brett: right, sorry, I'm wondering what that would look like. I mean: to let everyone know how to handle maintenance tasks (actually, just reboots), until it's probably solved
[22:57:05] oh, I see
[22:57:21] yeah, depooling should include a SAL entry
[22:57:49] Oh, I misread. Yeah, I think an email would be most appropriate :)
[22:58:03] s/until it's probably solved/until it's properly solved/
[22:58:24] properly solving would be better than probably solving :)
[22:58:41] ^^
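For reference, here is a rough Python sketch of the gate Emperor describes at 12:56 (the rclone replication run proceeds only when the local datacenter matches WMFMasterDatacenter). The confctl invocation is taken verbatim from that quote; the output handling is an assumption, it just looks for the local DC name in the command output rather than relying on a specific format.

```python
import pathlib
import subprocess


def local_dc_is_master():
    """Proceed-or-not check: compare /etc/wikimedia-cluster against the
    WMFMasterDatacenter value reported by confctl. Sketch only; the real
    replication script may parse the output differently."""
    local_dc = pathlib.Path("/etc/wikimedia-cluster").read_text().strip()
    result = subprocess.run(
        ["confctl", "--object-type", "mwconfig",
         "select", "name=WMFMasterDatacenter", "get"],
        capture_output=True, text=True, check=True,
    )
    return local_dc in result.stdout
```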