[09:49:05] jbond: it looks like I broke puppet-merge again with UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 1543: invalid start byte
[09:50:14] interesting, does it try to print non-utf8 data, or maybe it is so long that it gets truncated mid-multibyte?
[09:50:35] it's trying to render the output of openssl rand 64 as utf-8
[09:50:42] or openssl rand 16 in this case
[09:51:04] and of course it isn't guaranteed to be utf-8
[09:51:49] this? https://gerrit.wikimedia.org/r/c/operations/puppet/+/824769/36/modules/varnish/files/tests/dp.daily.key
[09:51:50] vgutierrez: ack, one sec, i have a patch, let me quickly update it and merge now
[09:51:56] jbond: thx <3
[09:51:58] jynus: yep
[09:54:21] steve_munene: moritzm: ok to merge your changes as well?
[09:54:27] (errors="ignore")? or would it be dangerous for security reasons?
[09:54:39] yes please do
[09:54:50] jynus: see https://gerrit.wikimedia.org/r/c/operations/puppet/+/886006/1/modules/puppetmaster/files/merge_cli/puppet-merge.py#101
[09:55:00] i have decided to just skip for now with a warning
[09:55:03] ack, go ahead please
[09:56:18] moritzm: vgutierrez: hopefully the fix is now deployed, ping me if you still see issues
[09:56:37] jbond: cheers
[09:56:43] np
[09:56:45] jbond: great idea!
[09:57:22] as that way it is not possible to sneak weird stuff in binary
[09:58:55] lovely.. I'm an idiot :)
[09:59:06] vgutierrez: why?
[10:00:45] just broke varnish
[10:01:05] :-(
[10:06:21] luckily I disabled puppet first on cp nodes ;P
[10:26:35] slyngs o/ - I checked https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/865057 and I am wondering how close it is to prime time, I'd really like to use it in another cookbook to reimage k8s nodes
[10:28:33] There's still a bug in the handling of DHCP, and I honestly haven't tested it yet. Someone may or may not have attempted to use it and ran into the DHCP issue
[10:29:44] I can set aside some time next week and see if I can resolve the problem, and then we can test it
[10:30:34] ack thanks! I think that Janis may have already tested/used it since I saw it mentioned in https://phabricator.wikimedia.org/T326340 (with a custom cookbook config)
[10:31:09] I am testing it as well in a local checkout on cumin2002, but I keep getting NetboxHostNotFoundError
[10:34:18] I'll just add that to the Phabricator task, so I don't forget
[10:35:27] I can add more details to the task as well
[10:36:23] We're using this task: https://phabricator.wikimedia.org/T306661
[10:37:42] just posted the error :)
[10:38:00] it may be due to my local cookbook checkout config though
[10:43:03] I see a specific netbox config with tokens under /etc/spicerack/netbox on cumin nodes, this is probably the issue
[10:44:12] I'll need to check, it's my first cookbook, so I don't have the full understanding of the functionality yet.
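For context on the 09:54-09:55 exchange above, here is a minimal Python sketch of the two options discussed for puppet-merge: decoding with errors="ignore" versus skipping the non-UTF-8 content with a warning (the approach jbond went with). This is illustrative only, not the actual puppet-merge.py patch; the helper name and path handling are hypothetical.

```python
import sys


def safe_read_text(path):
    """Return the file's contents as UTF-8 text, or None if decoding fails.

    Sketch of the behaviour discussed above: rather than decoding with
    errors="ignore" (which could silently let odd bytes through), skip the
    file and warn. Hypothetical helper, not the real puppet-merge.py code.
    """
    raw = open(path, "rb").read()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        print(f"WARNING: skipping {path}: not valid UTF-8 ({exc})", file=sys.stderr)
        return None
```

A binary key generated with `openssl rand 16` (like the dp.daily.key file linked at 09:51) would take the except branch instead of surfacing the original UnicodeDecodeError.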
[10:57:03] elukey: from my reading of the spicerack code it seems that error should throw if the host the cookbook is targeting can't be found in Netbox
[10:57:14] https://github.com/wikimedia/operations-software-spicerack/blob/b7e5e009399b4e611a6b5d8149cba11f1c9de373/spicerack/netbox.py#L78
[10:58:24] topranks: o/ yes yes but it should be there, I am testing a "local" checkout of cookbooks in my home dir on cumin1002, I suspect that for some reason I am missing the config bits for netbox (so it fails to retrieve the node)
[10:58:46] otherwise I can't explain it
[10:58:47] :(
[10:59:06] I'd have thought it'd throw a different error then, like NetboxError or NetboxAPIError
[10:59:20] from the code I think it can only happen if Netbox gives it a valid "not found" message back
[10:59:40] one thing to be aware of, not sure if it's relevant, is that "hosts" (physical servers) and VMs are different in Netbox
[10:59:54] ml-staging-etcd2001 does exist, but it is a VM obviously
[11:00:40] topranks: yep yep!
[11:00:41] so spicerack needs to run _fetch_virtual_machine(), not _fetch_host() to retrieve it
[11:02:01] topranks: I am basically checking https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/865057/10/cookbooks/sre/ganeti/reimage.py#100
[11:02:31] that in theory should handle both virtual and bare metal
[11:02:42] https://github.com/wikimedia/operations-software-spicerack/blob/a8da7a1f9f444fba7fa3279328f69d3822add2c3/spicerack/__init__.py#L633
[11:04:56] hmm yep
[11:05:15] I'm clutching at straws here, a poor substitute for v.olans tbh :)
[11:05:18] Looking at this:
[11:05:19] https://github.com/wikimedia/operations-software-spicerack/blob/a8da7a1f9f444fba7fa3279328f69d3822add2c3/spicerack/__init__.py#L633
[11:06:10] It says not to use the fqdn, wonder if it's something simple like that
[11:06:54] So just ending in codfw?
[11:07:32] not even, the Netbox query would have to be on 'ml-staging-etcd2001' alone, nothing after
[11:07:42] * elukey cries in a corner
[11:07:52] topranks: yes this is the issue
[11:08:14] boo for it being this trivial, yay for finding the issue.
[11:08:33] I'll take that any day :)
[11:09:43] there definitely are worse scenarios :)
[11:10:21] thanks a lot :)
[11:17:15] topranks: now I have a new error but progress :)
[11:17:20] <#
[11:17:21] <3
[11:24:18] :D
[11:26:45] all right, I was able to run the cookbook correctly, Janis applied changes to it (basically the comments in the code review)
[11:26:53] and now my cookbook (that calls that one) works :)
[11:26:55] thanks allll
[11:35:13] woot!
[11:37:16] well done, by the rules of sysadmin jenga, you now own all cookbook bugs ;-)
[12:30:43] Hey y'all, I'm doing the rounds for T327920 / T328287 (Datacenter switchover), have there been major changes that would make sense to be communicated to the community (apart from multiDC of course)?
[12:30:44] T328287: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287
[12:30:44] T327920: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920
[12:31:54] I'm also going through the Datacenter-Switchover backlog so expect some amount of "Is this necessary?" spam.
[12:42:17] claime: I have one question - will the switchover be done by 1: stop reading on codfw, then 2: switching codfw as the primary, or switching first and then depooling eqiad?
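As an aside, the short-name-vs-FQDN lesson from the 11:06-11:07 exchange can be illustrated with a rough pynetbox sketch; the URL and token below are placeholders, and spicerack wraps the equivalent logic in its own _fetch_host()/_fetch_virtual_machine() helpers rather than doing it this way directly.

```python
import pynetbox

# Placeholder URL/token; on the cumin hosts the real token lives in the
# config under /etc/spicerack/netbox mentioned at 10:43.
nb = pynetbox.api("https://netbox.example.org", token="REDACTED")

# The lookup must use the short host name: querying the FQDN
# "ml-staging-etcd2001.codfw.wmnet" finds nothing.
name = "ml-staging-etcd2001.codfw.wmnet".split(".")[0]

# Physical servers and VMs live in different Netbox endpoints, so a generic
# lookup tries both, roughly what spicerack's _fetch_host() /
# _fetch_virtual_machine() do.
server = (nb.dcim.devices.get(name=name)
          or nb.virtualization.virtual_machines.get(name=name))
if server is None:
    # spicerack surfaces this case as NetboxHostNotFoundError
    raise RuntimeError(f"{name} not found in Netbox")
```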
[12:44:52] The switchover will be done by following https://wikitech.wikimedia.org/wiki/Switch_Datacenter#MediaWiki so read-only, pool both, flip WMFMasterDatacenter switch, depool eqiad, rw afaict
[12:45:50] interesting, so eqiad will be depooled before setting rw
[12:48:14] "Invert Redis replication for MediaWiki sessions" is that outdated?
[12:48:21] It is
[12:48:35] and other stuff, x2, but I don't know much about that
[12:48:42] in theory it is a noop
[12:49:23] (I am thinking of things that may be different since last time)
[12:50:22] I think sessions and x2 were what replaced Redis, but don't trust me 100%
[12:50:44] s/sessions/sessionstore/
[12:52:34] claime: I just realized that we moved the wikitech databases from m5 to s6 since the last switchover, but wikitech is still hosted on labweb* servers in eqiad only. The multi-dc work means that it should Just Work, but you might want to double-check that beforehand
[12:55:09] taavi: Thanks for the heads up, adding a phab to check it
[12:55:53] in theory multimedia is completely separate, but I would check with Emperor the swiftrepl status, in case something would need to change with the new status
[12:56:05] *new rclone method
[12:56:51] I have that answer already: "Swift replication (swiftrepl as-was, now rclone) runs weekly (when I'm not trying to debug it) on Monday UTC-morning; the script that's run compares confctl --object-type mwconfig select 'name=WMFMasterDatacenter' get to /etc/wikimedia-cluster when deciding to proceed or not."
[12:57:33] <3
[12:57:49] So I don't have to bother :D
[13:00:29] claime: nice!
[13:43:19] elukey: So the reimaging cookbook more or less works.... WOOT :-)
[13:45:01] slyngs: haven't tested it yet but I think that Janis used it last week, manually adding the changes from the comments, and it worked
[13:46:08] Very nice, I'll try and collect everything and get it in a reviewable state next week
[13:51:08] super
[13:57:04] really looking forward to it here as well -- will be such a welcome addition
[13:57:07] thanks for all the work!
[14:14:52] jbond: thanks a lot for the review!
[14:16:25] elukey: no problem
[16:04:03] FYI I'm leaving the removal of disc_desired_state up for discussion until next week https://gerrit.wikimedia.org/r/c/operations/puppet/+/886069 then I'm burning it.
[22:41:02] brett: any thoughts on how to implement the final actionable on https://wikitech.wikimedia.org/wiki/Incidents/2023-01-24_sessionstore_quorum_issues#Actionables? Basically, until the root cause has been identified and corrected, we should de-pool the affected datacenter any time a host needs to be rebooted (or before bringing it back up, in the event it went down unscheduled).
[22:41:42] Maybe an email to sre-at-large?
[22:55:02] That sounds good to me!
[22:55:18] and maybe a message in the SAL?
[22:55:33] ^ urandom
[22:55:48] a message in SAL?
[22:56:05] https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:56:41] I guess the depool will already announce it
[22:57:01] brett: right, sorry, I'm wondering what that would look like. I mean: to let everyone know how to handle maintenance tasks (actually, just reboots), until it's probably solved
[22:57:05] oh, I see
[22:57:21] yeah, depooling should include a SAL entry
[22:57:49] Oh, I misread. Yeah, I think an email would be most appropriate :)
[22:58:03] s/until it's probably solved/until it's properly solved/
[22:58:24] properly solving would be better than probably solving :)
[22:58:41] ^^
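For reference, here is a rough Python sketch of the gate Emperor describes at 12:56 (the rclone replication run proceeds only when the local datacenter matches WMFMasterDatacenter). The confctl invocation is taken verbatim from that quote; the output handling is an assumption, it just looks for the local DC name in the command output rather than relying on a specific format.

```python
import pathlib
import subprocess


def local_dc_is_master():
    """Proceed-or-not check: compare /etc/wikimedia-cluster against the
    WMFMasterDatacenter value reported by confctl. Sketch only; the real
    replication script may parse the output differently."""
    local_dc = pathlib.Path("/etc/wikimedia-cluster").read_text().strip()
    result = subprocess.run(
        ["confctl", "--object-type", "mwconfig",
         "select", "name=WMFMasterDatacenter", "get"],
        capture_output=True, text=True, check=True,
    )
    return local_dc in result.stdout
```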