[09:02:29] may someone puppet-merge for me a few Apache redirects for doc.wikimedia.org please? https://gerrit.wikimedia.org/r/c/operations/puppet/+/824542
[09:03:07] the few `Redirect` statements do not work due to some rewrite rules overlapping them. I have tested it locally and got the expected behavior
[09:13:50] <_joe_> hashar: maybe not on a friday?
[09:13:57] <_joe_> or is it broken now?
[09:14:14] the two existing redirects are broken (they yield a 404 instead)
[09:14:17] which I broke several months ago
[09:14:42] the other one is to redirect https://doc.wikimedia.org/mw-tools-scap/ to https://doc.wikimedia.org/scap/
[09:14:56] the first is no longer updated; the second is the new location and the one CI pushes to
[09:15:07] so the patch has limited effect ;)
[10:57:08] <_joe_> btullis: if you find things missing from the docs to spin up a k8s server, please integrate the docs :)
[10:57:31] <_joe_> (see your last change to labs/private where you had to guess/copy from the other stanzas)
[10:59:12] _joe_: Thanks, I will do my best. There are a couple of places where I'm scratching my head at the moment. I'll definitely update the docs when I think I can improve them :-)
[10:59:29] this is more of a mw question, but hopefully someone here knows: how could I easily know if something that uses cache ends up in Memcache vs Redis vs MariaDB?
[10:59:31] <_joe_> ask when in doubt!
[11:00:04] <_joe_> jynus: know based on what preexisting info?
[11:00:34] mostly by looking at code of core or an extension
[11:00:50] <_joe_> typically, mediawiki-config will have configuration that wires a cache backend to the extension
[11:01:05] is this static, once configured?
[11:01:09] <_joe_> if the extension is using bagofstuff as an interface
[11:01:27] e.g. I will see something like "Localcache" and by looking at config I will know it is memcache?
[11:01:35] <_joe_> yes
[11:01:37] ok
[11:01:47] <_joe_> basically
[11:01:50] so WANCache is x2, is my guess?
[11:02:52] yeah, I saw the bagofstuff reference, but as I understood it, that was a virtual interface that can be implemented with several backends?
[11:03:24] <_joe_> no, wancache is replicated memcached
[11:03:29] <_joe_> mainstash is x2
[11:03:35] ok, so that is my confusion
[11:03:46] <_joe_> both have a bagostuff interface
[11:04:12] <_joe_> so often an extension just uses a bagostuff or wancache or localcache interface
[11:04:13] I see, I will have a look at config to see when one or the other is used
[11:04:17] <_joe_> that you can inject via config
[11:04:36] thanks, your pointer was enough for me to look at the config on my own
[11:04:43] <_joe_> ack
[11:04:53] <_joe_> start from wmf-config/mc.php for instance
[11:05:03] <_joe_> that's where we set up the various memcached-based caches
[11:05:06] it is just that things like objectcache were too ambiguous to me without context
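To make the wiring described above concrete, here is a minimal illustrative PHP sketch, not the actual contents of wmf-config/mc.php: the backend names and the server address are made up, but the pattern of defining named BagOStuff backends in configuration and letting callers ask for a role rather than a concrete class is the one being discussed.

```php
<?php
// Illustrative sketch only (assumed names, not production settings).

// 1) Define backends by name; each maps to a BagOStuff implementation.
$wgObjectCaches['memcached-example'] = [
    'class'   => MemcachedPeclBagOStuff::class,
    'servers' => [ '127.0.0.1:11211' ],   // assumption: a local memcached
];
$wgObjectCaches['db-mainstash-example'] = [
    'class' => SqlBagOStuff::class,        // e.g. a MariaDB-backed store like x2
];

// 2) Point a well-known role at one of those names.
$wgMainStash = 'db-mainstash-example';      // MainStash (x2 in production)

// 3) Extension/core code usually asks for a role or a named cache,
//    not a concrete backend class:
$services  = MediaWiki\MediaWikiServices::getInstance();
$wanCache  = $services->getMainWANObjectCache();   // replicated memcached in prod
$mainStash = $services->getMainObjectStash();      // the x2 store in prod
$byName    = ObjectCache::getInstance( 'memcached-example' ); // a plain BagOStuff
```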
[11:19:59] I have a question related to this new kubernetes cluster. Specifically, it's about the cfssl-issuer profile mentioned here: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/824163/1/helmfile.d/admin_ng/values/dse-k8s-eqiad/cfssl-issuer-values.yaml
[11:20:20] I can see that this references the values in multirootca.yaml here: https://github.com/wikimedia/puppet/blob/production/hieradata/role/common/pki/multirootca.yaml#L17
[11:20:48] <_joe_> jayme, jbond ^^ I think you're the best people to answer
[11:22:26] And there are some matching values in the private repo in `/hieradata/role/common/pki/multirootca.yaml`, but I don't know if I just need to create a new entry here, or if the `key` value relates to something else.
[11:30:45] btullis: you will need to create a new profile in the public repo specifying the policy and the key name to use. You then need to create a matching key in the private repo with that name, e.g. https://phabricator.wikimedia.org/P32585
[11:31:45] btullis: however, keep in mind that cfssl-issuer is for issuing certificates to the containers, not the control plane
[11:35:13] jbond: Thanks for that. Yes, I've reverted to using cergen for the control plane hosts, after discussion with jayme. That key in the private repo (https://phabricator.wikimedia.org/P32585$10) is just any old hex string of 16 characters that I make up, right?
[11:36:21] btullis: yes, I use the following: `openssl rand -hex 8`
[11:37:34] Great, thanks. I'll go ahead with this and also update the docs here to include these steps for creating a new signing profile: https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#helmfile.d_structure
[11:38:33] thanks <3
[11:41:11] <_joe_> btullis: awesome, thanks :)
[12:36:19] sorry, was out for lunch. Thanks for writing the docs indeed
[12:40:20] A pleasure :-) I'm looking at the calico rows E-F configuration now (here: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/824163/1/helmfile.d/admin_ng/values/dse-k8s-eqiad/calico-values.yaml#22) and I'll ask again if I can't work it out.
[12:41:28] I think we should probably move the docs to https://wikitech.wikimedia.org/wiki/PKI/CA_Operations as it's ultimately a PKI operation and nothing special to kubernetes clusters
[12:42:34] Regarding deployment: you should merge the new profile before bootstrapping the cluster (obviously :)), but it's a totally independent change otherwise
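A minimal sketch of the key-generation step described above; the example output and the exact hiera layout are assumptions, only the command and the "16 hex characters" requirement come from the discussion.

```bash
# Generate the shared auth key: 8 random bytes printed as 16 hex characters,
# matching the "any old hex string of 16 characters" noted above.
openssl rand -hex 8
# Example output (made up): 3f9c1a7b2d4e8f06
#
# Add the resulting value to the private repo under
# hieradata/role/common/pki/multirootca.yaml, using the same key name that the
# new signing profile in the public repo refers to
# (https://phabricator.wikimedia.org/P32585 shows the expected shape).
```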
[12:45:14] I'm seeing a strange diff in the package list of a single host (cloudcephosd1033) - hdparm is not installed, though it is in base_packages.txt -- but I'm not sure when/where that file is parsed
[12:57:35] _joe_: https://usercontent.irccloud-cdn.com/file/bYm6mMCH/redis_session_utilization.png
[13:06:25] 🎉
[13:06:30] feel free to burn down any references to nutcracker and k8s equiv for that port
[13:07:15] details at https://phabricator.wikimedia.org/T314453#8168858
[13:07:47] decom at https://phabricator.wikimedia.org/T267581
[13:10:14] dhinus: It looks like that file is only used if we manually run `/usr/local/sbin/apt-audit-installed` - standard packages to be installed are here: https://github.com/wikimedia/puppet/blob/production/modules/base/manifests/standard_packages.pp
[13:11:12] Maybe hdparm used to be installed automatically with older debian versions, but isn't now?
[13:11:38] ah-ha, thanks. Looks like there are some other packages that just get installed by d-i and are not included in standard_packages.pp
[13:11:57] it's not a debian version thing, though, because other servers with the same debian version include hdparm
[13:12:24] <_joe_> Krinkle: so I can just turn off redis now before it's too late and you can rollback?
[13:12:55] _joe_: assuming mcrouter shows no problems, yes.
[13:13:17] whether we want to try that on Friday is your call
[13:13:28] btullis: I wonder if it's possible that a single package failed to install for (reasons) and the installation completed anyway?
[13:14:15] I don't know if nutcracker will make noise if a host is unavailable when there are no incoming queries
[13:14:33] <_joe_> Krinkle: nah, let's do it next week
[13:15:09] not unlike thumbor, it turns out the majority of activity on redis was essentially 404s
[13:15:32] we're looking up randomly generated CP offsets even if there is no cookie incoming.
[13:15:38] on every page view
[13:15:49] chronology protector?
[13:15:56] dhinus: It will get pulled in automatically by the ceph-base package, when that is installed.
[13:15:58] the cookie being the cpPosIndex that tells us you had a previously saved offset in CP
[13:15:59] yeah
[13:16:14] i wasn't sure if you meant chronology protector, changeprop, or something else
[13:16:18] https://www.irccloud.com/pastebin/KZN80u9w/
[13:16:18] :)
[13:17:42] cdanis: when instantiating the CP class, we use the cookie if there is one and otherwise generate a random ID. If you make writes, we'll use the ID to save an offset to CP-store (now memcached, so far redis) and emit a cookie with that ID for it to find on the next req.
[13:18:06] but the logic for reading the offset and potentially waiting for replication... doesn't care whether the ID came from a cookie or was just generated and thus can't possibly have anything in it
[13:18:23] it's something like 0.1ms but still something
[13:18:49] I've worked much harder to shave off 0.1ms
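As an illustration of the read-path behaviour Krinkle describes, a rough PHP-style sketch; this is not the actual ChronologyProtector code. The cookie name cpPosIndex comes from the discussion, while $store, $didWrites, waitForReplicationToReach() and getCurrentPrimaryPositions() are made-up stand-ins.

```php
<?php
// Illustrative pseudocode only (assumed helper names, not MediaWiki core code).

$clientId  = $_COOKIE['cpPosIndex'] ?? null;
$hadCookie = ( $clientId !== null );
if ( !$hadCookie ) {
    // No prior write recorded for this client; an ID is generated anyway.
    $clientId = bin2hex( random_bytes( 8 ) );
}

// Read path, on every request: before the fix, this lookup ran even when the
// ID was freshly generated and therefore could not possibly have stored data.
$positions = $store->get( "cp:$clientId" );   // guaranteed miss without a cookie
if ( $positions ) {
    waitForReplicationToReach( $positions );  // stand-in for the real wait logic
}

// Write path: only after database writes are positions saved and the cookie set.
if ( $didWrites ) {
    $store->set( "cp:$clientId", getCurrentPrimaryPositions(), 60 );
    setcookie( 'cpPosIndex', $clientId );
}
```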
[13:19:58] btullis: true, but still doesn't explain why it's installed on e.g. cloudcephosd1032, where ceph is not installed yet
[13:20:38] entirely possible it was installed with cumin though, while debugging this new group of servers
[13:22:20] I was expecting puppet to ensure the entire package list was in sync, but maybe that would cause too many problems
[13:23:21] puppet only adds packages, never removes them unless something is very explicit
[13:23:40] <_joe_> dhinus: puppet's package resource is atomic, meaning it only handles things you explicitly name in your manifests
[13:23:55] <_joe_> and every single package is treated as its own thing
[13:24:17] you can see in standard_packages.pp linked earlier some ensure=>absent and some ensure=>purged
[13:24:29] <_joe_> also not all providers of the package resource offer the instances interface that puppet would need to use to enforce global state
[13:24:29] which is the only way puppet will remove a package, or in general, a resource
[13:25:09] <_joe_> (for instance, we have package resources that use a provider different from "apt", the standard under debian - one example being our "scap3" provider)
[13:25:18] thanks everyone, that's much clearer now :)
[14:13:38] andrewbogott: That's true for every resource in puppet actually
[14:14:07] If you declare a file as ensure => present, run the agent, and remove the declaration from the manifest, the file will stay
[14:14:24] You need to explicitly define it as ensure => absent for puppet to remove the file
[14:14:29] claime: true, although we somewhat often actively absent resources in puppet defs but seldom absent packages. (Not really a fact about puppet, more about our use of it here)
[14:14:42] andrewbogott: Fair enough :)
[14:15:19] <_joe_> claime: with the exception of directory resources where you declare recurse => true and purge => true
[14:15:24] andrewbogott: I picked up the habit at $lastjob to always ensure absent before removing, whatever the resource, but absenting packages
[14:15:32] _joe_: yeah, with that exception
[14:15:33] <_joe_> we do that for e.g. apache websites
[14:15:39] good to know
[14:16:17] I didn't finish my sentence lol "but absenting packages can have some weird side effects with dependencies"
[14:17:00] my mindset is usually that if I'm about to absent a bunch of packages that probably means it's time to reimage instead :)
[14:17:14] completely fair :)
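To illustrate the resource semantics discussed above, a minimal Puppet sketch; the resource names are generic examples, not production manifests.

```puppet
# Each package is its own resource; dropping the declaration later does NOT
# uninstall it. Removal has to be made explicit:
package { 'hdparm':
  ensure => present,
}
package { 'some-obsolete-tool':
  ensure => absent,    # or 'purged' to also remove its configuration files
}

# The same holds for files: removing the declaration leaves the file behind.
file { '/etc/example.conf':
  ensure => absent,
}

# The exception mentioned above: a managed directory can purge content that
# Puppet does not manage.
file { '/etc/apache2/sites-enabled':
  ensure  => directory,
  recurse => true,
  purge   => true,     # delete anything in the directory not declared in Puppet
}
```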
[14:21:26] I'm reviewing incidents for the Excellence monthly. Two of them are lacking impact details, and I'm filling something in for now. Will re-review next monday/tuesday before sending out.
[14:21:27] - https://wikitech.wikimedia.org/wiki/Incidents/2022-07-03_shellbox_request_spike
[14:21:37] Assumed, based on Grafana and Logstash, what the impact was.
[14:29:14] - https://wikitech.wikimedia.org/wiki/Incidents/2022-07-11_Shellbox_and_parsoid_saturation
[14:29:20] Assumed based on the mobileapps dashboard
[14:30:38] cc sobanski, arnoldokoth
[14:31:52] Thanks!
[14:35:59] - https://wikitech.wikimedia.org/wiki/Incidents/2022-07-12_codfw_A5_powercycle
[14:36:43] this is the third one like it in 15 days' time (linked the others in See also). Might warrant an actionable. Not really sure, maybe it's a calculated risk intentionally done that way?
[14:37:08] cc bblack XioNoX
[14:37:08] there was an actionable
[14:37:19] even if not properly documented
[14:37:36] PDU work was moved from non-impacting to graceful shutdown
[14:37:53] it is more or less followed on the PDU maintenance tickets
[14:38:14] ack, so the July 12th one would be the last of its kind
[14:39:29] I suggest talking to wiki_willy to suggest leaving a trail of the decision on the doc reports
[14:39:50] my guess is that it was so obvious that it wasn't documented there :-D
[14:40:19] but maybe it was on SRE meetings and tickets related to the pdu
[14:40:25] so the action item was to mark services as down before reboot
[14:40:37] and the ticket is T309957, same as the previous two?
[14:40:38] T309957: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957
[14:40:38] no, it was decided to change the maintenance strategy
[14:40:53] from no impact, to shutdown for a shorter period
[14:41:10] and notify service owners beforehand, etc
[14:41:21] ok, mind if I add "PDU work was moved from non-impacting to graceful shutdown for a shorter period." to the wiki page?
[14:41:29] again, talk to wiki_willy :-D
[14:41:42] I was on vacation, so that is hearsay
[14:42:04] ack
[14:42:20] the other people you can contact are the ICs
[14:42:33] they own having the incidents properly documented
[14:42:56] so if you see gaps, they should be the point of contact to route to the people in the know
[14:43:55] my guess is those were light reports because they had no user impact, but I think the last one was much more impactful
[14:44:50] also, not sending you to wiki_willy into a void; he literally asked me to route to him any comments about pdu maintenance :-D
[14:48:19] ack, already done :)
[14:49:22] I'm also marking varnish/appserver overload cases that are part of larger patterns as 'final', assuming no specific review will happen there.
[14:51:09] yeah, I believe that is handled in its own working group/ticket
[14:51:57] remember also this should be helpful for tracking and further info: https://docs.google.com/spreadsheets/d/1EYbMt6xTCDBaWfrPgu8Z1a3CvYrxbH1uH4kVf8MOQfQ
[14:55:03] thanks for doing the work, BTW
[16:22:55] lmata: only a few drafts left at https://wikitech.wikimedia.org/wiki/Category:Incident_documentation_drafts now. Perhaps even short enough to be worth going through the ~3 old ones at some point
[16:23:51] basically anything that was fairly well-written and more than 6 months old, I removed the draft questions, made sure stuff is on phab and tracked, and then moved to either 'in-review' or 'final'.
[16:27:54] Thank you Krinkle, will take a look.