[09:02:29] may someone puppet-merge for me a few Apache redirects for doc.wikimedia.org please? https://gerrit.wikimedia.org/r/c/operations/puppet/+/824542
[09:03:07] the few `Redirect` statements do not work due to some rewrite rules overlapping them. I have tested it locally and got the expected behavior
[09:13:50] <_joe_> hashar: maybe not on a friday?
[09:13:57] <_joe_> or is it broken now?
[09:14:14] the two existing redirects are broken (they yield a 404 instead)
[09:14:17] which I broke several months ago
[09:14:42] the other one is to redirect https://doc.wikimedia.org/mw-tools-scap/ to https://doc.wikimedia.org/scap/
[09:14:56] the first is no longer updated; the second is the new location and the one CI pushes to
[09:15:07] so the patch has limited effect ;)
[10:57:08] <_joe_> btullis: if you find things missing from the docs to spin up a k8s server, please integrate the docs :)
[10:57:31] <_joe_> (see your last change to labs/private where you had to guess/copy from the other stanzas)
[10:59:12] _joe_: Thanks, I will do my best. There are a couple of places where I'm scratching my head at the moment. I'll definitely update the docs when I think I can improve them :-)
[10:59:29] this is more of a mw question, but hopefully someone here knows: how could I easily know if something that uses cache ends up in Memcache vs Redis vs MariaDB?
[10:59:31] <_joe_> ask when in doubt!
[11:00:04] <_joe_> jynus: know based on what preexisting info?
[11:00:34] mostly by looking at code of core or an extension
[11:00:50] <_joe_> typically, mediawiki-config will have configuration that wires a cache backend to the extension
[11:01:05] is this static, once configured?
[11:01:09] <_joe_> if the extension is using bagofstuff as an interface
[11:01:27] e.g. I will see something like "Localcache" and by looking at config I will know it is memcache?
[11:01:35] <_joe_> yes
[11:01:37] ok
[11:01:47] <_joe_> basically
[11:01:50] so WANCache is x2, is my guess?
[11:02:52] yeah, I saw the bagofstuff reference, but as I understood it, that was a virtual interface that can be implemented with several backends?
[11:03:24] <_joe_> no, wancache is replicated memcached
[11:03:29] <_joe_> mainstash is x2
[11:03:35] ok, so that is my confusion
[11:03:46] <_joe_> both have a bagostuff interface
[11:04:12] <_joe_> so often an extension just uses a bagostuff or wancache or localcache interface
[11:04:13] I see, I will have a look at config to see when one or the other is used
[11:04:17] <_joe_> that you can inject via config
[11:04:36] thanks, your pointer was enough for me to look at the config on my own
[11:04:43] <_joe_> ack
[11:04:53] <_joe_> start from wmf-config/mc.php for instance
[11:05:03] <_joe_> that's where we set up the various memcached-based caches
[11:05:06] it is just that things like objectcache were too ambiguous to me without context
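To make the wiring described above concrete, here is a minimal illustrative PHP sketch, not the actual contents of wmf-config/mc.php: the backend names and the server address are made up, but the pattern of defining named BagOStuff backends in configuration and letting callers ask for a role rather than a concrete class is the one being discussed.

```php
<?php
// Illustrative sketch only (assumed names, not production settings).

// 1) Define backends by name; each maps to a BagOStuff implementation.
$wgObjectCaches['memcached-example'] = [
    'class'   => MemcachedPeclBagOStuff::class,
    'servers' => [ '127.0.0.1:11211' ],   // assumption: a local memcached
];
$wgObjectCaches['db-mainstash-example'] = [
    'class' => SqlBagOStuff::class,        // e.g. a MariaDB-backed store like x2
];

// 2) Point a well-known role at one of those names.
$wgMainStash = 'db-mainstash-example';      // MainStash (x2 in production)

// 3) Extension/core code usually asks for a role or a named cache,
//    not a concrete backend class:
$services  = MediaWiki\MediaWikiServices::getInstance();
$wanCache  = $services->getMainWANObjectCache();   // replicated memcached in prod
$mainStash = $services->getMainObjectStash();      // the x2 store in prod
$byName    = ObjectCache::getInstance( 'memcached-example' ); // a plain BagOStuff
```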
[11:19:59] I have a question related to this new kubernetes cluster. Specifically, it's about the cfssl-issuer profile mentioned here: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/824163/1/helmfile.d/admin_ng/values/dse-k8s-eqiad/cfssl-issuer-values.yaml
[11:20:20] I can see that this references the values in multirootca.yaml here: https://github.com/wikimedia/puppet/blob/production/hieradata/role/common/pki/multirootca.yaml#L17
[11:20:48] <_joe_> jayme, jbond ^^ I think you're the best people to answer
[11:22:26] And there are some matching values in the private repo in `/hieradata/role/common/pki/multirootca.yaml`, but I don't know if I just need to create a new entry here, or if the `key` value relates to something else.
[11:30:45] btullis: you will need to create a new profile in the public repo specifying the policy and the key name to use. You then need to create a matching key in the private repo with that name, e.g. https://phabricator.wikimedia.org/P32585
[11:31:45] btullis: however, keep in mind that cfssl-issuer is for issuing certificates to the containers, not the control plane
[11:35:13] jbond: Thanks for that. Yes, I've reverted to using cergen for the control plane hosts, after discussion with jayme. That key in the private repo (https://phabricator.wikimedia.org/P32585$10) is just any old hex string of 16 characters that I make up, right?
[11:36:21] btullis: yes, I use the following: `openssl rand -hex 8`
[11:37:34] Great, thanks. I'll go ahead with this and also update the docs here to include these steps for creating a new signing profile: https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#helmfile.d_structure
[11:38:33] thanks <3
[11:41:11] <_joe_> btullis: awesome, thanks :)
[12:36:19] sorry, was out for lunch. Thanks for writing the docs indeed
[12:40:20] A pleasure :-) I'm looking at the calico rows E-F configuration now (here: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/824163/1/helmfile.d/admin_ng/values/dse-k8s-eqiad/calico-values.yaml#22) and I'll ask again if I can't work it out.
[12:41:28] I think we should probably move the docs to https://wikitech.wikimedia.org/wiki/PKI/CA_Operations as it's ultimately a PKI operation and nothing special to kubernetes clusters
[12:42:34] Regarding deployment: you should merge the new profile before bootstrapping the cluster (obviously :)), but it's a totally independent change otherwise
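A minimal sketch of the key-generation step described above; the example output and the exact hiera layout are assumptions, only the command and the "16 hex characters" requirement come from the discussion.

```bash
# Generate the shared auth key: 8 random bytes printed as 16 hex characters,
# matching the "any old hex string of 16 characters" noted above.
openssl rand -hex 8
# Example output (made up): 3f9c1a7b2d4e8f06
#
# Add the resulting value to the private repo under
# hieradata/role/common/pki/multirootca.yaml, using the same key name that the
# new signing profile in the public repo refers to
# (https://phabricator.wikimedia.org/P32585 shows the expected shape).
```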
[12:45:14] I'm seeing a strange diff in the package list of a single host (cloudcephosd1033) - hdparm is not installed, though it is in base_packages.txt -- but I'm not sure when/where that file is parsed
[12:57:35] _joe_: https://usercontent.irccloud-cdn.com/file/bYm6mMCH/redis_session_utilization.png
[13:06:25] 🎉
[13:06:30] feel free to burn down any references to nutcracker and k8s equiv for that port
[13:07:15] details at https://phabricator.wikimedia.org/T314453#8168858
[13:07:47] decom at https://phabricator.wikimedia.org/T267581
[13:10:14] dhinus: It looks like that file is only used if we manually run `/usr/local/sbin/apt-audit-installed` - standard packages to be installed are here: https://github.com/wikimedia/puppet/blob/production/modules/base/manifests/standard_packages.pp
[13:11:12] Maybe hdparm used to be installed automatically with older debian versions, but isn't now?
[13:11:38] ah-ha, thanks. Looks like there are some other packages that just get installed by d-i and are not included in standard_packages.pp
[13:11:57] it's not a debian version thing, though, because other servers with the same debian version include hdparm
[13:12:24] <_joe_> Krinkle: so I can just turn off redis now before it's too late and you can rollback?
[13:12:55] _joe_: assuming mcrouter shows no problems, yes.
[13:13:17] whether we want to try that on Friday is your call
[13:13:28] btullis: I wonder if it's possible that a single package failed to install for (reasons) and the installation completed anyway?
[13:14:15] I don't know if nutcracker will make noise if a host is unavailable when there are no incoming queries
[13:14:33] <_joe_> Krinkle: nah, let's do it next week
[13:15:09] not unlike thumbor, it turns out the majority of activity on redis was essentially 404s
[13:15:32] we're looking up randomly generated CP offsets even if there is no cookie incoming.
[13:15:38] on every page view
[13:15:49] chronology protector?
[13:15:56] dhinus: It will get pulled in automatically by the ceph-base package, when that is installed.
[13:15:58] the cookie being the cpPosIndex that tells us you had a previously saved offset in CP
[13:15:59] yeah
[13:16:14] i wasn't sure if you meant chronology protector, changeprop, or something else
[13:16:18] https://www.irccloud.com/pastebin/KZN80u9w/
[13:16:18] :)
[13:17:42] cdanis: when instantiating the CP class, we use the cookie if there is one and otherwise generate a random ID. If you make writes, we'll use the ID to save an offset to CP-store (now memcached, so far redis) and emit a cookie with that ID for it to find on the next req.
[13:18:06] but the logic for reading the offset and potentially waiting for replication... doesn't care whether the ID came from a cookie or was just generated and thus can't possibly have anything in it
[13:18:23] it's something like 0.1ms but still something
[13:18:49] I've worked much harder to shave off 0.1ms
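As an illustration of the read-path behaviour Krinkle describes, a rough PHP-style sketch; this is not the actual ChronologyProtector code. The cookie name cpPosIndex comes from the discussion, while $store, $didWrites, waitForReplicationToReach() and getCurrentPrimaryPositions() are made-up stand-ins.

```php
<?php
// Illustrative pseudocode only (assumed helper names, not MediaWiki core code).

$clientId  = $_COOKIE['cpPosIndex'] ?? null;
$hadCookie = ( $clientId !== null );
if ( !$hadCookie ) {
    // No prior write recorded for this client; an ID is generated anyway.
    $clientId = bin2hex( random_bytes( 8 ) );
}

// Read path, on every request: before the fix, this lookup ran even when the
// ID was freshly generated and therefore could not possibly have stored data.
$positions = $store->get( "cp:$clientId" );   // guaranteed miss without a cookie
if ( $positions ) {
    waitForReplicationToReach( $positions );  // stand-in for the real wait logic
}

// Write path: only after database writes are positions saved and the cookie set.
if ( $didWrites ) {
    $store->set( "cp:$clientId", getCurrentPrimaryPositions(), 60 );
    setcookie( 'cpPosIndex', $clientId );
}
```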
[13:19:58] btullis: true, but still doesn't explain why it's installed on e.g. cloudcephosd1032, where ceph is not installed yet
[13:20:38] entirely possible it was installed with cumin though, while debugging this new group of servers
[13:22:20] I was expecting puppet to ensure the entire package list was in sync, but maybe that would cause too many problems
[13:23:21] puppet only adds packages, never removes them unless something is very explicit
[13:23:40] <_joe_> dhinus: puppet's package resource is atomic, meaning it only handles things you explicitly name in your manifests
[13:23:55] <_joe_> and every single package is treated as its own thing
[13:24:17] you can see in standard_packages.pp linked earlier some ensure=>absent and some ensure=>purged
[13:24:29] <_joe_> also not all providers of the package resource offer the instances interface that puppet would need to use to enforce global state
[13:24:29] which is the only way puppet will remove a package, or in general, a resource
[13:25:09] <_joe_> (for instance, we have package resources that use a provider different from "apt", the standard under debian - one example being our "scap3" provider)
[13:25:18] thanks everyone, that's much clearer now :)
[14:13:38] andrewbogott: That's true for every resource in puppet actually
[14:14:07] If you declare a file as ensure => present, run the agent, and remove the declaration from the manifest, the file will stay
[14:14:24] You need to explicitly define it as ensure => absent for puppet to remove the file
[14:14:29] claime: true, although we somewhat often actively absent resources in puppet defs but seldom absent packages. (Not really a fact about puppet, more about our use of it here)
[14:14:42] andrewbogott: Fair enough :)
[14:15:19] <_joe_> claime: with the exception of directory resources where you declare recurse => true and purge => true
[14:15:24] andrewbogott: I picked up the habit at $lastjob to always ensure absent before removing, whatever the resource, but absenting packages
[14:15:32] _joe_: yeah, with that exception
[14:15:33] <_joe_> we do that for e.g. apache websites
[14:15:39] good to know
[14:16:17] I didn't finish my sentence lol "but absenting packages can have some weird side effects with dependencies"
[14:17:00] my mindset is usually that if I'm about to absent a bunch of packages that probably means it's time to reimage instead :)
[14:17:14] completely fair :)
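To illustrate the resource semantics discussed above, a minimal Puppet sketch; the resource names are generic examples, not production manifests.

```puppet
# Each package is its own resource; dropping the declaration later does NOT
# uninstall it. Removal has to be made explicit:
package { 'hdparm':
  ensure => present,
}
package { 'some-obsolete-tool':
  ensure => absent,    # or 'purged' to also remove its configuration files
}

# The same holds for files: removing the declaration leaves the file behind.
file { '/etc/example.conf':
  ensure => absent,
}

# The exception mentioned above: a managed directory can purge content that
# Puppet does not manage.
file { '/etc/apache2/sites-enabled':
  ensure  => directory,
  recurse => true,
  purge   => true,     # delete anything in the directory not declared in Puppet
}
```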
[14:21:26] I'm reviewing incidents for the Excellence monthly. Two of them are lacking impact details, and I'm filling something in for now. Will re-review next monday/tuesday before sending out.
[14:21:27] - https://wikitech.wikimedia.org/wiki/Incidents/2022-07-03_shellbox_request_spike
[14:21:37] Assumed, based on Grafana and Logstash, what the impact was.
[14:29:14] - https://wikitech.wikimedia.org/wiki/Incidents/2022-07-11_Shellbox_and_parsoid_saturation
[14:29:20] Assumed based on the mobileapps dashboard
[14:30:38] cc sobanski, arnoldokoth
[14:31:52] Thanks!
[14:35:59] - https://wikitech.wikimedia.org/wiki/Incidents/2022-07-12_codfw_A5_powercycle
[14:36:43] this is the third one like it in 15 days' time (linked the others in See also). Might warrant an actionable. Not really sure, maybe it's a calculated risk intentionally done that way?
[14:37:08] cc bblack XioNoX
[14:37:08] there was an actionable
[14:37:19] even if not properly documented
[14:37:36] PDU work was moved from non-impacting to graceful shutdown
[14:37:53] it is more or less followed on the PDU maintenance tickets
[14:38:14] ack, so the July 12th one would be the last of its kind
[14:39:29] I suggest talking to wiki_willy to suggest leaving a trail of the decision on the doc reports
[14:39:50] my guess is that it was so obvious that it wasn't documented there :-D
[14:40:19] but maybe it was on SRE meetings and tickets related to the pdu
[14:40:25] so the action item was to mark services as down before reboot
[14:40:37] and the ticket is T309957, same as the previous two?
[14:40:38] T309957: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957
[14:40:38] no, it was decided to change the maintenance strategy
[14:40:53] from no impact, to shutdown for a shorter period
[14:41:10] and notify service owners beforehand, etc
[14:41:21] ok, mind if I add "PDU work was moved from non-impacting to graceful shutdown for a shorter period." to the wiki page?
[14:41:29] again, talk to wiki_willy :-D
[14:41:42] I was on vacation, so that is hearsay
[14:42:04] ack
[14:42:20] the other people you can contact are the ICs
[14:42:33] they own having the incidents properly documented
[14:42:56] so if you see gaps, they should be the point of contact to route to the people in the know
[14:43:55] my guess is those were light reports because they had no user impact, but I think the last one was much more impactful
[14:44:50] also, not sending you to wiki_willy into a void; he literally asked me to route to him any comments about pdu maintenance :-D
[14:48:19] ack, already done :)
[14:49:22] I'm also marking varnish/appserver overload cases that are part of larger patterns as 'final', assuming no specific review will happen there.
[14:51:09] yeah, I believe that is handled in its own working group/ticket
[14:51:57] remember also this should be helpful for tracking and further info: https://docs.google.com/spreadsheets/d/1EYbMt6xTCDBaWfrPgu8Z1a3CvYrxbH1uH4kVf8MOQfQ
[14:55:03] thanks for doing the work, BTW
[16:22:55] lmata: only a few drafts left at https://wikitech.wikimedia.org/wiki/Category:Incident_documentation_drafts now. Perhaps even short enough to be worth going through the ~3 old ones at some point
[16:23:51] basically anything that was fairly well-written and more than 6 months old, I removed the draft questions, made sure stuff is on phab and tracked, and then moved to either 'in-review' or 'final'.
[16:27:54] Thank you Krinkle, will take a look.