[00:37:31] * Krinkle has turned the lights on for Drmrs on the Wikitech cluster map at https://wikitech.wikimedia.org/wiki/Template:ClusterMap
[06:48:49] <_joe_> bblack: I radically disagree with "accepting it's a reality" as a policy. That's what got us away from the community and on slack in the first place, as an org.
[06:49:36] <_joe_> I do think there are cases where a production setup gives no advantage and a lot of disadvantages though, compared to a setup in a public cloud, and I agree we should mark them clearly and have precise guidelines
[06:50:35] <_joe_> now, in the past we would've presented such a policy proposal as an RfC, discussed it, approved it. Now I don't think there's any venue available to have such a debate formally. The Technical Decision Forum isn't suited for such things either, and I don't think it's authoritative
[08:02:39] opened a task to discuss whether an upgrade to Kafka 2.x is worth it; whoever is interested is welcome :) https://phabricator.wikimedia.org/T300102
[08:45:39] [we had the cloud argument a lot at $JOB[1]; from a cost perspective, if we could keep our on-prem compute & storage pretty "busy" it was a lot cheaper than cloudy solutions (though we ran our own OpenStack). There were always people who insisted they needed the very latest AWS GPU offerings, and it's very easy to run up huge bills that way. Tediously, you basically pay extra for someone else to have the surge capacity]
[09:55:37] heads up all, I'm about to deploy a change which updates the prometheus ferm rules; this should be a no-op but if you see any issues let me know (https://gerrit.wikimedia.org/r/c/operations/puppet/+/757010)
[10:12:00] jbond: that means we now allow connections to prometheus endpoints from anywhere and not just collectors?
[10:13:33] XioNoX: yes exactly, sorry, I should have also mentioned that, see: https://gerrit.wikimedia.org/r/c/operations/puppet/+/756988
[10:14:22] sorry, to clarify: it means we now allow prometheus hosts to connect to any port, so we don't need prometheus-specific rules
[10:15:06] the endpoints will still be restricted
[10:16:20] nice!
[10:16:30] great cleanup indeed!
[10:17:06] thx
[10:24:01] indeed, thanks jbond !
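For readers not familiar with the setup, a conceptual sketch of what this ferm cleanup amounts to, written as plain iptables purely for illustration; the real rules are ferm templates generated by Puppet in the Gerrit changes linked above, and the source address and port below are made up:

```bash
# Before: a dedicated rule per exporter port, maintained for every service,
# e.g. allowing a Prometheus host to reach node_exporter (port 9100) only:
iptables -A INPUT -s 10.0.0.1 -p tcp --dport 9100 -j ACCEPT

# After: a single rule per Prometheus host allowing any destination port, so
# no prometheus-specific rule is needed when a new exporter appears.
# Endpoints stay restricted because only the Prometheus source addresses
# are allowed through:
iptables -A INPUT -s 10.0.0.1 -p tcp -j ACCEPT
```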
[12:19:11] btullis: I don't know anything about your question, but the subteam that knows more about containarization in production will be #wikimedia-serviceops [12:19:43] jynus: Thanks. I will ask there too. [12:21:08] btullis: there was a conversation in the past to add Kubeflow to the stat boxes, Alex added some comments about rootless docker in https://phabricator.wikimedia.org/T275551#6928702. [12:22:59] as far as I know we vet the docker images by building them in our environment (see for example the production-images repo and the blubber docs etc..) [12:23:32] I went through the same process to import all parts of the kubeflow/kserve stack (istio, knative, kserve, ...) [12:23:43] it is tough but it definitely pays off in the long term [12:24:32] btullis: what is the main use case though? Replacing the test cluster or allowing people to test new things via docker? [12:24:37] (to understand the use case) [12:27:33] <_joe_> btullis: so, we have nothing against podman, but we're using docker at the moment inside kubernetes [12:28:15] <_joe_> btullis: I was considering using podman or kaniko for building our container images during CI using docker-pkg, for instance [12:28:20] Elukey: Not replacing the test cluster. Evaluating new software with access to kerberized Hadoop. Potentially deploying production software using containers too. [12:29:51] <_joe_> btullis: if your goal is running containers downloaded from external sources, we're currently not doing it, as elukey was explaining [12:30:52] btullis: ack, for it the only thing that I can think of is the dse-k8s cluster (the old name was train-wing) that we should get during the next couple of months (that will require a strict validation and import of docker images etc..) [12:30:54] <_joe_> the main reason for that is that we can't depend on external sources for security updates, unless these sources commit to a security update policy (like debian does) [12:32:53] <_joe_> so if your goal is to run software in containers, and you want to use containers not built with our pipeline, that would need a thorough discussion because it's against our docker image update policy (I don't find the link rn, moritzm might find it?) [12:32:56] Ok, thanks both. So podman ok for testing, but third party images not ok, (unless strict security policy is evident and vetted). [12:33:20] <_joe_> btullis: I'm happy to explain to you how to rebuild an image in our infra [12:34:04] <_joe_> and if you want to package podman for debian, I would be extremely happy :) [12:34:16] <_joe_> for buster I mean, I think it's in sid already [12:34:54] <_joe_> oh it's already in bullseye, heh [12:35:05] Yes, I was wondering about compiling a statically linked binary for testing. [12:35:51] podman in bullseye works, there is also a podman-docker package that provides /usr/bin/docker [12:35:54] docker-compose also works [12:36:00] (on top of podman) [12:36:13] it's not yet fully official and on wikitech but https://docs.google.com/document/d/1rP9dbErix7g_gp8AKBNBe-bgLEc3AIvOrDzYyBQCbB4/ should be accessible to anyone in the sre@wikimedia group (and happy to add anyone else as needed) [12:36:23] <_joe_> moritzm: thanks :) [12:37:07] <_joe_> paravoid: do you think it will be hard to backport to buster, in case? [12:38:46] you can, and it will probably work, but I would not recommend it [12:39:40] if you use crun, you need cgroups v2, which was opt-in in buster (but default in bullseye), cf. 
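Putting moritzm's and paravoid's notes together, a rough sketch of what this looks like in practice for a test host; treat it as hedged and untested here rather than a recipe. The package names come from the discussion above, while the systemd user socket, DOCKER_HOST path, GRUB parameter and sysctl names are the stock Debian/podman ones, not anything validated on WMF hosts (paravoid's further caveats about crun bugs and rootless performance follow below):

```bash
# Rootless podman on bullseye, roughly as described above:
sudo apt install podman podman-docker docker-compose

# docker-compose needs a Docker-compatible API socket; podman provides one
# via a per-user systemd socket unit:
systemctl --user enable --now podman.socket
export DOCKER_HOST="unix:///run/user/$(id -u)/podman/podman.sock"

docker info           # answered by the podman-docker shim (/usr/bin/docker)
docker-compose up -d  # containers run rootless under podman

# On buster the same setup would additionally need (per paravoid):
# 1) cgroup v2 for crun: add systemd.unified_cgroup_hierarchy=1 to
#    GRUB_CMDLINE_LINUX in /etc/default/grub, run update-grub, then reboot;
# 2) unprivileged user namespaces, opt-in via a Debian-specific sysctl:
echo 'kernel.unprivileged_userns_clone = 1' | sudo tee /etc/sysctl.d/90-userns.conf
sudo sysctl --system
```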
[12:40:29] I ran into some other bugs in the past that required newer libseccomp etc., e.g. https://github.com/containers/crun/issues/545
[12:40:50] and https://github.com/containers/crun/issues/530
[12:41:57] everything works out of the box with bullseye, so I would recommend that
[12:42:34] Thanks all, I'm sure we can deploy bullseye for this. Just wondered what our options are for this prototyping phase.
[12:43:00] btw rootless in bullseye is still going to be pretty slow
[12:43:25] both fuse-overlayfs (disk i/o) and slirp4netns (net i/o) are going to be slow
[12:44:03] with linux 5.11+ (e.g. bookworm), there is support for unprivileged (kernel) overlayfs, so fuse-overlayfs will not be required anymore
[12:44:20] (I have that running right now on my laptop with sid's kernel, 5.15)
[12:46:18] (disclaimer: I have no opinions right now on what tech stack to run in production and how, so don't take any of this as a recommendation. I've been playing around with podman for my own purposes and sharing my experiences - HTH! :)
[12:47:55] Yes, all very helpful. Many thanks all.
[14:27:29] podman> I've heard good things (I think a bunch of Ceph deployments-via-container are using it in preference to docker)
[15:02:18] <_joe_> paravoid: oh right, rootless podman would be really hard on buster
[15:02:40] volans: today I discovered that I can't do nested f-strings in Python, and I'm having a sad :(
[15:02:42] <_joe_> as in, requiring extensive modification
[15:03:19] <_joe_> kormat: nested f-strings, you monster, do you hate all the future readers of your code so much?
[15:03:44] do you even have to ask?
[15:03:50] <_joe_> (that was a silly question, yes)
[15:05:08] volans: oh, wait, it's allowed provided you use different quotes for the inner f-string? ಠ_ಠ
[15:05:41] `logging.debug(f"Results: {', '.join([f'{k}:{results[k]}' for k in results])}")`
[15:06:55] * volans can't unread the above line
[15:07:34] \o/
[15:08:24] also seems wrong, but I would not go there
[15:08:33] volans: oh no, please do!
[15:08:53] what type is results?
[15:09:04] OrderedDict[str, str]
[15:11:50] 1) do you really need an ordered dict in the first place? since 3.7 all dicts are guaranteed to be insertion-ordered
[15:12:58] actually in this case the ordered bit is irrelevant, as the results are gathered in parallel, so there's no relevant order to the insertions, so I'll revert that to a normal dict
[15:16:20] 2) instead of {k}:{results[k]}, why not {k}:{v} for k, v in results.items()?
[15:16:39] oh yeah, that's much neater
[15:17:23] hah. in fact, now that it's no longer an ordered dict, `f"Results: {results}"` is actually fine for this purpose
[15:18:02] that was my 3rd point
[15:18:09] that's not much different than just adding results
[15:18:12] *printing
[15:18:28] the OrderedDict str representation was much worse
[15:19:24] hurm. I do want the results to be sorted, though.
[15:19:35] feh.
[15:20:22] dict(sorted(results.items())) :-p
[15:20:35] * kormat winces
[15:20:37] https://phabricator.wikimedia.org/P19339
[16:28:54] moritzm: can we share that container image security updates doc with wikimedia foundation?
[16:30:41] ^ done, please revert if that is not okay.
[16:35:04] it's not yet in a state where feedback outside of SRE is actively sought (e.g. the feedback provided so far isn't incorporated)
[16:35:36] it will be in the near future, but not ATM
[16:41:09] not for feedback, but for reference about why we can't use externally built docker images
[16:41:22] lots of convos going on in Data Engineering planning about why we can't just use docker, etc.
[16:41:36] the more information I can link to, the better
[16:41:47] I made WMF viewers only, not commenters, but again, feel free to revert
[16:51:25] <_joe_> ottomata: I would be happy to be involved in such conversations, btw, if I can bring some clarity.
[16:52:36] _joe_: Thanks also. I'd like to take you up on that offer of going through the image rebuild process some time.
[16:53:27] <_joe_> btullis: so some introductory logic is here https://wikitech.wikimedia.org/wiki/Kubernetes/Images#Base_images
[16:53:50] 👍 thanks.
[16:54:02] <_joe_> I'll also share with you the design document I wrote some time ago about the system we're (slowly) building to manage the whole thing organically
[16:57:59] _joe_: good to know, I will send Emil Chetty your way :)
[16:58:00] ty
[19:28:33] when I follow the link to "And in logstash, with 5xx kibana dashboard" in https://wikitech.wikimedia.org/wiki/Logs#5xx_errors, I get "Application Not Found" from https://logstash.wikimedia.org/app/kibana#/dashboard/Varnish-Webrequest-50X
[19:29:04] when I follow the link to "Grafana: Application RED dashboard" from an Incident report page I get a grafana page but "no data"
[19:29:30] what is the better and working link to currently see the error rates, please?
[19:37:35] eh, I'm ok, I found the working version of the RED dashboard when digging in grafana, just not on the frontpage, and the link changed or so
[19:40:29] anyone know what this is?
[19:40:30] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, 'network::parse_abuse_nets' parameter 'abuse_nets' entry 'mx_blocked_nets' entry 'context' index 0 expects a match for Network::Context = Enum['ferm', 'phabricator', 'varnish'], got 'mx' (file: /etc/puppet/modules/base/manifests/firewall.pp, line: 50, column: 9) on node
[19:40:30] cloudcontrol1005.wikimedia.org
[19:41:58] herron: any chance that's your latest patch? ^^
[19:42:03] andrewbogott: that is on puppet run, right?
[19:42:11] yes
[19:43:17] aaaand now it seems better?
[19:44:26] <+icinga-wm> PROBLEM - Widespread puppet agent failures - no resources reported on alert1001
[19:44:34] 19:42 < jbond> puppet fixed now
[19:44:50] affected all hosts because exim is everywhere
[19:45:15] * jbond running puppet on failed nodes now
[19:45:46] and it's more related to ferm being everywhere
[20:18:40] noticed while using apt: ERROR:debmonitor:Failed to execute DebMonitor CLI: 'NoneType' object has no attribute 'source_name'
[20:18:44] (via cumin)
[20:33:56] mutante: which host?
[20:35:14] volans: for example this; maybe it is because I am using wildcard/glob on package names
[20:35:17] sudo cumin mw139* 'apt-get -y remove --purge fonts* && apt-get -y remove --purge xfonts*'
[20:35:47] I was kind of waiting for https://debmonitor.wikimedia.org/packages/fonts-kalapi to update, for example
[20:36:41] sorry, gotta go afk, will be back
[20:36:55] mutante: can you give me a host where it needs to run and you didn't run it yet?
[20:52:41] I couldn't repro on sretest trying to install and remove the same packages; feel free to open a task with the info you have and possibly a host where to test/repro it, thanks
[21:22:06] WMF n00b here. jbond or anyone else, do you know if cergen works with the deployment-prep puppetmaster?
[21:27:55] inflatador: I don't know, I'm afraid
[21:29:04] no worries, gehel recommended I hit you up. Based on taavi's comment here it sounds like no one has tried yet: https://phabricator.wikimedia.org/T298252
[21:30:07] So I'll probably give it a shot later this week if I don't get a confirm or deny. Is there a particular mailing list you would recommend I ask about this?
[21:37:19] inflatador: deployment-prep things are in scope for cloud@ and wikitech-l@ conversations. I wouldn't expect many on those lists to have done infra work in deployment-prep at all though.
[21:44:13] Thanks bd808, I'll reach out there and see if I get a response
[21:45:53] inflatador: it does look like cergen has been used for some things in deployment-prep -- https://gerrit.wikimedia.org/r/plugins/gitiles/labs/private/+/refs/heads/master/modules/secret/secrets/certificates/certificates.manifests.d/deployment_prep.certs.yaml
[21:46:52] inflatador: it looks like ottomata might be a person of interest in your investigations :)
[21:48:37] Ah nice, git blame FTW
[21:56:04] hello!
[21:56:22] it's been a while, but yes, I have used cergen in deployment-prep.
[21:56:34] it should be installed on that puppetmaster, by puppet itself (I think)
[21:56:40] so it should work just like it does in prod
[21:56:42] IIRC
[21:57:24] Thanks ottomata! (And thanks for interviewing me ;P ) I'll give it a shot... there doesn't seem to be much risk involved AFAIK, let me know if not though!
[21:57:50] :)
[21:57:56] yeah no risk
[21:58:01] it will either work or not :)
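Following ottomata's "it should be installed on that puppetmaster" and "it will either work or not", a hedged sketch of a low-risk first check; the instance hostname is a placeholder, the labs/private checkout path is an assumption based on the usual Cloud VPS puppetmaster layout, and only the relative manifest path comes from the Gerrit link above:

```bash
# On the deployment-prep puppetmaster (placeholder hostname):
ssh <deployment-prep-puppetmaster>.deployment-prep.eqiad1.wikimedia.cloud

# Did puppet install cergen, as it does in production?
command -v cergen && cergen --help

# Existing deployment-prep certificate manifests (relative path per the
# labs/private Gerrit link above; the base path is an assumption):
ls /var/lib/git/labs/private/modules/secret/secrets/certificates/certificates.manifests.d/
```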