[07:11:51] TIL of this morning
[07:12:17] the kfserving project, which is what we are testing for our dear models, is going to get promoted/re-branded to kserve
[07:12:20] https://github.com/kserve/kserve
[07:13:28] the funny part that I see is Knative+Istio being optional, something that I didn't quite understand at the moment
[07:13:39] <_joe_> wut
[07:14:21] <_joe_> so all that work was for nothing, great
[07:14:26] <_joe_> :D
[07:14:53] <_joe_> it still needs istio ingress though
[07:16:00] so knative is very nice for us, I think we'd be interested in using it anyway, but I didn't see any trace in the code of the fact that it can be optional
[07:18:51] I asked upstream for some info
[07:18:56] * elukey cries in a corner
[07:22:16] _joe_ not sure if I already showed https://github.com/kubeflow/kfserving/tree/master/docs/samples/v1beta1/rollout to you but it looks nice
[07:29:02] what the... :-o
[07:30:12] but as joe said it still needs the ingress gateway. So probably only knative is obsolete really?
[07:34:31] I am not sure how knative is obsoleted, the code references it a lot IIRC
[07:53:05] <_joe_> not obsolete, apparently optional in the first image, but then it seems required in all the rest of the docs
[08:12:21] I have an envoy issue (I know, very original)
[08:12:49] I get an error saying https://www.irccloud.com/pastebin/IPzn7GyF/
[08:13:37] the config is this one https://etherpad.wikimedia.org/p/effiee
[08:15:01] any ideas would be highly appreciated
[08:27:30] and somebody keeps telling me that httpd's config is confusing :P
[08:27:44] will not say any name
[08:30:20] effie: there may be a typo or something, I'd suggest removing bits of the config and reloading until you find where it breaks
[08:30:25] at least to narrow it down
[08:31:34] effie: ahh wait, check "filter_chains" at 119
[08:32:36] seems to be aligned with socket_address
[08:32:55] not sure if it is a paste issue or not
[08:35:27] ah!!!
[08:35:32] thank you, let me see
[08:35:39] stupid error messages
[08:36:27] yeye!
[08:36:48] \o/
[08:37:05] tx tx
[08:37:15] it is so nice to have config files in yaml
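For reference, a minimal sketch of the listener layout the fix above points at, with a hypothetical listener name and port (the actual pasted config isn't reproduced here): `filter_chains` belongs at the listener level, as a sibling of `address`; indenting it to the level of `socket_address` nests it under `address` and Envoy rejects the config with a fairly opaque error.

```yaml
# Sketch only, assuming the problem was the indentation slip described above.
static_resources:
  listeners:
  - name: example_listener      # hypothetical name and port
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 8080
    filter_chains:              # correct level: sibling of address, not of socket_address
    - filters:
      - name: envoy.filters.network.http_connection_manager
        # ... http_connection_manager config elided ...
```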
[08:38:46] effie: may I ask 2 mins of your time in exchange for the fix? :D
[08:38:51] it's related to systemd::coredump
[08:39:06] now you are scratching an old wound
[08:39:16] I know I know, I was reading https://phabricator.wikimedia.org/T236253
[08:39:27] I had a chat with Moritz since I keep seeing
[08:39:37] Core dump to |/usr/lib/systemd/systemd-coredump 26789 33 33 11 1631848119 0 php-fpm7.2 pipe failed
[08:39:57] so I am wondering one thing - is it ok if we install systemd-coredump as part of the class code?
[08:40:03] it seems missing
[08:40:10] (I can test it on one canary first of course)
[08:40:24] and also no idea what is the status of the problem in that task
[08:40:24] yes, it is on purpose, so, if we allow all servers to coredump
[08:40:55] it will take time to dump e.g. 8G on disk
[08:41:15] ah I see it is absented
[08:41:33] we additionally had compression on when we tested it
[08:41:42] mmm no it is enabled
[08:41:44] so the server that coredumped was useless
[08:42:13] I kept wanting to go back and fix this and kept putting it off
[08:42:26] so one idea is to have 1 canary server with coredumps enabled and installed
[08:42:46] yes I think it is good, on both wtp and mw servers
[08:42:54] I can help, I'd like to check those coredumps
[08:43:12] but my main point is that now they are not generated
[08:43:13] then there was a bigger picture, where we have this enabled/disabled across all servers or not
[08:43:57] an easy terrible thing to do would be to install systemd-coredump on one server
[08:44:12] and then remove it, if you want to have one right now
[08:44:38] I am confused now though - the class + config is deployed everywhere, but the only thing that prevents those coredumps from being generated is the fact that we don't install systemd-coredump?
[08:44:44] yes
[08:44:54] ahhh ok
[08:45:03] so if one really needs coredumps, they install it on a server
[08:45:18] not ideal, thus not advertised
[08:45:31] we left it like that until we'd find a perm solution
[08:48:09] if possible let's find one
[08:49:49] when we have mw running on k8s we can simply enable a few pods with coredumps enabled
[08:50:44] serviceops, SRE, Patch-For-Review, User-jijiki: systemd-coredump can make a system unresponsive - https://phabricator.wikimedia.org/T236253 (elukey) To keep archives happy - we currently don't deploy `systemd-coredump` on our hosts (because of the reasons highlighted above), so the dumps are not...
[08:52:06] serviceops, User-jijiki: Measure segfaults in mediawiki and parsoid servers - https://phabricator.wikimedia.org/T246470 (elukey)
[08:52:12] serviceops, SRE, Patch-For-Review, User-jijiki: systemd-coredump can make a system unresponsive - https://phabricator.wikimedia.org/T236253 (elukey)
[08:52:50] serviceops, observability, User-jijiki: Measure segfaults in mediawiki and parsoid servers - https://phabricator.wikimedia.org/T246470 (elukey)
[08:53:20] serviceops, observability, User-jijiki: Measure segfaults in mediawiki and parsoid servers - https://phabricator.wikimedia.org/T246470 (elukey) Getting back to this, added the Observability tag to get some feedback from the team as well.
[08:58:41] moritzm: I will create a task for that, that's a nice idea
[08:59:18] luca and I will fix this coredump thing I left half done
[08:59:23] my Waterloo
[09:00:43] ahahha
[09:01:27] moritzm: I also found an old task in which c*danis was wondering if we should have a metric tracking segfaults (fleet wide), no idea if there was any progress, I tagged observability
[09:01:52] ah yes! We have that metric!
[09:02:41] but only for central log
[09:02:43] mmm
[09:03:49] nope wrong field :P
[09:16:14] oh yes, having that fleet-wide would be a useful metric, I bet with automatic restarts handled by systemd there's a handful of segfaults we haven't even noticed :-)
[09:17:00] moritzm: if you add more work, we will start charging you
[09:19:29] moritzm: I think that we already have a mtail-based metric that reports segfaults, but the numbers don't add up from what I can see (logs vs metric). Moreover I see a lot of "traps: ... general protection .." in dmesg, may be useful to have as a metric too
[09:22:40] maybe the mtail script is off and doesn't cover all cases?
[09:23:26] I am checking the regexes yes
[09:53:18] serviceops, observability, Patch-For-Review, User-jijiki: Measure segfaults in mediawiki and parsoid servers - https://phabricator.wikimedia.org/T246470 (elukey) It seems that we do have an mtail metric for segfaults, but when checking [[ https://thanos.wikimedia.org/graph?g0.expr=segfault%7Bhos...
[10:00:19] serviceops, SRE, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Jelto) My current plan for cluster-wise migration and deploying services with helm3 is: * make sure cluster is depooled * delete helm releases for all services * remove tiller compon...
[10:00:44] serviceops, SRE, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Jelto)
[10:01:45] serviceops, SRE, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Jelto)
[12:07:00] serviceops, SRE: Deploy PHP patch for DOM replaceChild/removeChild performance - https://phabricator.wikimedia.org/T291052 (MoritzMuehlenhoff) Ack, I'll upload to apt.wikimedia.org on Monday.
[13:35:13] "we are working on the migration script for kfserving-> kserve transition and kserve 0.7 release is going to support raw kubernetes deployment without knative/istio. (edited) "
[13:36:04] at this point istio+knative still makes sense to us (as ML) after the initial investment, but it is a little frustrating
[14:21:09] serviceops, Performance-Team, SRE, Thumbor, User-jijiki: Run latest Thumbor on Docker with Buster + Python 3 - https://phabricator.wikimedia.org/T267327 (Krinkle) Moving back for re-triage as it's been dormant 6 months in a column for things "this quarter".
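As a footnote on the v1beta1 rollout sample linked earlier (docs/samples/v1beta1/rollout): roughly, a v1beta1 InferenceService can shift a slice of traffic to a newly applied model revision via canaryTrafficPercent, which is what made the sample look nice. A hedged sketch with hypothetical names and storage URI; the sample repo is the authoritative version, and exact fields may differ between kfserving and kserve releases.

```yaml
# Sketch only: not taken from the sample, just the general shape of a canary rollout.
apiVersion: serving.kubeflow.org/v1beta1   # kfserving API group; becomes serving.kserve.io/v1beta1 after the rename
kind: InferenceService
metadata:
  name: example-model                      # hypothetical service name
spec:
  predictor:
    canaryTrafficPercent: 10               # route a small share of traffic to the newly applied revision
    sklearn:
      storageUri: gs://example-bucket/model-v2   # hypothetical storage URI for the new model version
```

Once the canary looks healthy, promotion is then, roughly, a matter of raising canaryTrafficPercent (or dropping it so the new revision takes all traffic); the sample in the repo walks through the exact steps.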