[07:11:51] TIL of this morning
[07:12:17] the kfserving project, which is what we are testing for our dear models, is going to get promoted/re-branded to kserve
[07:12:20] https://github.com/kserve/kserve
[07:13:28] the funny part that I see is Knative+Istio being optional, something that I didn't quite understand at the moment
[07:13:39] <_joe_> wut
[07:14:21] <_joe_> so all that work was for nothing, great
[07:14:26] <_joe_> :D
[07:14:53] <_joe_> it still needs istio ingress though
[07:16:00] so knative is very nice for us, I think we'd be interested in using it anyway, but I didn't see any trace in the code of the fact that it can be optional
[07:18:51] I asked upstream for some info
[07:18:56] * elukey cries in a corner
[07:22:16] _joe_ not sure if I already showed https://github.com/kubeflow/kfserving/tree/master/docs/samples/v1beta1/rollout to you but it looks nice
[07:29:02] what the... :-o
[07:30:12] but as joe said it still needs the ingress gateway. So probably only knative is obsolete really?
[07:34:31] I am not sure how knative is obsoleted, the code references it a lot IIRC
[07:53:05] <_joe_> not obsolete, apparently optional in the first image, but then it seems required in all the rest of the docs
[08:12:21] I have an envoy issue (I know, very original)
[08:12:49] I get an error saying https://www.irccloud.com/pastebin/IPzn7GyF/
[08:13:37] the config is this one https://etherpad.wikimedia.org/p/effiee
[08:15:01] any ideas would be highly appreciated
[08:27:30] and somebody keeps telling me that httpd's config is confusing :P
[08:27:44] will not say any name
[08:30:20] effie: there may be a typo or something, I'd suggest removing bits of the config and reloading until you find where it breaks
[08:30:25] at least to narrow it down
[08:31:34] effie: ahh wait, check "filter_chains" at 119
[08:32:36] seems to be aligned with socket_address
[08:32:55] not sure if it is a paste issue or not
[08:35:27] ah!!!
[08:35:32] thank you, let me see
[08:35:39] stupid error messages
[08:36:27] yeye!
[08:36:48] \o/
[08:37:05] tx tx
[08:37:15] it is so nice to have config files in yaml
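For reference, a minimal sketch of the listener layout the fix above points at, with a hypothetical listener name and port (the actual pasted config isn't reproduced here): `filter_chains` belongs at the listener level, as a sibling of `address`; indenting it to the level of `socket_address` nests it under `address` and Envoy rejects the config with a fairly opaque error.

```yaml
# Sketch only, assuming the problem was the indentation slip described above.
static_resources:
  listeners:
  - name: example_listener      # hypothetical name and port
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 8080
    filter_chains:              # correct level: sibling of address, not of socket_address
    - filters:
      - name: envoy.filters.network.http_connection_manager
        # ... http_connection_manager config elided ...
```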
[08:38:46] effie: may I ask 2 mins of your time in exchange for the fix? :D
[08:38:51] it's related to systemd::coredump
[08:39:06] now you are scratching an old wound
[08:39:16] I know I know, I was reading https://phabricator.wikimedia.org/T236253
[08:39:27] I had a chat with Moritz since I keep seeing
[08:39:37] Core dump to |/usr/lib/systemd/systemd-coredump 26789 33 33 11 1631848119 0 php-fpm7.2 pipe failed
[08:39:57] so I am wondering one thing - is it ok if we install systemd-coredump as part of the class code?
[08:40:03] it seems missing
[08:40:10] (I can test it on one canary first of course)
[08:40:24] and also no idea what is the status of the problem in that task
[08:40:24] yes, it is on purpose, so, if we allow all servers to coredump
[08:40:55] it will take time to dump e.g. 8G on disk
[08:41:15] ah I see it is absented
[08:41:33] we additionally had compression on when we tested it
[08:41:42] mmm no it is enabled
[08:41:44] so the server that coredumped was useless
[08:42:13] I kept wanting to go back and fix this and kept putting it off
[08:42:26] so one idea is to have 1 canary server with coredumps enabled and installed
[08:42:46] yes I think it is good, on both wtp and mw servers
[08:42:54] I can help, I'd like to check those coredumps
[08:43:12] but my main point is that now they are not generated
[08:43:13] then there was a bigger picture, where we have this enabled/disabled across all servers or not
[08:43:57] an easy terrible thing to do would be to install systemd-coredump on one server
[08:44:12] and then remove it, if you want to have one right now
[08:44:38] I am confused now though - the class + config is deployed everywhere, but the only thing that prevents those coredumps from being generated is the fact that we don't install systemd-coredump?
[08:44:44] yes
[08:44:54] ahhh ok
[08:45:03] so if one really needs coredumps, they install it on a server
[08:45:18] not ideal, thus not advertised
[08:45:31] we left it like that until we'd find a perm solution
[08:48:09] if possible let's find one
[08:49:49] when we have mw running on k8s we can simply enable a few pods with coredumps enabled
[08:50:44] serviceops, SRE, Patch-For-Review, User-jijiki: systemd-coredump can make a system unresponsive - https://phabricator.wikimedia.org/T236253 (elukey) To keep archives happy - we currently don't deploy `systemd-coredump` on our hosts (because of the reasons highlighted above), so the dumps are not...
[08:52:06] serviceops, User-jijiki: Measure segfaults in mediawiki and parsoid servers - https://phabricator.wikimedia.org/T246470 (elukey)
[08:52:12] serviceops, SRE, Patch-For-Review, User-jijiki: systemd-coredump can make a system unresponsive - https://phabricator.wikimedia.org/T236253 (elukey)
[08:52:50] serviceops, observability, User-jijiki: Measure segfaults in mediawiki and parsoid servers - https://phabricator.wikimedia.org/T246470 (elukey)
[08:53:20] serviceops, observability, User-jijiki: Measure segfaults in mediawiki and parsoid servers - https://phabricator.wikimedia.org/T246470 (elukey) Getting back to this, added the Observability tag to get some feedback from the team as well.
[08:58:41] moritzm: I will create a task for that, that's a nice idea
[08:59:18] luca and I will fix this coredump thing I left half done
[08:59:23] my Waterloo
[09:00:43] ahahha
[09:01:27] moritzm: I also found an old task in which c*danis was wondering if we should have a metric tracking segfaults (fleet wide), no idea if there was any progress, I tagged observability
[09:01:52] ah yes! We have that metric!
[09:02:41] but only for central log
[09:02:43] mmm
[09:03:49] nope wrong field :P
[09:16:14] oh yes, having that fleet-wide would be a useful metric, I bet with automatic restarts handled by systemd there's a handful of segfaults we haven't even noticed :-)
[09:17:00] moritzm: if you add more work, we will start charging you
[09:19:29] moritzm: I think that we already have a mtail-based metric that reports segfaults, but the numbers don't add up from what I can see (logs vs metric). Moreover I see a lot of "traps: ... general protection .." in dmesg, may be useful to have as a metric too
[09:22:40] maybe the mtail script is off and doesn't cover all cases?
[09:23:26] I am checking the regexes yes
[09:53:18] serviceops, observability, Patch-For-Review, User-jijiki: Measure segfaults in mediawiki and parsoid servers - https://phabricator.wikimedia.org/T246470 (elukey) It seems that we do have an mtail metric for segfaults, but when checking [[ https://thanos.wikimedia.org/graph?g0.expr=segfault%7Bhos...
[10:00:19] serviceops, SRE, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Jelto) My current plan for cluster-wise migration and deploying services with helm3 is: * make sure cluster is depooled * delete helm releases for all services * remove tiller compon...
[10:00:44] serviceops, SRE, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Jelto)
[10:01:45] serviceops, SRE, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Jelto)
[12:07:00] serviceops, SRE: Deploy PHP patch for DOM replaceChild/removeChild performance - https://phabricator.wikimedia.org/T291052 (MoritzMuehlenhoff) Ack, I'll upload to apt.wikimedia.org on Monday.
[13:35:13] "we are working on the migration script for kfserving-> kserve transition and kserve 0.7 release is going to support raw kubernetes deployment without knative/istio. (edited) "
[13:36:04] at this point istio+knative still makes sense to us (as ML) after the initial investment, but it is a little frustrating
[14:21:09] serviceops, Performance-Team, SRE, Thumbor, User-jijiki: Run latest Thumbor on Docker with Buster + Python 3 - https://phabricator.wikimedia.org/T267327 (Krinkle) Moving back for re-triage as it's been dormant 6 months in a column for things "this quarter".
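As a footnote on the v1beta1 rollout sample linked earlier (docs/samples/v1beta1/rollout): roughly, a v1beta1 InferenceService can shift a slice of traffic to a newly applied model revision via canaryTrafficPercent, which is what made the sample look nice. A hedged sketch with hypothetical names and storage URI; the sample repo is the authoritative version, and exact fields may differ between kfserving and kserve releases.

```yaml
# Sketch only: not taken from the sample, just the general shape of a canary rollout.
apiVersion: serving.kubeflow.org/v1beta1   # kfserving API group; becomes serving.kserve.io/v1beta1 after the rename
kind: InferenceService
metadata:
  name: example-model                      # hypothetical service name
spec:
  predictor:
    canaryTrafficPercent: 10               # route a small share of traffic to the newly applied revision
    sklearn:
      storageUri: gs://example-bucket/model-v2   # hypothetical storage URI for the new model version
```

Once the canary looks healthy, promotion is then, roughly, a matter of raising canaryTrafficPercent (or dropping it so the new revision takes all traffic); the sample in the repo walks through the exact steps.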