[00:39:38] 10serviceops, 10MediaWiki-extensions-Score, 10Shellbox: Score ocassionally gets a 503 response from Shellbox - https://phabricator.wikimedia.org/T287288 (10Legoktm)
[00:40:06] I posted some more log analysis on ^ but I'm not sure where to look next
[05:11:22] 10serviceops, 10DBA, 10Toolhub, 10Patch-For-Review: Setup production database for Toolhub - https://phabricator.wikimedia.org/T271480 (10Marostegui) @bd808 the database is in place and so are the grants. Please note that I have not been able to test the k8s ones, as I don't know which one you'd be using, b...
[05:22:38] 10serviceops, 10MediaWiki-extensions-Score, 10Shellbox: Score ocassionally gets a 503 response from Shellbox - https://phabricator.wikimedia.org/T287288 (10Joe) This is a problem we've encountered before - what I think happens is that envoy (mw side) establishes a persistent connection to envoy (sh side), w...
[05:23:49] 10serviceops, 10Shellbox: php-fpm for shellbox slow log error failed to ptrace(ATTACH) - https://phabricator.wikimedia.org/T288315 (10Joe) Indeed this is surely a problem for mediawiki as well. The bug is more general and we need to fix it for sure.
[05:32:41] 10serviceops, 10MW-on-K8s, 10Shellbox: php-fpm for shellbox slow log error failed to ptrace(ATTACH) - https://phabricator.wikimedia.org/T288315 (10Joe) p:05Triage→03High The problem is that we're not running these containers with the `SYS_PTRACE` capability. I am not sure how that is accomplished on kube...
[05:33:35] 10serviceops, 10MW-on-K8s, 10Shellbox: Applications running on php-fpm in kubernetes fail to save the backtrace for their slowlog - https://phabricator.wikimedia.org/T288315 (10Joe)
[05:48:55] 10serviceops, 10MediaWiki-extensions-Score, 10Shellbox, 10Patch-For-Review: Score ocassionally gets a 503 response from Shellbox - https://phabricator.wikimedia.org/T287288 (10Joe) Also interestingly we had quite a few actual errors from shellbox, see https://logstash.wikimedia.org/goto/2a88f42efee0ce1edea...
[06:10:00] hello folks
[06:10:33] the kfserving chart has an interesting issue, and any suggestion from you would be great to understand how to best solve it
[06:11:58] the chart creates a Secret with TLS credentials (for the webhook), but when I use helmfile sync blabla what happens is that helm tries to spin up the kfserving manager container (that runs the webhook) first, that hands in ContainerCreation due to the missing Secret
[06:12:11] s/hands/hangs
[06:12:44] the problem does not present itself with kubectl apply since all resources are applied without waiting
[06:13:29] one possible solution could be to create a separate chart for the secret and put it as dependency, but it seems overkill
[06:13:44] at the same time I don't have other ideas :D
[06:20:59] jayme: o/ added https://wikitech.wikimedia.org/wiki/Helm#Testing_a_new_version as summary of what we discussed yesterday (basically what I did for helm3)
[06:42:17] 10serviceops, 10SRE, 10Services, 10Wikibase-Quality-Constraints, and 4 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (10Ladsgroup) Thanks <3
[06:51:55] thanks elukey
[06:58:04] np!
[06:58:11] what I meant with my blurb before was https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/710481
[06:58:27] doesn't look pretty but I am not sure how to better solve it
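A minimal sketch of the dependency elukey describes above (all names and paths are illustrative, not the actual kfserving chart templates; the real workload may be a StatefulSet rather than a Deployment): the webhook-serving pod mounts a Secret that the same chart renders, so when helm waits on the workload before the Secret has been applied, the pod sits in ContainerCreating.

```yaml
# Hypothetical illustration of the problem, not the real chart.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kfserving-controller-manager        # assumed name, for illustration only
  namespace: kfserving-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kfserving-controller-manager
  template:
    metadata:
      labels:
        app: kfserving-controller-manager
    spec:
      containers:
        - name: manager
          image: docker-registry.example/kfserving-manager:latest   # placeholder image
          volumeMounts:
            - name: webhook-cert
              mountPath: /tmp/k8s-webhook-server/serving-certs      # assumed path
              readOnly: true
      volumes:
        - name: webhook-cert
          secret:
            # Rendered by another template in the same chart; until that Secret is
            # applied, kubelet cannot populate this volume and the pod stays stuck
            # in ContainerCreating.
            secretName: kfserving-webhook-server-cert               # assumed name
```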
[07:07:51] hmm...that's weird. What you could try to untangle is: Rename the secret.yaml to 00_secret.yaml (IIRC the templates are just rendered and applied in order) or, add the "helm.sh/hook: pre-install" annotation to the secret. That should force helm to apply it before creating the rest of the objects
[07:08:30] maybe the latter is the "right" way of doing it
[07:15:00] ahh interesting so it may be sufficient to just add ""helm.sh/hook": "pre-install"" among the secret's annotations?
[07:16:33] TIL https://helm.sh/docs/topics/charts_hooks/
[07:23:36] jayme: something like this https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/710483 ? So after the templates are rendered, but before the rest (with ordering between namespace and secret)
[07:25:03] oh, you create the namespace in the chart as well? That seems weird given that you'll have to already have tiller in there
[07:25:14] apart from that, yes
[07:25:40] I
[07:26:12] I'd assume that applying that would fail as the tiller service account should not have permissions to change a namespace object
[07:26:16] or create one
[07:26:18] I don't use tiller but helm3 directly
[07:26:26] ah, dang. Sure
[07:26:46] I wanted to move the namespace creation to helmfile but I am really scared about those labels
[07:26:59] are they needed? Maybe not, maybe everything falls apart without them :D
[07:27:11] you'll find out :P
[07:27:27] thanks for the tip, will try to see if it works :)
[07:27:36] 10serviceops, 10SRE Observability (FY2021/2022-Q1), 10User-jijiki: Handle unknown stats in rsyslog_exporter - https://phabricator.wikimedia.org/T210137 (10fgiunchedi) p:05Low→03Medium This is significantly spammy e.g. on Bullseye hosts too (see below), nudging into o11y Q1 ` Aug 6 07:26:19 thanos-fe200...
[07:28:13] but you can always extend helmfile to allow to specify labels for namespaces. Which would be a good thing anyways as it would be nice to have some kind of "name" label on namespaces to be able to filter on those (like in calico rules for example)
[07:29:45] I can try to follow up on that and see how it goes
[07:33:34] 10serviceops, 10DBA, 10Toolhub, 10database-backups, 10Patch-For-Review: Setup production database for Toolhub - https://phabricator.wikimedia.org/T271480 (10jcrespo) @bd808 I've deployed the changes to start backing up toolhub as part of m5 backups. Please ping me back the following tuesday to when you h...
[07:36:29] 10serviceops, 10DBA, 10Toolhub, 10database-backups, 10Patch-For-Review: Setup production database for Toolhub - https://phabricator.wikimedia.org/T271480 (10Marostegui) @bd808 feel free to close this task whenever you think it is all good from your side with the initial setup
[07:52:38] 10serviceops, 10MediaWiki-extensions-Score, 10Shellbox: Score ocassionally gets a 503 response from Shellbox - https://phabricator.wikimedia.org/T287288 (10Legoktm) >>! In T287288#7265748, @Joe wrote: > This is a problem we've encountered before - what I think happens is that envoy (mw side) establishes a p...
[07:55:22] 10serviceops, 10MW-on-K8s, 10Shellbox: Applications running on php-fpm in kubernetes fail to save the backtrace for their slowlog - https://phabricator.wikimedia.org/T288315 (10Legoktm) >>! In T288315#7265752, @Joe wrote: > The problem is that we're not running these containers with the `SYS_PTRACE` capabili...
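A hedged sketch of the "helm.sh/hook": "pre-install" approach jayme suggested above (resource names and values keys are assumptions; this is not the actual content of Gerrit change 710483): the annotation makes helm create the Secret in a pre-install hook, before the rest of the chart's objects, and "helm.sh/hook-weight" can order hooks relative to each other (lower weights run first), which covers the namespace-before-secret ordering mentioned above.

```yaml
# templates/secret.yaml (sketch; names and .Values keys are illustrative assumptions)
apiVersion: v1
kind: Secret
metadata:
  name: kfserving-webhook-server-cert          # assumed name
  namespace: kfserving-system
  annotations:
    "helm.sh/hook": "pre-install"              # applied before the regular chart manifests
    "helm.sh/hook-weight": "1"                 # runs after a weight-0 hook, e.g. the namespace
type: kubernetes.io/tls
data:
  tls.crt: {{ .Values.webhook.cert | b64enc | quote }}   # assumed values keys
  tls.key: {{ .Values.webhook.key | b64enc | quote }}
```

One caveat: hook resources are not managed as part of the release the way regular manifests are, so deletes and upgrades behave a bit differently (see the charts_hooks doc linked above), and in practice one might also want "pre-upgrade" in the annotation, though plain pre-install is what was discussed here.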
[07:59:30] so the hooks looks doing what they are meant, but now I get again the namespace kfserving-system is already created bla bla (even if from the hook it should be created by the yaml config) [07:59:35] going to investigate [08:11:56] ok no it worked, it was probably a temporary hiccup [08:12:17] the chart looks deploying, BUT I of course forgot the stuff for the IP SANs in kfserving [08:12:20] * elukey cries in a corner [08:39:35] kfserving-system kfserving-controller-manager-0 1/1 Running 0 33s [08:39:41] \o/ \o/ \o/ [08:47:08] elukey: :))) [08:48:32] elukey: you know a not-so-distant day we might be running kfserving on the main clusters as well, allowing people to submit "lambda" services quicker than they can now... [08:48:43] We'll just piggyback on your work [08:49:39] I hope it will not be a horror movie for you but something good :D [08:50:10] at this point the ML team is probably ready to test one ores model on top of kubeflow [08:50:16] wow [08:50:27] I am honestly impressed with your work [08:51:17] thanks! A lot of work ahead for all of us, but we are starting to see the light at the end of the tunnel :D [08:51:35] yeah and having something running in production is the way you learn [08:52:07] Andy and Kevin worked on the Ores models docker images, at some point we'd like to show what we have so we can review if any change is needed or not [08:52:22] elukey: are they built with blubber? [08:52:24] they are quite big, ~2G each, but they contain a lot of things [08:52:25] yes yes [08:52:46] yeah so you might find it useful to use dragonfly as well with 2G images... cc jayme [08:52:51] like https://docker-registry.wikimedia.org/wikimedia/machinelearning-liftwing-inference-services-articlequality/tags/ etc.. [08:53:10] so the models for now are inside the image [08:53:17] nono they are on swift [08:53:29] 2G is only for aspell, python deps, etc.. [08:53:30] :( [08:53:31] so why is the image 2GB? [08:53:37] oh damn aspell yes [08:53:58] we can probably cut some 100s of MBs with some clever tricks like removing all docs [08:54:12] and it might be a good idea to have a base image in production-images [08:54:26] dcan you show me the pipeline configs? [08:54:35] at the moment we have a single image with all aspell packages (like we have in prod now for ores) that can be re-used everywhere, the alternative is to craft smaller images for each combination of model/language etc.. [08:54:53] yep lemme find the repo [08:55:23] https://gerrit.wikimedia.org/r/admin/repos/machinelearning/liftwing/inference-services [08:55:46] this is all preliminary stuff, Andy and Kevin are the best poc, I can try to answer some questions if any [08:56:15] (we use blubbler and t he wmf buster base image) [08:57:20] one thing that I was wondering, and I am not sure what's best, is how to package python deps. At the moment the build process involves the use of pip, that opens some question marks about how to track security upgrades etc.. [08:57:49] anyway, nothing is set in stone, we are experimenting, and very open to honest feedback and alternative directions :) [09:02:49] easy enought to integrate the ml-cluster with the current dragonfly infra! 
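For context on the build setup described above (blubber on top of the wmf buster base image, aspell packages plus pip-installed deps), here is a rough sketch of what such a Blubber config could look like. The keys follow Blubber's v4 syntax as best recalled; the package list, requirements file and entrypoint are placeholders, not the actual inference-services configuration.

```yaml
# .pipeline/blubber.yaml (hedged sketch, not the real liftwing config)
version: v4
base: docker-registry.wikimedia.org/buster:latest   # the wmf buster base image mentioned above
apt:
  packages:
    - aspell            # the aspell dictionaries account for much of the image size
    - aspell-en
python:
  version: python3
  requirements:
    - requirements.txt  # pip-installed deps; models themselves live on swift, not in the image
variants:
  production:
    copies: [local]
    entrypoint: ["python3", "model_server.py"]   # placeholder entrypoint
```

If a shared base image in production-images materialises, the `base:` line above is where each model image would point at it instead of the plain buster image.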
[09:10:19] \o/
[09:22:47] so yeah I think building a base image is a good thing
[09:23:02] as for the use of pip - sadly we don't have a good artifact repository yet
[09:23:09] gitlab comes with one though IIRC
[09:23:46] so we can think of having a process to upload new artifacts (like wheels) and then serve pip from our own repository
[09:24:13] ack
[09:25:14] so IIUC we should put the big 2G image into production-image (rather than using blubber configs) and then use it as base image in the inference repo
[09:25:21] *production-images
[09:37:15] 10serviceops, 10MW-on-K8s, 10SRE: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10Joe)
[12:21:47] elukey: did you update helm3 but not helm3-diff on deploy1002?
[12:21:53] because it's now broken
[12:22:18] 10serviceops, 10MW-on-K8s, 10Release Pipeline, 10Patch-For-Review: Evaluate Dragonfly for distribution of docker images - https://phabricator.wikimedia.org/T286054 (10JMeybohm) Dragonfly rolled out and active on main and staging clusters in eqiad and codfw.
[12:24:18] joe: elukey: helm-diff is a completely separate (source) package
[12:24:37] I know, but somehow helm3 is not finding it
[12:24:45] hmm...strange
[12:25:12] what I wanted to say is: I was not aware that they are tightly coupled (they were not in the past)
[12:25:33] I can take a look in ~30m if that's enough
[12:27:33] yeah I got some strange results when trying to use "apply"
[12:57:04] here I am sorry, did I forget to do something with helm??
[12:58:18] elukey: not sure what went wrong earlier, now helm3 diff seems to work
[12:58:45] joe: ack if I broke something or if I need to follow up lemme know!
[13:04:39] 10serviceops, 10MW-on-K8s: Only schedule mediawiki pods on nodes with non-spinning disks - https://phabricator.wikimedia.org/T288345 (10Joe)
[13:07:11] 10serviceops, 10MW-on-K8s: Only schedule mediawiki pods on nodes with non-spinning disks - https://phabricator.wikimedia.org/T288345 (10Joe) For now I could limit mwdebug to run on physical nodes with node number higher than 4, but I'd think we might want this annotation for future use too
[13:27:55] joe: so the diff now works?
[13:27:59] * jayme confised
[13:28:04] *confused
[13:31:49] jayme: so I had some bad behaviour, but I finally got what I wanted, so I guess the next person working with the admin_ng stuff will find out
[13:32:23] is there a way to repro??
[13:32:31] I can try to work on it during the next days
[13:42:20] elukey: no idea!
[13:43:22] 10serviceops, 10MW-on-K8s: Only schedule mediawiki pods on nodes with non-spinning disks - https://phabricator.wikimedia.org/T288345 (10JMeybohm) We could maybe use the well known annotation `node.kubernetes.io/instance-type` [1] for this and come up with some type definitions for vm's and hardware nodes. It w...
[13:43:56] 10serviceops, 10MW-on-K8s, 10Kubernetes: Only schedule mediawiki pods on nodes with non-spinning disks - https://phabricator.wikimedia.org/T288345 (10JMeybohm)
[16:13:28] 10serviceops, 10Peek, 10Security-Team, 10user-sbassett: Decommission peek2001 VM - https://phabricator.wikimedia.org/T288290 (10Dzahn) a:03Dzahn ACK, I can do this.
[16:38:52] 10serviceops, 10Peek, 10Security-Team, 10Patch-For-Review, 10user-sbassett: Decommission peek2001 VM - https://phabricator.wikimedia.org/T288290 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `peek2001.codfw.wmnet` - peek2001.codfw.wmnet (**PASS**) - Down...
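On the T288345 thread above (only scheduling mediawiki pods on nodes with non-spinning disks): a minimal sketch of how the well-known `node.kubernetes.io/instance-type` label could steer placement, assuming the nodes are labelled with locally defined type values. The values `hw-ssd` and the pod spec snippet are purely illustrative assumptions, not something that exists on the clusters.

```yaml
# Hypothetical fragment of a Deployment spec; assumes nodes carry a
# node.kubernetes.io/instance-type label with site-defined values.
spec:
  template:
    spec:
      nodeSelector:
        node.kubernetes.io/instance-type: hw-ssd   # illustrative value
      # Alternatively, a softer preference instead of a hard selector:
      # affinity:
      #   nodeAffinity:
      #     preferredDuringSchedulingIgnoredDuringExecution:
      #       - weight: 100
      #         preference:
      #           matchExpressions:
      #             - key: node.kubernetes.io/instance-type
      #               operator: In
      #               values: ["hw-ssd"]
```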
[16:44:21] 10serviceops, 10Peek, 10Security-Team, 10Patch-For-Review, 10user-sbassett: Decommission peek2001 VM - https://phabricator.wikimedia.org/T288290 (10Dzahn) 05Open→03Resolved done! thanks to the decom cookbook this VM is already deleted and removed from DNS now I removed it from site.pp and DHCP but N...
[17:52:11] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10Cmjohnson)
[17:57:35] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10Cmjohnson)
[18:16:44] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10Cmjohnson)
[18:17:08] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10Cmjohnson) the on-site specific work has been completed
[18:54:52] 10serviceops, 10SRE, 10decommission-hardware, 10ops-eqiad, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Cmjohnson) decom script executed and servers removed from racks for mw1261-1266 rack A5 mw1269-1275...
[19:30:22] 10serviceops, 10SRE, 10decommission-hardware, 10ops-eqiad, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Cmjohnson)