[06:54:17] serviceops, MW-on-K8s, Release Pipeline, Patch-For-Review: Evaluate Dragonfly for distribution of docker images - https://phabricator.wikimedia.org/T286054 (JMeybohm)
[07:32:40] Hi! is there a way to access a kubernetes dashboard, so I can get to see how flink is doing?
[07:33:28] effie: ^ ?
[07:37:19] <_joe_> zpapierski: I'm not sure what you mean, we collect metrics from kubernetes but if you want flink-specific metrics you will need to build a specific dashboard
[07:37:39] no, I mean this - https://kubernetes.io/docs/tasks/access-application-cluster/web-ui-dashboard/
[07:38:54] <_joe_> oh no we definitely don't use that
[07:39:07] <_joe_> usually we get the same data in grafana though
[07:39:21] <_joe_> or, I think you're able to ssh to deploy1002, right?
[07:40:53] <_joe_> in that case, you can also use kubectl to get some of those data
[07:42:13] <_joe_> zpapierski: what is the name of your deployment? We can cook up a dashboard for you I think if there isn't one already
[07:43:19] <_joe_> zpapierski: so this https://grafana-rw.wikimedia.org/d/-D2KNUEGk/jayme-kubernetes-pod-details?orgId=1&var-datasource=codfw%20prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-pod=All is ~ what you were looking for
[07:44:09] thanks _joe_ - I can log in and I'm using kubectl, just was wondering if web ui is available as well
[07:44:19] thanks for the dashboard, it will be useful
[07:44:37] <_joe_> that grafana dashboard is the minimum, but I expect flink to also emit its own metrics
[07:46:20] <_joe_> zpapierski: ok our good jayme fixed the dashboards names, now you have the pod details one https://grafana-rw.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&var-datasource=codfw%20prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-pod=All
[07:47:03] <_joe_> and the container details one https://grafana-rw.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&var-datasource=codfw%20prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-pod=flink-session-cluster-main-6bbb9d6cc-rhjv7&var-container=All
[07:48:13] zpapierski: for flink, you can/should potentially create a dedicated dashboard (maybe "the internet" already has something for you to start from)
[07:48:33] we have one for flink
[07:48:48] but, it's mostly focusing on app metrics
[07:48:57] ah, okay
[07:52:17] <_joe_> zpapierski: we usually add at the bottom some graphs to also capture the status of the pods
[07:52:57] <_joe_> see for instance https://grafana-rw.wikimedia.org/d/NJkCVermz/citoid (the part under "saturation")
[07:57:07] hmm, that makes sense
[08:28:50] morning
[09:00:30] serviceops, MW-on-K8s, Kubernetes: Only schedule mediawiki pods on nodes with non-spinning disks - https://phabricator.wikimedia.org/T288345 (JMeybohm) Open→Resolved a:JMeybohm This is done for mediawiki pods now. For improvement of the current solution I did create T288509
[09:25:15] hello folks
[09:25:25] if you have time, I'd need a brainbounce
[09:25:54] knative returns to me this error, that I got while testing on minikube (and it was expected, there is a workaround to skip it)
[09:26:05] Unable to fetch image "docker-registry.wikimedia.org/wikimedia/machinelearning-liftwing-inference-services-editquality:2021-07-28-204847-production": failed to resolve image to digest: Get "https://docker-registry.wikimedia.org/v2/": x509: certificate signed by unknown authority
[09:26:26] I wasn't expecting it in "prod" fetching from docker-registry.wikimedia.org
[09:26:57] (basically knative controller tries to get the uuid for a certain image tag the first time that it deploys it)
[09:27:16] nice
[09:27:38] you need the puppet ca in knative controller then to verify the cert of the registry
[09:28:11] oh..wait...wikimedia.org
[09:28:17] this is the part that puzzles me - that should be an external cert no?
[09:28:20] you need some kind of ca then :)
[09:28:20] yeah exactly
[09:28:41] ca-certificates is not installed by default in containers
[09:28:57] ahhhh this can explain
[09:29:10] but you'd need egress as well I guess
[09:29:30] I have no policies at all right now so a problem for later :P :P
[09:29:34] hrhr
[09:29:45] poor future Luca :)
[09:30:03] try to imagine future Luca with PSP/ingress/egress/etc..
[09:30:37] I don't really envy him
[09:31:03] (nor future Janis that will probably need to tell poor Luca what to do)
[09:31:11] :D
[09:31:30] going to add the ca-certificates package to the controller image, seems the best way forward
[09:31:42] (the alternative is to mount the cert via volumes etc.. but it seems overkill)
[09:32:22] in that case you can also add the puppet-ca debian package and use docker-registry.discovery.wmnet for consistency
[09:36:30] do you mean in knative's config?
[09:37:26] (in the meantime I am testing https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/711111)
[09:38:01] I meant wherever "docker-registry.wikimedia.org/wikimedia/machinelearning-liftwing-inference-services-editquality:2021-07-28-204847-production" comes from
[09:38:32] ah, that's potentially from the objects
[09:39:07] so you should definitely be using discovery.wmnet there to not pull from CDN/potentially be able to use dragonfly etc.
[09:47:08] jayme: ahhh I just got what we do, the charts reference docker-registry.wikimedia.org but we override with discovery.wmnet in helmfile.d
[09:47:22] something that of course I don't do in knative
[09:47:28] perfect now I get it
[09:47:41] I'll add the override in the next knative deployment
[09:53:39] (that is https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/711113 hopefully)
[09:53:47] thanks a lot for all the info :)
[11:12:26] serviceops, SRE, Patch-For-Review: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (Dzahn) Open→Stalled
[11:12:32] serviceops, SRE, Release-Engineering-Team (Radar): Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Dzahn)
[11:12:41] serviceops, SRE, Patch-For-Review: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (Dzahn)
[11:28:48] can somebody help me with kubernetes networking setup? Task managers (component taskmanager) in our flink app (service rdf-streaming-updater) don't seem to be able to communicate with each other (and they need to). What can I do to make sure they can?
[13:03:43] <_joe_> elukey: never use docker-registry.wikimedia.org for an image on k8s
[13:04:06] <_joe_> *always* use the internal registry url, docker-registry.discovery.wmnet
[13:04:23] <_joe_> the external-facing registry is just a service to the public
[13:04:39] <_joe_> and could be turned off at a second's notice if it creates issues
[13:04:39] _joe_ sure I only missed the config in admin_ng
[13:05:02] <_joe_> didn't we have a psp ensuring it? or was that just on toolforge?
[13:06:12] I am working a little outside psp right now, so this might have worked for that reason
[13:06:39] (I add psp when needed, plus we don't currently have egress rules etc...)
[13:06:45] (globalnetpolicies etc..)
[13:06:58] working on it :)
[13:07:25] _joe_: toolforge enforces that everything comes from docker-registry.tools.wmflabs.org, we don't allow anything directly from docker-registry.wikimedia.org
[13:07:41] plus it's a validating webhook, not a psp
[13:07:42] <_joe_> majavah: yeah I meant a psp restricting the use of a single registry
[13:07:54] <_joe_> originally yuvi wrote it as an admission controller
[13:08:03] <_joe_> oh it's still a webhook
[13:09:07] yes, might be moved to opa gatekeeper when that replaces our psp:s, but for now it's a webhook
[13:22:26] serviceops, MW-on-K8s, Shellbox, Patch-For-Review: Applications running on php-fpm in kubernetes fail to save the backtrace for their slowlog - https://phabricator.wikimedia.org/T288315 (Joe) Open→Resolved a:Joe We now get proper slow logs from php-fpm and they can properly looked at...
[14:09:53] serviceops, SRE, Traffic: Unexpected upload speed to commons - https://phabricator.wikimedia.org/T288481 (aborrero) I also tried using the pywikibot upload script, with similar result. This time, however, the script mentions ` action 'upload', server said: ('internal_api_error_DBQueryError', '[d8e17...
[14:29:44] _joe_: unfortunately we can't control such things via PSP. That, as well as "don't use :latest" always need webhooks
[14:50:59] serviceops, MW-on-K8s, Release Pipeline, Patch-For-Review: Evaluate Dragonfly for distribution of docker images - https://phabricator.wikimedia.org/T286054 (JMeybohm)
[16:08:11] <_joe_> jayme: I never looked into it after writing that admission controller with yuvi back in the day
[16:08:37] _joe_: but that thing was never deployed then?
[16:09:21] <_joe_> no it was for toolforge, we did that for kubernetes... 1.0 or something :P
[16:09:29] ah, okay
[16:09:45] <_joe_> then I completely lost track of it, I remembered it was converted to smth else
[16:10:11] there are some options now which we likely will explore when we need to move from PSP to * anyways
[16:11:23] <_joe_> I love how we know there won't be ingress or psp eventually
[16:11:31] <_joe_> we've been told they're deprecated
[16:11:42] <_joe_> but we can't know what is coming exactly
[16:14:59] * jayme throws confetti
[16:15:12] * jayme quietly leaves the room
[16:21:59] I don't always read this channel, but when I do I like what I see
[17:54:07] serviceops, DC-Ops, SRE, ops-eqiad, User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (Cmjohnson) a:Cmjohnson→RobH
[18:47:21] serviceops, DC-Ops, SRE, ops-eqiad, User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (RobH)
[18:54:17] serviceops, SRE, Traffic, Datacenter-Switchover, Patch-For-Review: Services without a service IP cannot automatically be switched by the switchdc cookbook - https://phabricator.wikimedia.org/T285707 (Legoktm) p:High→Medium
[19:12:27] serviceops, DC-Ops, SRE, ops-eqiad, User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['mc1039.eqiad.wmnet', 'mc1040.eqiad.w...
[19:14:14] serviceops, DC-Ops, SRE, ops-eqiad, User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (RobH)
[20:04:03] serviceops, DC-Ops, SRE, ops-eqiad, User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (RobH)
[20:14:18] serviceops, DC-Ops, SRE, ops-eqiad, User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['mc1041.eqiad.wmnet', 'mc1042.eqiad.w...
[20:19:58] serviceops, MW-on-K8s, SRE, Patch-For-Review: Install wiki-specific php extensions in the mediawiki production image - https://phabricator.wikimedia.org/T285309 (Krinkle)
[20:20:03] serviceops, MW-on-K8s, SRE, Patch-For-Review, User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (Krinkle)
[20:45:21] serviceops, DC-Ops, SRE, ops-eqiad, User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc1041.eqiad.wmnet', 'mc1042.eqiad.wmnet', 'mc1043.eqiad.wmnet', 'mc1044.eqiad.wmnet...
[21:18:09] serviceops, DC-Ops, SRE, ops-eqiad, User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['mc1046.eqiad.wmnet', 'mc1047.eqiad.w...
[21:19:44] serviceops, DC-Ops, SRE, ops-eqiad, User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (RobH)
[21:57:46] serviceops, DC-Ops, SRE, ops-eqiad, User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc1046.eqiad.wmnet', 'mc1047.eqiad.wmnet', 'mc1048.eqiad.wmnet', 'mc1049.eqiad.wmnet...
[22:12:26] serviceops, DC-Ops, SRE, ops-eqiad, User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (RobH) Open→Resolved
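
Note on the registry exchange from [13:03:43] to [14:29:44]: the two rules mentioned there (images on k8s come from docker-registry.discovery.wmnet, and ":latest" is not used) are the kind of check a validating webhook would make. The sketch below is only an illustration of that check in plain Python. It is not the toolforge webhook, the function names are hypothetical, and it assumes fully qualified image references of the form registry/path:tag.

```python
# Hypothetical policy value, taken from the discussion in the log above.
ALLOWED_REGISTRY = "docker-registry.discovery.wmnet"


def split_image_ref(image):
    """Split 'registry/path:tag' into (registry, path, tag).

    The tag is '' when the reference has none. Digest references
    (name@sha256:...) are not treated specially in this sketch.
    """
    registry, _, remainder = image.partition("/")
    last_segment = remainder.rsplit("/", 1)[-1]
    if ":" in last_segment:
        path, _, tag = remainder.rpartition(":")
    else:
        path, tag = remainder, ""
    return registry, path, tag


def check_image(image):
    """Return a list of policy problems for one image reference (empty if none)."""
    registry, _path, tag = split_image_ref(image)
    problems = []
    if registry != ALLOWED_REGISTRY:
        problems.append(f"registry {registry!r} is not {ALLOWED_REGISTRY!r}")
    if tag in ("", "latest"):
        problems.append("image must be pinned to a tag other than 'latest'")
    return problems


if __name__ == "__main__":
    ref = (
        "docker-registry.wikimedia.org/wikimedia/"
        "machinelearning-liftwing-inference-services-editquality:"
        "2021-07-28-204847-production"
    )
    for problem in check_image(ref):
        print(problem)
```

Run against the image from the [09:26:05] error message, this reports only the registry problem, since that reference is already pinned to a dated tag. A real webhook would apply the same check to every container image in an incoming pod spec.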