[08:05:50] serviceops, Traffic, Datacenter-Switchover: Figure out what changes are needed in the traffic layer for having codfw be the r/w DC for half a year - https://phabricator.wikimedia.org/T337535 (akosiaris)
[08:06:09] serviceops, Traffic, Datacenter-Switchover: Figure out what changes are needed in the traffic layer for having codfw be the r/w DC for half a year - https://phabricator.wikimedia.org/T337535 (akosiaris) p:Triage→Medium
[08:39:09] serviceops, RESTbase Sunsetting, Parsoid (Tracking): Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host parse1013.eqiad.wmnet with OS buster
[08:39:20] serviceops, RESTbase Sunsetting, Parsoid (Tracking): Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host parse1014.eqiad.wmnet with OS buster
[08:39:30] serviceops, RESTbase Sunsetting, Parsoid (Tracking): Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host parse1015.eqiad.wmnet with OS buster
[08:39:53] serviceops, RESTbase Sunsetting, Parsoid (Tracking): Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host parse1016.eqiad.wmnet with OS buster
[09:08:11] serviceops, RESTbase Sunsetting, Parsoid (Tracking): Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host parse1015.eqiad.wmnet with OS buster executed with errors: - parse10...
[09:23:19] serviceops, RESTbase Sunsetting, Parsoid (Tracking): Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host parse1013.eqiad.wmnet with OS buster completed: - parse1013 (**WARN*...
[09:26:23] serviceops, RESTbase Sunsetting, Parsoid (Tracking): Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host parse1014.eqiad.wmnet with OS buster completed: - parse1014 (**WARN*...
[09:28:22] serviceops, RESTbase Sunsetting, Parsoid (Tracking): Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host parse1016.eqiad.wmnet with OS buster completed: - parse1016 (**WARN*...
[10:42:00] hello folks, I'd need an advice about how to proceed for deployment-chart's rake code
[10:42:03] I filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/923538
[10:42:15] not sure if it is the best way, but it solves the problem
[10:42:40] the weird thing though is that I don't see a diff for https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/922583/7/helmfile.d/admin_ng/values/ml-staging-codfw/values.yaml
[10:42:57] I expected helmfile_namespaces to have picked it up
[10:44:21] (I'll also add the mins, realized that they may be required)
[11:11:10] <_joe_> ok, let me take a look
[12:30:28] back sorry, lemme know your thoughts :)
[12:31:55] <_joe_> elukey: lgtm
[12:32:09] <_joe_> it seems like a visualization bug before
[12:34:11] <_joe_> but I need to check a couple things, I'm in lunch break but I'll get back to you before tonight
[12:34:15] <_joe_> is that ok?
[12:34:18] sure!
[13:20:37] o/ jayme bump on https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/922874 , I'd love to be able to deploy that and hopefully https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/922839 on monday
[13:21:45] you have a day off on monday ;)
[13:25:46] oh uh
[13:25:48] tuesday
[13:25:48] :)
[13:25:49] ty
[13:28:28] I'll need to double check on the dashboard and we should probably have some task about defining problem states (e.g. on what to alert) as well
[13:29:19] indeed. dashboard still needs some work, will try to do more of that today. been working more on flink app dash https://grafana-rw.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1
[13:30:51] ah, I see. This is really more a full flink cluster dashboard :)
[13:31:02] yup
[13:31:18] operator dash still needs lots of work https://grafana-rw.wikimedia.org/d/H-sRgqLVk/flink-kubernetes-operator?orgId=1
[13:32:27] btw i waffled and decided to just use release in the dash variables to handle the multiple flink cluster apps in one namespace thing i was asking about in the k8s meeting
[13:32:59] its a bit weird because a release == a cluster which usually also == 1 job, but flink can have multiple jobs in one cluster, so flink metrics don't report it that way
[13:33:05] release is the only label that all metrics have
[13:33:16] so both job_name and release are in the flink app dash variables
[13:35:48] I don't follow
[13:35:56] what is a dash variable in that context?
[13:45:25] a grafana dashboard variable
[13:45:46] i want this dash to be useable for any flink-app chart based deployment
[13:49:20] my waffling is documented here https://phabricator.wikimedia.org/T337496
[13:49:20] :p
[13:50:12] jayme: i think i'm a little confused about how admin_ng helmfile merges values
[13:50:15] https://integration.wikimedia.org/ci/job/helm-lint/10765/console
[13:50:27] i put watchNamespaces in main.yaml
[13:50:27] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/922874/5/helmfile.d/admin_ng/values/main.yaml
[13:50:38] expecting it to be now set for all main (wikikube) cluster groups
[13:51:59] but it was removed for staging-eqiad
[13:54:10] i don't suppose there is a way to set cluster_group wide values for a specific release? (flink-operator) ?
[14:02:03] moved watchNamespaces back to /flink-operator-values.yaml, fixed it.
[14:07:32] See https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/admin_ng/README.md for how values are sourced
[14:46:24] reading the readme...:)
[14:46:26] qq
[14:46:28] The cluster_group (if any) is defined in the clusters values.yaml at values/< .Environment.Name >/values.yaml
[14:46:30] seems to be wrong?
[14:46:45] it is values/<.Environment.Name>.yaml ?
[14:48:14] jayme: do we need a values/main/values.yaml?
[14:48:53] hm, yeah i think these docs are wrong?
[14:50:23] <_joe_> ottomata: if you're talking about admin_ng, your environment is the name of the k8s cluster
[14:51:42] yes
[14:51:43] but
[14:51:47] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/admin_ng/README.md#basic-releases
[14:51:47] says
[14:52:10] oh i see, yes i think i pasted the wrong thing
[14:52:25] releases values: They can be overridden per cluster_group values/< .Values.cluster_group >/values.yaml
[14:53:06] <_joe_> ottomata: maybe you need to take the time to figure it out :)
[14:53:08] there are no values//values.yaml files?
[14:53:52] _joe_: i understand the difference between cluster groups and k8s_cluster == environment
[14:55:03] and, this question came about because I was trying to set a release (flink-operator) value for the main (wikikube) cluster group in values/main.yaml
[14:55:14] and it did not seem to work correctly (the value was not applied for the release)
[14:55:20] at least, in staging
[14:55:42] so, probably I was doing somethign wrong ^, but these docs do seem to be partially incorrect?
[14:57:15] <_joe_> elukey: I think I have the correct fix for the problem
[15:04:01] super, I'll wait CI and then I'll merge all
[15:32:03] serviceops, SRE-OnFire, Traffic, conftool, Sustainability (Incident Followup): Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 (JMeybohm) a:BBlack As you seem to be working on this I'm bluntly assign...
[16:09:21] serviceops, PyBal, Release-Engineering-Team, SRE, and 4 others: High rate of errors and increased latency on uncached MediaWiki requests due to infrastructure outage - https://phabricator.wikimedia.org/T337497 (jcrespo) An initial draft of a postmortem for this issue has been posted at: https://w...
[19:00:05] serviceops, Data-Engineering, Event-Platform Value Stream (Sprint 14 A), Patch-For-Review: New Service Request: flink-kubernetes-operator - https://phabricator.wikimedia.org/T333464 (Ottomata) [[ https://grafana-rw.wikimedia.org/d/H-sRgqLVk/flink-kubernetes-operator?orgId=1&from=now-7d&to=now&var...