[08:05:50] <wikibugs>	 serviceops, Traffic, Datacenter-Switchover: Figure out what changes are needed in the traffic layer for having codfw be the r/w DC for half a year - https://phabricator.wikimedia.org/T337535 (akosiaris)
[08:06:09] <wikibugs>	 serviceops, Traffic, Datacenter-Switchover: Figure out what changes are needed in the traffic layer for having codfw be the r/w DC for half a year - https://phabricator.wikimedia.org/T337535 (akosiaris) p:Triage→Medium
[08:39:09] <wikibugs>	 serviceops, RESTbase Sunsetting, Parsoid (Tracking): Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host parse1013.eqiad.wmnet with OS buster
[08:39:20] <wikibugs>	 serviceops, RESTbase Sunsetting, Parsoid (Tracking): Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host parse1014.eqiad.wmnet with OS buster
[08:39:30] <wikibugs>	 serviceops, RESTbase Sunsetting, Parsoid (Tracking): Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host parse1015.eqiad.wmnet with OS buster
[08:39:53] <wikibugs>	 serviceops, RESTbase Sunsetting, Parsoid (Tracking): Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host parse1016.eqiad.wmnet with OS buster
[09:08:11] <wikibugs>	 serviceops, RESTbase Sunsetting, Parsoid (Tracking): Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host parse1015.eqiad.wmnet with OS buster executed with errors: - parse10...
[09:23:19] <wikibugs>	 serviceops, RESTbase Sunsetting, Parsoid (Tracking): Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host parse1013.eqiad.wmnet with OS buster completed: - parse1013 (**WARN*...
[09:26:23] <wikibugs>	 serviceops, RESTbase Sunsetting, Parsoid (Tracking): Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host parse1014.eqiad.wmnet with OS buster completed: - parse1014 (**WARN*...
[09:28:22] <wikibugs>	 serviceops, RESTbase Sunsetting, Parsoid (Tracking): Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host parse1016.eqiad.wmnet with OS buster completed: - parse1016 (**WARN*...
[10:42:00] <elukey>	 hello folks, I need some advice on how to proceed with deployment-charts' rake code
[10:42:03] <elukey>	 I filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/923538
[10:42:15] <elukey>	 not sure if it is the best way, but it solves the problem
[10:42:40] <elukey>	 the weird thing though is that I don't see a diff for https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/922583/7/helmfile.d/admin_ng/values/ml-staging-codfw/values.yaml
[10:42:57] <elukey>	 I expected helmfile_namespaces to have picked it up
[10:44:21] <elukey>	 (I'll also add the mins, realized that they may be required)
[11:11:10] <_joe_>	 ok, let me take a look
[12:30:28] <elukey>	 back sorry, lemme know your thoughts :)
[12:31:55] <_joe_>	 elukey: lgtm
[12:32:09] <_joe_>	 it seems like a visualization bug before
[12:34:11] <_joe_>	 but I need to check a couple of things, I'm on lunch break but I'll get back to you before tonight
[12:34:15] <_joe_>	 is that ok?
[12:34:18] <elukey>	 sure!
[13:20:37] <ottomata>	 o/ jayme bump on https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/922874  , I'd love to be able to deploy that and hopefully https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/922839 on monday
[13:21:45] <jayme>	 you have a day off on monday ;)
[13:25:46] <ottomata>	 oh uh
[13:25:48] <ottomata>	 tuesday
[13:25:48] <ottomata>	 :)
[13:25:49] <ottomata>	 ty
[13:28:28] <jayme>	 I'll need to double check on the dashboard and we should probably have some task about defining problem states (e.g. on what to alert) as well
[13:29:19] <ottomata>	 indeed.  dashboard still needs some work, will try to do more of that today.  been working more on flink app dash https://grafana-rw.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1
[13:30:51] <jayme>	 ah, I see. This is really more a full flink cluster dashboard :)
[13:31:02] <ottomata>	 yup
[13:31:18] <ottomata>	 operator dash still needs lots of work https://grafana-rw.wikimedia.org/d/H-sRgqLVk/flink-kubernetes-operator?orgId=1
[13:32:27] <ottomata>	 btw i waffled and decided to just use release in the dash variables to handle the multiple flink cluster apps in one namespace thing i was asking about in the k8s meeting
[13:32:59] <ottomata>	 it's a bit weird because a release == a cluster which usually also == 1 job, but flink can have multiple jobs in one cluster, so flink metrics don't report it that way
[13:33:05] <ottomata>	 release is the only label that all metrics have
[13:33:16] <ottomata>	 so both job_name and release are in the flink app dash variables
[13:35:48] <jayme>	 I don't follow
[13:35:56] <jayme>	 what is a dash variable in that context?
[13:45:25] <ottomata>	 a grafana dashboard variable
[13:45:46] <ottomata>	 i want this dash to be useable for any flink-app chart based deployment
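A rough sketch of the idea ottomata describes above: every query filters on `release` (the one label all metrics share), and `job_name` is added only where a metric actually carries it. The function names and the metric name in the usage lines are hypothetical and only illustrate how two dashboard variables could combine into a PromQL label selector; the real dashboard does this with Grafana variables, not Python.

```python
def flink_selector(release, job_name=None):
    """Build a PromQL label selector from the two dashboard variable values.

    `release` is always applied; `job_name` is optional because, per the
    discussion, not every metric reports it.
    """
    labels = {"release": release}
    if job_name is not None:
        labels["job_name"] = job_name
    inner = ", ".join(f'{key}="{value}"' for key, value in labels.items())
    return "{" + inner + "}"


def flink_query(metric, release, job_name=None):
    """Attach the selector to a metric name (metric names here are made up)."""
    return metric + flink_selector(release, job_name)


# Hypothetical usage: one release (== one flink cluster), optionally one job.
print(flink_query("hypothetical_cluster_metric", "my-release"))
print(flink_query("hypothetical_job_metric", "my-release", "my-job"))
```

Keeping `release` mandatory and `job_name` optional mirrors the point that a release maps to a cluster while a cluster may run more than one job.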
[13:49:20] <ottomata>	 my waffling is documented here https://phabricator.wikimedia.org/T337496
[13:49:20] <ottomata>	 :p
[13:50:12] <ottomata>	 jayme: i think i'm a little confused about how admin_ng helmfile merges values
[13:50:15] <ottomata>	 https://integration.wikimedia.org/ci/job/helm-lint/10765/console
[13:50:27] <ottomata>	 i put watchNamespaces in main.yaml
[13:50:27] <ottomata>	 https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/922874/5/helmfile.d/admin_ng/values/main.yaml
[13:50:38] <ottomata>	 expecting it to be now set for all main (wikikube) cluster groups
[13:51:59] <ottomata>	 but it was removed for staging-eqiad
[13:54:10] <ottomata>	 i don't suppose there is a way to set cluster_group wide values for a specific release? (flink-operator) ?
[14:02:03] <ottomata>	 moved watchNamespaces back to <env>/flink-operator-values.yaml, fixed it.
[14:07:32] <jayme>	 See https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/admin_ng/README.md for how values are sourced
[14:46:24] <ottomata>	 reading the readme...:)
[14:46:26] <ottomata>	 qq
[14:46:28] <ottomata>	 The cluster_group (if any) is defined in the clusters values.yaml at values/< .Environment.Name >/values.yaml
[14:46:30] <ottomata>	 seems to be wrong?
[14:46:45] <ottomata>	 it is values/<.Environment.Name>.yaml ?
[14:48:14] <ottomata>	 jayme:  do we need a values/main/values.yaml?
[14:48:53] <ottomata>	 hm, yeah i think these docs are wrong?
[14:50:23] <_joe_>	 ottomata: if you're talking about admin_ng, your environment is the name of the k8s cluster
[14:51:42] <ottomata>	 yes
[14:51:43] <ottomata>	 but
[14:51:47] <ottomata>	 https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/admin_ng/README.md#basic-releases
[14:51:47] <ottomata>	 says
[14:52:10] <ottomata>	 oh i see, yes i think i pasted the wrong thing
[14:52:25] <ottomata>	 releases values: They can be overridden per cluster_group values/< .Values.cluster_group >/values.yaml
[14:53:06] <_joe_>	 ottomata: maybe you need to take the time to figure it out :)
[14:53:08] <ottomata>	 there are no values/<cluster_group>/values.yaml files? 
[14:53:52] <ottomata>	 _joe_:  i understand the difference between cluster groups and k8s_cluster == environment
[14:55:03] <ottomata>	 and, this question came about because I was trying to set a release (flink-operator) value for the main (wikikube) cluster group in values/main.yaml
[14:55:14] <ottomata>	 and it did not seem to work correctly (the value was not applied for the release)
[14:55:20] <ottomata>	 at least, in staging
[14:55:42] <ottomata>	 so, probably I was doing something wrong ^, but these docs do seem to be partially incorrect?
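For readers following the values question, here is a minimal sketch of layered value merging in general. It assumes that layers are applied in order, that later layers win, that nested maps merge key by key, and that lists are replaced wholesale (as Helm does for list values). The layer order, the key layout and the namespace names below are assumptions for illustration only; the admin_ng README linked above is the authority on how values are actually sourced.

```python
from copy import deepcopy


def merge_values(base, override):
    """Return base deep-merged with override; override wins on conflicts.

    Nested dicts are merged key by key. Any other value, including a list,
    replaces the base value entirely.
    """
    result = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge_values(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


def layered_values(*layers):
    """Merge layers left to right, so the last layer has highest precedence."""
    merged = {}
    for layer in layers:
        merged = merge_values(merged, layer)
    return merged


# Hypothetical layers: a cluster-group file, then a per-environment file.
group_layer = {"operator": {"watchNamespaces": ["ns-a", "ns-b"], "replicas": 2}}
env_layer = {"operator": {"watchNamespaces": ["ns-staging"]}}

print(layered_values(group_layer, env_layer))
```

In this sketch a list set in a later layer drops the earlier list rather than extending it, which is one reason a value set at a broader level can appear to vanish for a single environment. Whether that is what happened in the staging case above is not established here.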
[14:57:15] <_joe_>	 elukey: I think I have the correct fix for the problem
[15:04:01] <elukey>	 super, I'll wait for CI and then I'll merge all
[15:32:03] <wikibugs>	 serviceops, SRE-OnFire, Traffic, conftool, Sustainability (Incident Followup): Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 (JMeybohm) a:BBlack As you seem to be working on this I'm bluntly assign...
[16:09:21] <wikibugs>	 serviceops, PyBal, Release-Engineering-Team, SRE, and 4 others: High rate of errors and increased latency on uncached MediaWiki requests due to infrastructure outage - https://phabricator.wikimedia.org/T337497 (jcrespo) An initial draft of a postmortem for this issue has been posted at: https://w...
[19:00:05] <wikibugs>	 serviceops, Data-Engineering, Event-Platform Value Stream (Sprint 14 A), Patch-For-Review: New Service Request: flink-kubernetes-operator - https://phabricator.wikimedia.org/T333464 (Ottomata) [[ https://grafana-rw.wikimedia.org/d/H-sRgqLVk/flink-kubernetes-operator?orgId=1&from=now-7d&to=now&var...