[00:22:41] (CR) Ssingh: [C: +2] cp5029: update site.pp and related configs for cp role [puppet] - https://gerrit.wikimedia.org/r/858674 (https://phabricator.wikimedia.org/T322048) (owner: Ssingh)
[00:23:50] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5029.eqsin.wmnet with OS buster
[00:23:59] SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5029.eqsin.wmnet with OS buster
[00:39:47] RECOVERY - SSH on mw1329.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:40:25] PROBLEM - Check systemd state on kubernetes1014 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:42:55] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:57] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:50:29] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5029.eqsin.wmnet with OS buster
[00:50:38] SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5029.eqsin.wmnet with OS buster
[00:50:41] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp5029.eqsin.wmnet with OS buster
[00:50:49] SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5029.eqsin.wmnet with OS buster executed with errors: - cp5029 (**...
[00:51:30] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5029.eqsin.wmnet with OS buster
[00:51:39] SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5029.eqsin.wmnet with OS buster
[00:55:41] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1014 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[00:59:59] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST configurations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:06:29] RECOVERY - Check systemd state on kubernetes1014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:08:33] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5029.eqsin.wmnet with OS buster
[01:08:40] SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5029.eqsin.wmnet with OS buster executed with errors: - cp5029 (**...
[01:08:54] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5029.eqsin.wmnet with OS buster
[01:26:43] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1014 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[01:37:45] (JobUnavailable) firing: (5) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:37:48] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5029.eqsin.wmnet with reason: host reimage
[01:41:22] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5029.eqsin.wmnet with reason: host reimage
[01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:47:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:52:45] (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:58:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:03:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:07:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:12:30] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5029.eqsin.wmnet with OS buster
[03:50:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[03:55:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:47:36] (PS4) PleaseStand: admin: Clean up duplication in schema.yaml [puppet] - https://gerrit.wikimedia.org/r/820891 (https://phabricator.wikimedia.org/T320937)
[04:47:38] (PS6) PleaseStand: admin: Add realname, email existence constraints to schema.yaml [puppet] - https://gerrit.wikimedia.org/r/820862 (https://phabricator.wikimedia.org/T320937)
[05:00:14] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST configurations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:05:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[05:29:58] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:32:57] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[06:39:59] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:00:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[07:29:03] (CR) Giuseppe Lavagetto: role::kubernetes::wroker: allow scap to pre-pull mediawiki images (4 comments) [puppet] - https://gerrit.wikimedia.org/r/858543 (https://phabricator.wikimedia.org/T323349) (owner: Giuseppe Lavagetto)
[07:53:54] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 105 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:55:56] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:00:05] Amir1 and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221121T0800).
[08:00:05] Urbanecm: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:15:25] (PS2) Giuseppe Lavagetto: role::kubernetes::wroker: allow scap to pre-pull mediawiki images [puppet] - https://gerrit.wikimedia.org/r/858543 (https://phabricator.wikimedia.org/T323349)
[08:21:57] * urbanecm missed the ping
[08:21:58] starting
[08:22:26] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/858414 (https://phabricator.wikimedia.org/T318457) (owner: Urbanecm)
[08:22:35] (PS2) Urbanecm: GrowthExperiments: Enable unstarred mentorship filters at all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/858414 (https://phabricator.wikimedia.org/T318457)
[08:22:40] (CR) Urbanecm: [C: +2] GrowthExperiments: Enable unstarred mentorship filters at all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/858414 (https://phabricator.wikimedia.org/T318457) (owner: Urbanecm)
[08:22:46] (CR) TrainBranchBot: "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/858414 (https://phabricator.wikimedia.org/T318457) (owner: Urbanecm)
[08:23:18] PROBLEM - SSH on db1109.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:23:27] (Merged) jenkins-bot: GrowthExperiments: Enable unstarred mentorship filters at all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/858414 (https://phabricator.wikimedia.org/T318457) (owner: Urbanecm)
[08:23:46] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:858414|GrowthExperiments: Enable unstarred mentorship filters at all wikis (T318457)]]
[08:23:52] T318457: Enable "Your unstarred mentees" at the biggest Growth wikis - https://phabricator.wikimedia.org/T318457
[08:24:16] !log urbanecm@deploy1002 urbanecm and urbanecm: Backport for [[gerrit:858414|GrowthExperiments: Enable unstarred mentorship filters at all wikis (T318457)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[08:25:20] (PS1) Giuseppe Lavagetto: scap::dsh: add kubernetes-workers dsh list [puppet] - https://gerrit.wikimedia.org/r/858987 (https://phabricator.wikimedia.org/T323349)
[08:25:22] (PS1) Giuseppe Lavagetto: scap: add mw on k8s dsh list [puppet] - https://gerrit.wikimedia.org/r/858988 (https://phabricator.wikimedia.org/T323349)
[08:31:50] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:858414|GrowthExperiments: Enable unstarred mentorship filters at all wikis (T318457)]] (duration: 08m 04s)
[08:31:56] T318457: Enable "Your unstarred mentees" at the biggest Growth wikis - https://phabricator.wikimedia.org/T318457
[08:34:54] * urbanecm done
[08:39:06] PROBLEM - SSH on mw1334.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:40:53] (PS1) Aklapper: phabricator weekly changes email: Only list personal Herald rules [puppet] - https://gerrit.wikimedia.org/r/858989 (https://phabricator.wikimedia.org/T323466)
[08:42:27] (PS2) Aklapper: phabricator weekly changes email: Only list personal Herald rules [puppet] - https://gerrit.wikimedia.org/r/858989 (https://phabricator.wikimedia.org/T323466)
[08:44:31] (CR) Filippo Giunchedi: [C: +1] "LGTM!" [cookbooks] - https://gerrit.wikimedia.org/r/804575 (owner: Jbond)
[08:47:37] (PS1) Aklapper: phabricator weekly changes email: Reorder output for column changes [puppet] - https://gerrit.wikimedia.org/r/858990
[09:09:30] (PS2) Giuseppe Lavagetto: etcd: add records compatible with the v3 etcd library [dns] - https://gerrit.wikimedia.org/r/841138 (https://phabricator.wikimedia.org/T320397)
[09:13:35] (CR) Vgutierrez: [C: +1] etcd: add records compatible with the v3 etcd library [dns] - https://gerrit.wikimedia.org/r/841138 (https://phabricator.wikimedia.org/T320397) (owner: Giuseppe Lavagetto)
[09:14:55] (CR) Giuseppe Lavagetto: [C: +2] etcd: add records compatible with the v3 etcd library [dns] - https://gerrit.wikimedia.org/r/841138 (https://phabricator.wikimedia.org/T320397) (owner: Giuseppe Lavagetto)
[09:15:07] !log restart ml-serve-codfw's kube-apiserver to clear some knative LIST certificate workload (still not sure what it is but it seems a bug related to our ancient version)
[09:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:59] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:20:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[09:21:46] (PS1) Aklapper: phabricator weekly changes email: List dashboard changes [puppet] - https://gerrit.wikimedia.org/r/858994 (https://phabricator.wikimedia.org/T323471)
[09:22:58] (PS1) Elukey: profile::kubernetes::node: add k8s_116 tag to infrapod's default [puppet] - https://gerrit.wikimedia.org/r/858995 (https://phabricator.wikimedia.org/T322920)
[09:23:49] (PS2) Elukey: profile::kubernetes::node: add k8s_116 tag to infrapod's default [puppet] - https://gerrit.wikimedia.org/r/858995 (https://phabricator.wikimedia.org/T322920)
[09:24:10] RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:26:59] (CR) JMeybohm: [C: +1] profile::kubernetes::node: add k8s_116 tag to infrapod's default [puppet] - https://gerrit.wikimedia.org/r/858995 (https://phabricator.wikimedia.org/T322920) (owner: Elukey)
[09:28:31] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: sync
[09:29:11] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: sync
[09:29:56] (PS2) Daniel Kinzler: Set parser cache write propability for /page/html endpoint. [mediawiki-config] - https://gerrit.wikimedia.org/r/858687 (https://phabricator.wikimedia.org/T322672)
[09:29:59] (KubernetesAPILatency) firing: (11) High Kubernetes API latency (LIST nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:31:12] !log elukey@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
[09:31:15] !log elukey@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'.
[09:33:10] (CR) Giuseppe Lavagetto: [C: +1] Add the pause image [docker-images/production-images] - https://gerrit.wikimedia.org/r/858345 (https://phabricator.wikimedia.org/T322920) (owner: Elukey)
[09:33:12] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[09:34:58] (KubernetesAPILatency) firing: (11) High Kubernetes API latency (LIST nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:39:02] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38351/console" [puppet] - https://gerrit.wikimedia.org/r/858995 (https://phabricator.wikimedia.org/T322920) (owner: Elukey)
[09:39:54] RECOVERY - SSH on mw1334.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:40:39] (CR) Elukey: [V: +1 C: +2] profile::kubernetes::node: add k8s_116 tag to infrapod's default [puppet] - https://gerrit.wikimedia.org/r/858995 (https://phabricator.wikimedia.org/T322920) (owner: Elukey)
[09:44:11] (CR) Giuseppe Lavagetto: [C: +2] confd: use the v3 style srv records [puppet] - https://gerrit.wikimedia.org/r/843873 (https://phabricator.wikimedia.org/T320397) (owner: Giuseppe Lavagetto)
[09:46:57] (PS1) Jelto: sre.gitlab.upgrade: add cookbook to upgrade GitLab version [cookbooks] - https://gerrit.wikimedia.org/r/858999
[09:51:43] (CR) Jcrespo: "This is ok to me, but shouldn't the check for duplicates be updated at modules/openldap/files/cross-validate-accounts.py (validate_duplica" [puppet] - https://gerrit.wikimedia.org/r/858567 (owner: Jbond)
[09:53:12] PROBLEM - confd service on puppetmaster1001 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:54:50] (PS1) Giuseppe Lavagetto: confd: fix dns record template [puppet] - https://gerrit.wikimedia.org/r/859000
[09:55:14] (CR) Giuseppe Lavagetto: [V: +2 C: +2] confd: fix dns record template [puppet] - https://gerrit.wikimedia.org/r/859000 (owner: Giuseppe Lavagetto)
[09:56:22] (CR) Arturo Borrero Gonzalez: [C: +1] neutron.conf: remove allow_overlapping_ips config flag [puppet] - https://gerrit.wikimedia.org/r/858646 (https://phabricator.wikimedia.org/T323319) (owner: Andrew Bogott)
[09:57:05] (CR) Arturo Borrero Gonzalez: [C: +1] Set service_token_roles for services that use Keystone [puppet] - https://gerrit.wikimedia.org/r/858647 (https://phabricator.wikimedia.org/T323319) (owner: Andrew Bogott)
[09:57:56] (CR) Arturo Borrero Gonzalez: [C: +1] glance: use memcached for token caching [puppet] - https://gerrit.wikimedia.org/r/858651 (https://phabricator.wikimedia.org/T323319) (owner: Andrew Bogott)
[09:58:18] (CR) Arturo Borrero Gonzalez: [C: +1] cinder.conf: lock_path to oslo_concurrency [puppet] - https://gerrit.wikimedia.org/r/858653 (https://phabricator.wikimedia.org/T323319) (owner: Andrew Bogott)
[09:58:39] (CR) Arturo Borrero Gonzalez: [C: +1] cinder: remove default quota settings [puppet] - https://gerrit.wikimedia.org/r/858654 (https://phabricator.wikimedia.org/T323319) (owner: Andrew Bogott)
[09:59:03] (CR) Arturo Borrero Gonzalez: [C: +1] trove: remove network_label_regex [puppet] - https://gerrit.wikimedia.org/r/858655 (https://phabricator.wikimedia.org/T323319) (owner: Andrew Bogott)
[09:59:16] RECOVERY - confd service on puppetmaster1001 is OK: OK - confd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:00:08] (PS12) Btullis: Add a spark-operator chart and helmfile configuraiton [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926)
[10:02:42] (CR) DCausse: [C: +1] "should we increase cloudelastic small clusters too?" [puppet] - https://gerrit.wikimedia.org/r/855673 (owner: Ebernhardson)
[10:04:24] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 136 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:04:40] (PS1) Arturo Borrero Gonzalez: toolforge: haproxy: disable http probes checks [puppet] - https://gerrit.wikimedia.org/r/859001
[10:06:20] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:13:03] (ProbeDown) firing: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:13:11] (PS1) Aklapper: phabricator weekly changes email: List portal changes [puppet] - https://gerrit.wikimedia.org/r/859002 (https://phabricator.wikimedia.org/T323477)
[10:14:08] (CR) Elukey: [C: +1] Retire two k8s Partman recipes [puppet] - https://gerrit.wikimedia.org/r/858587 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[10:15:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[10:17:27] (CR) Elukey: [V: +2 C: +2] Add the pause image (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/858345 (https://phabricator.wikimedia.org/T322920) (owner: Elukey)
[10:18:03] (ProbeDown) resolved: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:18:43] (PS2) Aklapper: phabricator weekly changes email: List dashboard changes [puppet] - https://gerrit.wikimedia.org/r/858994 (https://phabricator.wikimedia.org/T323471)
[10:21:30] (PS3) Aklapper: phabricator weekly changes email: List dashboard changes [puppet] - https://gerrit.wikimedia.org/r/858994 (https://phabricator.wikimedia.org/T323471)
[10:21:58] (PS1) Arturo Borrero Gonzalez: cloudvirt: eqiad1: prepare hiera configuration for modern NIC setup [puppet] - https://gerrit.wikimedia.org/r/859004 (https://phabricator.wikimedia.org/T319184)
[10:23:09] (PS1) Arturo Borrero Gonzalez: cloudvirt1053: move to modern NIC setup [puppet] - https://gerrit.wikimedia.org/r/859005 (https://phabricator.wikimedia.org/T319184)
[10:24:37] (CR) Jelto: "I left one small note in-line" [puppet] - https://gerrit.wikimedia.org/r/858716 (https://phabricator.wikimedia.org/T317059) (owner: AOkoth)
[10:25:50] (CR) Vgutierrez: [C: +2] role::cache: Link/copy (text|upload)_haproxy to base roles [puppet] - https://gerrit.wikimedia.org/r/849180 (https://phabricator.wikimedia.org/T323365) (owner: BBlack)
[10:29:30] (CR) Arturo Borrero Gonzalez: [V: +1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/859004/38352/" [puppet] - https://gerrit.wikimedia.org/r/859005 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[10:29:56] (CR) Arturo Borrero Gonzalez: [V: +1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/859004/38352/" [puppet] - https://gerrit.wikimedia.org/r/859004 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[10:30:01] (CirrusSearchHighOldGCFrequency) firing: (3) Elasticsearch instance cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[10:33:29] (CR) Cathal Mooney: [C: +1] cloudvirt: eqiad1: prepare hiera configuration for modern NIC setup [puppet] - https://gerrit.wikimedia.org/r/859004 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[10:33:51] (PS1) Jcrespo: Transferer: Update encryption to use aes-128-gcm instead of chacha20 [software/transferpy] - https://gerrit.wikimedia.org/r/859007 (https://phabricator.wikimedia.org/T321605)
[10:33:54] (CR) Arturo Borrero Gonzalez: [V: +1] "the actual PCC https://puppet-compiler.wmflabs.org/output/859005/38353/" [puppet] - https://gerrit.wikimedia.org/r/859005 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[10:34:07] (CR) Arturo Borrero Gonzalez: [V: +1 C: +2] cloudvirt: eqiad1: prepare hiera configuration for modern NIC setup [puppet] - https://gerrit.wikimedia.org/r/859004 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[10:34:37] (CR) CI reject: [V: -1] Transferer: Update encryption to use aes-128-gcm instead of chacha20 [software/transferpy] - https://gerrit.wikimedia.org/r/859007 (https://phabricator.wikimedia.org/T321605) (owner: Jcrespo)
[10:34:58] (CR) Jcrespo: "recheck" [software/transferpy] - https://gerrit.wikimedia.org/r/770089 (https://phabricator.wikimedia.org/T256749) (owner: Jcrespo)
[10:36:04] SRE, Infrastructure-Foundations, netops, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (aborrero)
[10:37:06] (CR) Cathal Mooney: [C: +1] cloudvirt1053: move to modern NIC setup [puppet] - https://gerrit.wikimedia.org/r/859005 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[10:37:14] (CR) Arturo Borrero Gonzalez: [V: +1 C: +2] cloudvirt1053: move to modern NIC setup [puppet] - https://gerrit.wikimedia.org/r/859005 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[10:37:59] (CR) JMeybohm: [C: +1] Retire two k8s Partman recipes [puppet] - https://gerrit.wikimedia.org/r/858587 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[10:38:04] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1053.eqiad.wmnet with OS bullseye
[10:38:14] SRE, Infrastructure-Foundations, netops, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudvirt1053.eqiad.wmnet with O...
[10:38:57] SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (BTullis) >>! In T306939#8404781, @Papaul wrote: > @Ottomata @BTullis what HW RAID are we using for those servers ? > Thanks Hi @papaul - could we have the following RAID con...
[10:39:10] (PS5) Jcrespo: Use the shlex.quote method to escape hosts and paths [software/transferpy] - https://gerrit.wikimedia.org/r/770089 (https://phabricator.wikimedia.org/T256749)
[10:42:06] (PS2) Jcrespo: Transferer: Update encryption to use aes-128-gcm instead of chacha20 [software/transferpy] - https://gerrit.wikimedia.org/r/859007 (https://phabricator.wikimedia.org/T321605)
[10:48:03] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki
[10:48:05] !log btullis@cumin1001 END (FAIL) - Cookbook sre.wikireplicas.add-wiki (exit_code=99)
[10:52:14] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1053.eqiad.wmnet with reason: host reimage
[10:54:54] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1053.eqiad.wmnet with reason: host reimage
[10:58:17] (CR) Vgutierrez: [C: +1] Transferer: Update encryption to use aes-128-gcm instead of chacha20 [software/transferpy] - https://gerrit.wikimedia.org/r/859007 (https://phabricator.wikimedia.org/T321605) (owner: Jcrespo)
[10:58:23] SRE, SRE-Access-Requests, Data-Engineering: Grant ssh access to analytics-admins to dcausse and gmodena - https://phabricator.wikimedia.org/T323280 (jcrespo) To clarify- there is no blocker from SRE team ops to proceed with this, we are eager and waiting for the template to be added on this ticket to...
[11:04:46] (PS1) Jbond: cfssl: update pattern to be case insensetive [puppet] - https://gerrit.wikimedia.org/r/859010
[11:05:44] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38354/console" [puppet] - https://gerrit.wikimedia.org/r/859010 (owner: Jbond)
[11:06:16] (CR) Jbond: [V: +1 C: +2] cfssl: update pattern to be case insensetive [puppet] - https://gerrit.wikimedia.org/r/859010 (owner: Jbond)
[11:09:58] (CR) Vgutierrez: Transferer: Update encryption to use aes-128-gcm instead of chacha20 [software/transferpy] - https://gerrit.wikimedia.org/r/859007 (https://phabricator.wikimedia.org/T321605) (owner: Jcrespo)
[11:12:02] (CR) Vgutierrez: [C: -2] "enc: AEAD ciphers not supported" [software/transferpy] - https://gerrit.wikimedia.org/r/859007 (https://phabricator.wikimedia.org/T321605) (owner: Jcrespo)
[11:15:31] SRE-tools, Infrastructure-Foundations, cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (fnegri)
[11:20:00] SRE, database-backups: Transferpy: Enable PBKDF2 usage - https://phabricator.wikimedia.org/T323485 (Vgutierrez)
[11:21:51] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host cloudvirt1053.eqiad.wmnet with OS bullseye
[11:22:00] SRE, Infrastructure-Foundations, netops, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudvirt1053.eqiad.wmnet with OS bu...
[11:22:10] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudvirt1053.eqiad.wmnet with OS bu... [11:22:44] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10dcaro) [11:24:03] (ProbeDown) firing: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:24:38] (03PS1) 10Jbond: wmflib::service::catalog: Use loadyaml [puppet] - 10https://gerrit.wikimedia.org/r/859014 [11:25:01] (CirrusSearchHighOldGCFrequency) firing: (3) Elasticsearch instance cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [11:28:58] (03PS2) 10Jbond: wmflib::service::catalog: Use loadyaml [puppet] - 10https://gerrit.wikimedia.org/r/859014 [11:29:03] (ProbeDown) resolved: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:38:47] (03CR) 10Giuseppe Lavagetto: [C: 03+2] role::kubernetes::wroker: allow scap to pre-pull mediawiki images [puppet] - 10https://gerrit.wikimedia.org/r/858543 (https://phabricator.wikimedia.org/T323349) (owner: 
10Giuseppe Lavagetto) [11:39:06] (03PS1) 10Vgutierrez: site: Move cp nodes to role::cache:text|upload [puppet] - 10https://gerrit.wikimedia.org/r/859015 (https://phabricator.wikimedia.org/T323365) [11:41:27] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38357/console" [puppet] - 10https://gerrit.wikimedia.org/r/858543 (https://phabricator.wikimedia.org/T323349) (owner: 10Giuseppe Lavagetto) [11:41:41] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38358/console" [puppet] - 10https://gerrit.wikimedia.org/r/859015 (https://phabricator.wikimedia.org/T323365) (owner: 10Vgutierrez) [11:43:27] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] role::kubernetes::wroker: allow scap to pre-pull mediawiki images [puppet] - 10https://gerrit.wikimedia.org/r/858543 (https://phabricator.wikimedia.org/T323349) (owner: 10Giuseppe Lavagetto) [11:48:01] (03CR) 10Filippo Giunchedi: "FYI this broke puppet on prometheus hosts:" [puppet] - 10https://gerrit.wikimedia.org/r/849180 (https://phabricator.wikimedia.org/T323365) (owner: 10BBlack) [11:48:06] vgutierrez: ^ [11:48:44] ouch, my fault [11:48:53] my suggestion is to use a dummy class since this is temporary anyways [11:49:33] godog: so for text nodes we can target profile::cache::varnish::frontend::text IIRC [11:49:54] I guess I can create an empty upload one for that purpose [11:50:01] ah nice, yeah that'd work [11:50:01] let me submit a CR addressing that ASAP [11:50:09] sweet, thank you [11:50:36] (03PS1) 10Clément Goubert: P:mediawiki::maintenance::wikidata labs exception [puppet] - 10https://gerrit.wikimedia.org/r/859017 [11:51:06] (03Abandoned) 10Jcrespo: Transferer: Update encryption to use aes-128-gcm instead of chacha20 [software/transferpy] - 10https://gerrit.wikimedia.org/r/859007 (https://phabricator.wikimedia.org/T321605) (owner: 10Jcrespo) [11:53:02] (03CR) 
10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38359/console" [puppet] - 10https://gerrit.wikimedia.org/r/859017 (owner: 10Clément Goubert) [11:56:31] (03Abandoned) 10Jbond: wmflib::service::catalog: Use loadyaml [puppet] - 10https://gerrit.wikimedia.org/r/859014 (owner: 10Jbond) [11:56:49] (03PS2) 10Vgutierrez: site: Move cp nodes to role::cache:text|upload [puppet] - 10https://gerrit.wikimedia.org/r/859015 (https://phabricator.wikimedia.org/T323365) [11:56:51] (03PS1) 10Vgutierrez: cache: Create dummy profile class to tell between text/upload nodes [puppet] - 10https://gerrit.wikimedia.org/r/859018 (https://phabricator.wikimedia.org/T323365) [11:58:32] (03PS1) 10Giuseppe Lavagetto: mediawiki-image-download: docker uses --config, not -c [puppet] - 10https://gerrit.wikimedia.org/r/859019 [11:59:44] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki-image-download: docker uses --config, not -c [puppet] - 10https://gerrit.wikimedia.org/r/859019 (owner: 10Giuseppe Lavagetto) [12:00:16] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38360/console" [puppet] - 10https://gerrit.wikimedia.org/r/859018 (https://phabricator.wikimedia.org/T323365) (owner: 10Vgutierrez) [12:00:50] 10SRE, 10Observability-Alerting, 10User-fgiunchedi: Icinga downtimes not working - https://phabricator.wikimedia.org/T314353 (10fgiunchedi) 05Stalled→03Resolved a:03fgiunchedi except for that spike it looks like check latency is under control (and going down, as we progressively remove more and more ch... [12:01:04] godog: https://gerrit.wikimedia.org/r/c/operations/puppet/+/859018/ [12:01:56] (03PS1) 10Jbond: mediawiki::maintenance::wikidata: update to work with deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/859020 [12:02:33] (03CR) 10Filippo Giunchedi: [C: 03+1] "Ship it!" 
[puppet] - 10https://gerrit.wikimedia.org/r/859018 (https://phabricator.wikimedia.org/T323365) (owner: 10Vgutierrez) [12:02:37] nice [12:02:51] (03CR) 10Ladsgroup: [C: 03+1] P:mediawiki::maintenance::wikidata labs exception [puppet] - 10https://gerrit.wikimedia.org/r/859017 (owner: 10Clément Goubert) [12:03:23] vgutierrez: LGTM [12:03:31] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cache: Create dummy profile class to tell between text/upload nodes [puppet] - 10https://gerrit.wikimedia.org/r/859018 (https://phabricator.wikimedia.org/T323365) (owner: 10Vgutierrez) [12:03:41] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] P:mediawiki::maintenance::wikidata labs exception [puppet] - 10https://gerrit.wikimedia.org/r/859017 (owner: 10Clément Goubert) [12:04:08] godog: done [12:04:32] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38361/console" [puppet] - 10https://gerrit.wikimedia.org/r/859020 (owner: 10Jbond) [12:04:41] ack, I'll run puppet [12:04:48] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:pki: add new type calidation for ca names (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/858556 (owner: 10Jbond) [12:06:02] godog: BTW, what's the potential impact of having a duplicated job in prometheus? 
[12:06:05] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38362/console" [puppet] - 10https://gerrit.wikimedia.org/r/859015 (https://phabricator.wikimedia.org/T323365) (owner: 10Vgutierrez) [12:07:04] (03Abandoned) 10Jbond: mediawiki::maintenance::wikidata: update to work with deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/859020 (owner: 10Jbond) [12:07:47] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:pki: add new type calidation for ca names (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/858556 (owner: 10Jbond) [12:08:00] vgutierrez: IIRC prometheus config check will fail and puppet will fail too [12:08:24] hmmm nope, given that puppet is happy [12:08:42] so now that profile::cache:varnish::frontend::upload is there [12:08:52] then I guess/hope it'll do the right thing and merge things together [12:09:08] I could hack ops.pp a little bit [12:09:15] also of course puppet needs to run on upload hosts before the targets show up on prometheus [12:09:24] * godog shakes fist at exported resources once more [12:09:27] to get rid of the duplicated definitions for role::cache::upload and upload_haproxy [12:09:39] (and for text ones of course) [12:09:51] but I don't wanna lose metrics in the process [12:10:03] * vgutierrez sending a CR to provide context [12:10:36] ack, thank you, I'm running puppet on cache hosts in the meantime [12:12:34] (03CR) 10Jaime Nuche: [C: 03+1] "LGTM. 
Actual dsh list defined in https://gerrit.wikimedia.org/r/c/operations/puppet/+/858987" [puppet] - 10https://gerrit.wikimedia.org/r/858988 (https://phabricator.wikimedia.org/T323349) (owner: 10Giuseppe Lavagetto) [12:12:42] (03PS1) 10Hnowlan: Encode headers before passing [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/859026 (https://phabricator.wikimedia.org/T323114) [12:13:24] (03CR) 10Hnowlan: "Just adding volans as an FYI based on our chats on irc :)" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/859026 (https://phabricator.wikimedia.org/T323114) (owner: 10Hnowlan) [12:13:38] could someone check up on this maintenance script run for me: https://phabricator.wikimedia.org/T315510#8392683 is it finished or still running? [12:13:49] jouncebot: nowandnext [12:13:49] No deployments scheduled for the next 1 hour(s) and 46 minute(s) [12:13:49] In 1 hour(s) and 46 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221121T1400) [12:13:55] (03CR) 10Jbond: [C: 03+1] "lgtm minor, optional nit" [puppet] - 10https://gerrit.wikimedia.org/r/858656 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [12:14:26] !log jnuche@deploy1002 Installing scap version "4.29.0" for 559 hosts [12:15:00] (03PS3) 10Vgutierrez: site: Move cp nodes to role::cache:text|upload [puppet] - 10https://gerrit.wikimedia.org/r/859015 (https://phabricator.wikimedia.org/T323365) [12:15:01] !log jnuche@deploy1002 Installation of scap version "4.29.0" completed for 559 hosts [12:15:02] (03PS1) 10Vgutierrez: prometheus::ops: Leverage varnish::frontend::text|upload classes [puppet] - 10https://gerrit.wikimedia.org/r/859029 (https://phabricator.wikimedia.org/T323365) [12:15:37] godog: this https://gerrit.wikimedia.org/r/c/operations/puppet/+/859029 [12:15:51] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/852260 (owner: 10Dzahn) [12:15:58] godog: and IIRC I would need to manually wipe the 
_haproxy_*.yaml files [12:17:06] checking [12:17:46] yes that's right [12:17:58] and we won't lose any metrics, right? [12:18:00] MatmaRex: Still running, on commonswiki at the moment [12:18:19] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus::ops: Leverage varnish::frontend::text|upload classes [puppet] - 10https://gerrit.wikimedia.org/r/859029 (https://phabricator.wikimedia.org/T323365) (owner: 10Vgutierrez) [12:18:39] vgutierrez: you shouldn't, labels stay the same AFAICT [12:18:50] godog: what about the job label? [12:19:02] claime: thanks. can you tell how far along it is? (last line of the output) [12:20:14] commonswiki: Processed 35012300 (updated 3878914) of 118789703 rows [12:20:16] commonswiki: --start '["39299440"]' [12:20:21] godog: hmm job_name is still the same one, right [12:20:26] vgutierrez: that's based on the varnish-text vs varnish-upload prefix in the file name [12:20:29] yeah exactly [12:20:36] MatmaRex: ^^^ [12:20:37] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero) [12:20:42] godog: lovely, merging that one then :) [12:20:44] thx <3 [12:20:57] for sure! [12:21:13] (03CR) 10Vgutierrez: [C: 03+2] prometheus::ops: Leverage varnish::frontend::text|upload classes [puppet] - 10https://gerrit.wikimedia.org/r/859029 (https://phabricator.wikimedia.org/T323365) (owner: 10Vgutierrez) [12:21:19] claime: thank you [12:21:46] np ;) [12:22:34] (03PS1) 10Giuseppe Lavagetto: kubernetes::mediawiki_runner: allow ssh from the deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/859040 [12:22:37] godog: so in ~60 minutes it should be safe to wipe the old _haproxy_ yaml files right? 
[12:23:52] basically after 2x puppet run in cache nodes and prometheus ones :) [12:24:08] (03PS2) 10Giuseppe Lavagetto: scap::dsh: add kubernetes-workers dsh list [puppet] - 10https://gerrit.wikimedia.org/r/858987 (https://phabricator.wikimedia.org/T323349) [12:24:38] vgutierrez: yes should be safe indeed, if it isn't puppet will rewrite the files anyways [12:25:38] (03PS1) 10Jbond: cross-validate-accounts: Add datacenter-ops to allowed duplicates [puppet] - 10https://gerrit.wikimedia.org/r/859041 [12:25:49] I'm trying to figure out why even after I ran puppet on cp* the files cache_haproxy_tls_mtail_upload_esams.yaml and cache_haproxy_tls_upload_esams.yaml are empty [12:25:56] e.g. in prometheus3001:/srv/prometheus/ops/targets [12:26:33] (running puppet again) [12:26:47] (03CR) 10Jbond: [C: 03+2] cross-validate-accounts: Add datacenter-ops to allowed duplicates [puppet] - 10https://gerrit.wikimedia.org/r/859041 (owner: 10Jbond) [12:27:03] or is that expected vgutierrez I think ? [12:27:46] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero) [12:28:28] (03CR) 10Giuseppe Lavagetto: [C: 03+2] scap::dsh: add kubernetes-workers dsh list [puppet] - 10https://gerrit.wikimedia.org/r/858987 (https://phabricator.wikimedia.org/T323349) (owner: 10Giuseppe Lavagetto) [12:28:31] godog: puppet didn't run yet on the cache hosts? [12:28:34] (03CR) 10Vlad.shapik: [C: 03+1] Encode headers before passing [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/859026 (https://phabricator.wikimedia.org/T323114) (owner: 10Hnowlan) [12:28:41] godog: oh.. you triggered that [12:28:45] vgutierrez: I did yeah [12:28:49] profile::cache::varnish::frontend::upload should be there [12:29:12] * vgutierrez looking for typos.. [12:29:37] so.. 
the class is included here: https://github.com/wikimedia/puppet/blob/production/modules/role/manifests/cache/upload.pp [12:30:32] oh crap [12:30:41] the roles aren't symlinked of course [12:30:51] just the yaml files [12:30:52] ah yeah you want role::cache::upload_haproxy [12:30:54] indeed [12:30:55] (03PS1) 10Arturo Borrero Gonzalez: cloudvirt1052: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/859042 (https://phabricator.wikimedia.org/T319184) [12:30:57] fixing it [12:30:58] *sigh* [12:31:08] ok! I have to run to lunch but you get the idea [12:31:10] (03PS1) 10Giuseppe Lavagetto: scap::dsh: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/859043 [12:31:35] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] scap::dsh: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/859043 (owner: 10Giuseppe Lavagetto) [12:31:48] (03CR) 10Clément Goubert: [C: 03+1] scap::dsh: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/859043 (owner: 10Giuseppe Lavagetto) [12:31:52] gotta go, bbiab [12:32:32] (03PS4) 10Vgutierrez: site: Move cp nodes to role::cache:text|upload [puppet] - 10https://gerrit.wikimedia.org/r/859015 (https://phabricator.wikimedia.org/T323365) [12:32:34] (03PS1) 10Vgutierrez: role::cache_upload: Include profile:c:v:f:upload class [puppet] - 10https://gerrit.wikimedia.org/r/859044 (https://phabricator.wikimedia.org/T323365) [12:32:56] gotta love irccloud rendering an emoji in the middle of that commit msg [12:33:20] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/858606 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:33:48] (03CR) 10ArielGlenn: [C: 04-1] "Looks like there are still some issues, assuming that pcc is running properly: https://puppet-compiler.wmflabs.org/output/852260/38363/" [puppet] - 10https://gerrit.wikimedia.org/r/852260 (owner: 10Dzahn) [12:34:52] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38364/console" [puppet] - 10https://gerrit.wikimedia.org/r/859044 (https://phabricator.wikimedia.org/T323365) (owner: 10Vgutierrez) [12:35:07] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] role::cache_upload: Include profile:c:v:f:upload class [puppet] - 10https://gerrit.wikimedia.org/r/859044 (https://phabricator.wikimedia.org/T323365) (owner: 10Vgutierrez) [12:35:34] _joe_: ok to merge Giuseppe Lavagetto: scap::dsh: brown paper bag fix (4a7df0d47f) :? [12:35:44] <_joe_> vgutierrez: doh, yes [12:35:52] <_joe_> I was wondering why it didn't work [12:35:57] :) [12:35:59] <_joe_> turns out I didn't merge it :P [12:35:59] (done) [12:36:02] <_joe_> thanks [12:37:35] (03CR) 10ArielGlenn: [C: 03+1] "Yay!" [puppet] - 10https://gerrit.wikimedia.org/r/858662 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [12:38:57] (03CR) 10Clément Goubert: kubernetes::mediawiki_runner: allow ssh from the deployment servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/859040 (owner: 10Giuseppe Lavagetto) [12:40:31] (03CR) 10Giuseppe Lavagetto: kubernetes::mediawiki_runner: allow ssh from the deployment servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/859040 (owner: 10Giuseppe Lavagetto) [12:41:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance [12:41:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2101.codfw.wmnet with reason: Maintenance [12:41:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance [12:41:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T323214)', diff saved to https://phabricator.wikimedia.org/P40231 and previous config saved to /var/cache/conftool/dbconfig/20221121-124146-ladsgroup.json [12:41:51] T323214: Fix unsigned 
drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [12:42:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2101.codfw.wmnet with reason: Maintenance [12:44:42] (03CR) 10Hnowlan: [C: 03+2] Encode headers before passing [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/859026 (https://phabricator.wikimedia.org/T323114) (owner: 10Hnowlan) [12:45:01] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Stop spamming SAL with helmfile on scap deployments - https://phabricator.wikimedia.org/T323296 (10Clement_Goubert) 05In progress→03Resolved [12:45:13] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786 (10Clement_Goubert) [12:48:51] (03PS1) 10Jcrespo: Transferer: Enable PBKDF2 usage with 310000 iterations [software/transferpy] - 10https://gerrit.wikimedia.org/r/859047 (https://phabricator.wikimedia.org/T323485) [12:49:33] (03Merged) 10jenkins-bot: Encode headers before passing [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/859026 (https://phabricator.wikimedia.org/T323114) (owner: 10Hnowlan) [12:53:56] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero) [12:55:37] (03CR) 10ArielGlenn: [C: 03+1] "How did we miss this? Good catch!" 
[puppet] - 10https://gerrit.wikimedia.org/r/858410 (owner: 10Urbanecm) [13:02:17] (03PS1) 10Hnowlan: thumbor: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/859049 (https://phabricator.wikimedia.org/T312104) [13:06:56] (03CR) 10Cathal Mooney: [C: 03+1] cloudvirt1052: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/859042 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [13:08:03] (03CR) 10Hnowlan: [C: 03+2] thumbor: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/859049 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan) [13:09:07] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [13:09:10] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [13:10:00] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1052.eqiad.wmnet with OS bullseye [13:10:09] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudvirt1052.eqiad.wmnet with O... [13:10:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudvirt1052: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/859042 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [13:14:00] (03Merged) 10jenkins-bot: thumbor: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/859049 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan) [13:14:09] vgutierrez: how'd go ? 
[13:14:20] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [13:15:13] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [13:15:35] godog: triggered puppet runs and it's looking good [13:15:39] I'll continue after lunch [13:16:13] SGTM! [13:17:00] (03CR) 10Urbanecm: "Thanks for the +1! I'll however need help with deployment, please." [puppet] - 10https://gerrit.wikimedia.org/r/858410 (owner: 10Urbanecm) [13:19:19] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: default to valid external url [puppet] - 10https://gerrit.wikimedia.org/r/857522 (https://phabricator.wikimedia.org/T301944) (owner: 10Filippo Giunchedi) [13:20:39] (03PS4) 10ArielGlenn: dumps: Keep only 13 latest growthmentorship dumps [puppet] - 10https://gerrit.wikimedia.org/r/858410 (owner: 10Urbanecm) [13:21:43] (03CR) 10ArielGlenn: [C: 03+2] dumps: Keep only 13 latest growthmentorship dumps [puppet] - 10https://gerrit.wikimedia.org/r/858410 (owner: 10Urbanecm) [13:24:01] (03CR) 10ArielGlenn: [C: 03+2] dumps: Keep only 13 latest growthmentorship dumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/858410 (owner: 10Urbanecm) [13:24:09] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1052.eqiad.wmnet with reason: host reimage [13:24:11] Thanks apergos! [13:24:28] yw! 
[13:26:48] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1052.eqiad.wmnet with reason: host reimage [13:33:12] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [13:34:02] !log there will a progressive roll restart of prometheus after https://gerrit.wikimedia.org/r/857522 [13:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:13] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST mutatingwebhookconfigurations) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:39:38] (03PS1) 10Daniel Kinzler: SimpleParsoidOutputStash: use makeKey() [core] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/859069 (https://phabricator.wikimedia.org/T323357) [13:43:19] (03PS1) 10Daniel Kinzler: HookUtils::parseRevisionParsoidHtml doesn't need HTML for editing [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/859070 (https://phabricator.wikimedia.org/T323357) [13:43:28] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [13:43:55] that's me ^ [13:48:43] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ms-be2050.codfw.wmnet [13:53:41] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1052.eqiad.wmnet with OS bullseye [13:53:45] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38365/console" [puppet] - 10https://gerrit.wikimedia.org/r/859015 (https://phabricator.wikimedia.org/T323365) (owner: 10Vgutierrez) [13:53:50] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudvirt1052.eqiad.wmnet with OS bu... [13:54:41] !log jbond@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ms-be2050.codfw.wmnet [13:58:05] 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Grant ssh access to analytics-admins to dcausse and gmodena - https://phabricator.wikimedia.org/T323280 (10BTullis) Thanks Jaime, Here are the existing sudo permissions applicable to `analytics-admins`: https://github.com/wikimedia/puppet/blob/production/modu... [13:58:28] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor I � Unicode. 
All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221121T1400). [14:00:04] duesen and matmarex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:30] o/ [14:01:30] hi [14:01:34] o/ [14:02:45] urbanecm: I have a config patch that is just prep and should have no effect. And I and MatmaRex have two bug fixes related to the same issue, both pretty critical. [14:02:51] okay [14:02:54] let's start [14:02:58] (03CR) 10Urbanecm: [C: 03+2] HookUtils::parseRevisionParsoidHtml doesn't need HTML for editing [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/859070 (https://phabricator.wikimedia.org/T323357) (owner: 10Daniel Kinzler) [14:03:01] (03CR) 10Urbanecm: [C: 03+2] SimpleParsoidOutputStash: use makeKey() [core] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/859069 (https://phabricator.wikimedia.org/T323357) (owner: 10Daniel Kinzler) [14:03:02] all three are independent of each other [14:03:06] ack [14:03:51] duesen: is it safe to skip mwdebug on the config patch? [14:04:00] since you say it should be no-op, i assume there's nothing to test [14:04:30] urbanecm: yes. if i didn't break the syntax, there is nothing that can go wrong [14:04:34] ack [14:04:35] !log urbanecm@deploy1002 backport aborted: (duration: 00m 51s) [14:04:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858687 (https://phabricator.wikimedia.org/T322672) (owner: 10Daniel Kinzler) [14:05:27] (03Merged) 10jenkins-bot: Set parser cache write propability for /page/html endpoint. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/858687 (https://phabricator.wikimedia.org/T322672) (owner: 10Daniel Kinzler) [14:05:40] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:858687|Set parser cache write propability for /page/html endpoint.]] [14:05:49] should be getting out momentarily [14:07:12] urbanecm: i'm curious - are you running scap backport on the two patches separately? Won't that cause confusion? [14:07:15] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/855673 (owner: 10Ebernhardson) [14:07:26] duesen: nope, I'm running only one scap at once [14:07:41] i ran `scap backport 858687` first, and then realized mwdebug can be skipped [14:08:03] so i asked, and when you confirmed, i Ctrl+C'ed and ran `scap backport --yes 858687`, which makes scap backport skip the mwdebug step [14:08:05] (03Merged) 10jenkins-bot: HookUtils::parseRevisionParsoidHtml doesn't need HTML for editing [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/859070 (https://phabricator.wikimedia.org/T323357) (owner: 10Daniel Kinzler) [14:08:42] for the backports, i +2'ed those manually (not via scap backport), to save on CI time [14:09:15] ah, I see [14:09:38] that's safe to do -- the only catch is that scap backport will always deploy everything that got merged, even if that patch's not specified on the command line [14:09:45] (it will warn you though when that happens) [14:10:04] oh, it will pull all repos? [14:10:10] yes [14:10:14] good to know! [14:10:18] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:858687|Set parser cache write propability for /page/html endpoint.]] (duration: 04m 37s) [14:10:24] and probably the right thing to do, for consistency. 
[14:10:29] yeah [14:11:10] it checks what it pulled though with what you put at the command line, to decrease the chance of a deployer deploying something unintentionally [14:11:32] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/859070 (https://phabricator.wikimedia.org/T323357) (owner: 10Daniel Kinzler) [14:11:45] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:859070|HookUtils::parseRevisionParsoidHtml doesn't need HTML for editing (T323357)]] [14:12:05] !log urbanecm@deploy1002 urbanecm and daniel: Backport for [[gerrit:859070|HookUtils::parseRevisionParsoidHtml doesn't need HTML for editing (T323357)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [14:12:21] MatmaRex: duesen: 859070's now at mwdebug1001, can you have a look please? [14:12:38] urbanecm: will do, just a sec. [14:12:48] MatmaRex: can you also confirm your patch on beta? 
[14:13:07] yeah [14:13:13] well, if it's deployed there [14:13:26] if the patch in master is merged, it should be [14:13:43] i never know how long it takes [14:13:51] in theory, 10 minutes [14:13:57] but sometimes it's broken [14:14:03] and Special:Version is always wrong [14:14:11] MatmaRex: you can also see the deployment progress at https://integration.wikimedia.org/ci/ [14:14:42] (search for deployment-deploy03) [14:15:06] or at https://integration.wikimedia.org/ci/job/beta-scap-sync-world/, probably a better link [14:16:07] (03PS1) 10Btullis: Promote the aqs_next role to be aqs [puppet] - 10https://gerrit.wikimedia.org/r/859059 (https://phabricator.wikimedia.org/T302278) [14:16:43] anyway, seems to be working on beta: https://en.wikipedia.beta.wmflabs.org/wiki/Special:GoToComment/c-Yatu-20221121141500-Yatu-20220812222900 redirects as expected [14:17:06] (03Merged) 10jenkins-bot: SimpleParsoidOutputStash: use makeKey() [core] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/859069 (https://phabricator.wikimedia.org/T323357) (owner: 10Daniel Kinzler) [14:17:50] MatmaRex: did you confirm on the live site as well? I tried and didn't succeed... 
[14:18:10] looking now [14:18:40] I was testing https://en.wikipedia.org/wiki/User_talk:DKinzler_(WMF)/Sandbox with https://en.wikipedia.org/wiki/Special:FindComment/c-DKinzler_(WMF)-20221121141500-Second_Discussion [14:19:00] I probably did something wrong [14:19:35] it's not enabled on english wikipedia yet D: [14:19:39] this works: https://test.wikipedia.org/wiki/Special:GoToComment/c-Matma_Rex-20221121141900-Matma_Rex-20220817132600 [14:19:49] heh, that explains it ;) [14:19:51] (i added that comment while on mwdebug1001) [14:20:01] (CirrusSearchHighOldGCFrequency) firing: (3) Elasticsearch instance cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [14:20:04] so, ok to sync? [14:20:18] yeah [14:20:51] MatmaRex: the new code should also be a lot faster. are you collecting timing stats on this operation? [14:21:31] no [14:21:55] syncing [14:22:33] but i have a long-running maintenance script using this code that we probably want to restart once we're done deploying… so we'll see how much faster it gets https://phabricator.wikimedia.org/T315510#8392683 [14:23:19] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/859059 (https://phabricator.wikimedia.org/T302278) (owner: 10Btullis) [14:24:36] Amir1, _joe_: DiscussionTools will start to write to the parsoid parser cache for edits to talk pages where DT is enabled. This wasn't planned, it's a side effect of a fix for T323357. If it causes a problem, let me know. [14:25:19] <_joe_> I guess that's ok?
[14:25:29] yeah it is imo [14:25:30] _joe_: i was hoping you'd say that :) [14:25:32] duesen: I'm fairly certain DT is having issues due refreshlinks update trigger duplicate updates [14:25:44] if you can debug that, it'd be amazing [14:25:51] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:859070|HookUtils::parseRevisionParsoidHtml doesn't need HTML for editing (T323357)]] (duration: 14m 06s) [14:26:01] * duesen looks at MatmaRex [14:26:05] and first patch's live [14:26:11] \o/ [14:26:16] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/859069 (https://phabricator.wikimedia.org/T323357) (owner: 10Daniel Kinzler) [14:26:28] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:859069|SimpleParsoidOutputStash: use makeKey() (T323357)]] [14:26:48] !log urbanecm@deploy1002 urbanecm and daniel: Backport for [[gerrit:859069|SimpleParsoidOutputStash: use makeKey() (T323357)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [14:26:56] duesen: (that's https://phabricator.wikimedia.org/T323080) [14:26:56] duesen: can you check at mwdebug please? [14:27:41] MatmaRex: if you want me to help with debugging that, let me know. but not today :) [14:27:47] urbanecm: looking now [14:30:13] urbanecm: seems to work! [14:30:17] great, syncing! [14:30:46] for https://gerrit.wikimedia.org/r/c/mediawiki/core/+/859045 I'm here too to verify the fix on the graphite side btw [14:32:01] 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Grant ssh access to analytics-admins to dcausse and gmodena - https://phabricator.wikimedia.org/T323280 (10Ottomata) @jcrespo I can make this change once the other approvals have been given. 
[14:32:31] godog: fyi that patch's currently being pushed to prod [14:32:42] sweet, thank you urbanecm [14:32:44] (03CR) 10Ssingh: [C: 03+1] site: Move cp nodes to role::cache:text|upload [puppet] - 10https://gerrit.wikimedia.org/r/859015 (https://phabricator.wikimedia.org/T323365) (owner: 10Vgutierrez) [14:32:47] you should be already seeing some effect [14:33:19] godog: excellent, i just confirmed that it doesn't break VE. I didn't confirm that it fixed the log spam issue :) [14:33:23] kinda yeah, rate of new metric creation seems to be slowing down [14:33:31] great [14:33:35] duesen: nice, thank you [14:33:53] note however that the patch MatmaRex pushed would also greatly slow that rate. [14:33:59] My patch should make it 0, though [14:34:10] \o/ [14:34:19] godog: do you see keys with the prefix objectcache.ParsoidOutputStash popping up? [14:34:27] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:859069|SimpleParsoidOutputStash: use makeKey() (T323357)]] (duration: 07m 58s) [14:34:27] checking [14:34:28] That would confirm that my patch is working [14:34:31] (03CR) 10Bking: [C: 03+2] cirrus: Increase small cluster heap memory from 8G to 10G [puppet] - 10https://gerrit.wikimedia.org/r/855673 (owner: 10Ebernhardson) [14:34:36] godog: duesen: it should be live everywhere now. [14:34:45] urbanecm: thank you [14:34:45] anything else? [14:34:47] np [14:35:09] once we're done deploying, i'd like us to work out what we need to do with the maintenance script at https://phabricator.wikimedia.org/T315510#8392683 [14:35:11] well... let's see if godog confirms that my patch is working as intended. [14:35:19] yeah we're on [14:35:20] -rw-r--r-- 1 _graphite _graphite 331000 Nov 21 14:29 ParsoidOutputStash/get_hit_rate/rate.wsp [14:35:50] and no more metrics created with some rate of urgency [14:36:09] can confirm things are working as expected, I'll clean up the old metrics [14:36:19] godog: did you see metrics with the new prefix? [14:36:49] duesen: I did!
MediaWiki/objectcache/ParsoidOutputStash is there [14:36:59] excellent, thank you! [14:37:11] np, thank you for your help on this, much appreciated [14:37:29] And sorry for the inconvenience. I should have known that we need to use makeKey. Though I wasn't aware that it would explode graphite if we didn't :) [14:37:59] godog: i have a more general patch up that should prevent this in the future, if you are interested: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/859050 [14:38:11] that script is using this code and currently running (for a week now). i suppose it will pick up the new code when the current wiki finishes, and the next one starts. however, the current wiki is commons, and after ~5 days it is 30% done, so we might want to restart it? on the other hand, because it is calling into RESTBase, which calls back into MediaWiki, it has picked up some of the code changes immediately. so i'm not sure if we ne [14:38:15] yeah neither did I, I'm glad you fixed that pitfall in core too duesen [14:38:18] yeah +1 from me [14:38:33] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply config changes - bking@cumin1001 - T319020 [14:38:39] T319020: Reset to upstream java GC options and remove redundant JVM options - https://phabricator.wikimedia.org/T319020 [14:39:14] MatmaRex: we can certainly restart it if you think that'd be better. [14:39:29] or, we can see if it's going faster than before?
[14:39:38] MatmaRex: the MW code that RESTbase calls isn't affected by the code changes (I have a patch up for changing that, but it's not in) [14:39:55] well, i don't know what would be better [14:40:05] letting it run would certainly be the simplest [14:40:15] i don't completely understand the scope of the problem here though [14:40:34] !log nuke old objectcache metrics from graphite hosts - T323357 [14:40:36] MatmaRex: DT should use ParsoidOutputAccess directly for everything except for editing. If that is the case, then DT will no longer call RESTbase, and should be much faster. [14:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:15] 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Grant ssh access to analytics-admins to dcausse and gmodena - https://phabricator.wikimedia.org/T323280 (10jcrespo) Thanks Ottomata, please use [[ https://phabricator.wikimedia.org/maniphest/task/edit/form/8/ | the template with the checklist ]] I linked to y... [14:41:27] duesen: yes, we'll do that, but i'm thinking about the long-running maintenance script that is running the old version of the code right now [14:41:34] MatmaRex: ah, i see, the script will keep triggering the issue as long as it runs... It will no longer flood graphite (my patch fixed that), but it will continue to flood main stash. [14:41:46] I'd suggest to restart. [14:41:48] right. exactly [14:42:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2111.codfw.wmnet with reason: Maintenance [14:42:23] is the main stash also experiencing problems?
[14:42:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2111.codfw.wmnet with reason: Maintenance [14:42:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T323214)', diff saved to https://phabricator.wikimedia.org/P40232 and previous config saved to /var/cache/conftool/dbconfig/20221121-144234-ladsgroup.json [14:42:40] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [14:43:13] (03PS3) 10Clément Goubert: P:mediawiki::maintenance: CampaignEvents periodic [puppet] - 10https://gerrit.wikimedia.org/r/858346 (https://phabricator.wikimedia.org/T320403) [14:43:25] MatmaRex: I don't have good insight on that, you'd have to ask Krinkle. My guess would be that it's seeing a bad cache hit rate, since we are pushing out useful content by stashing a lot of useless stuff. [14:44:00] If you want a qualified opinion, ask the perf team :) [14:44:21] hmm [14:44:22] I have to run out to the store soon. Let me know if there's anything I need to look at [14:44:42] considering that it's currently processing commonswiki [14:45:23] that cache is probably not actually being used for anything very useful [14:46:13] (03PS1) 10DCausse: Add extra-analysis-ukrainian and bump extra plugins to 7.10.2-wmf4 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/859064 (https://phabricator.wikimedia.org/T322776) [14:46:37] so i guess i'd prefer to leave it unchanged, unless we actually see issues? seems easier [14:47:14] fwiw currently it says `Processed 35639700 (updated 3991506) of 118789703 rows` [14:47:22] MatmaRex: perhaps give them a heads up that any issues with the main stash may be caused by this [14:48:04] yeah [14:48:11] can you also clarify for me what is "the main stash" here? 
[14:48:21] !log gehel@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,name=elastic2052.codfw.wmnet [14:48:35] !log repooling elastic2052 - T320482 [14:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:40] T320482: Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482 [14:48:46] ryankemper: ^^ [14:48:46] ah, well, it's not urgent, i think [14:48:57] so feel free to run out [14:49:01] 10SRE, 10ops-codfw, 10Discovery-Search (Current work): Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482 (10Gehel) 05Open→03Resolved [14:49:27] and let's wait for Krinkle to wake up and see the pings [14:50:06] so, no restart for now? [14:50:41] no restart [14:50:44] ack [14:50:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T323214)', diff saved to https://phabricator.wikimedia.org/P40233 and previous config saved to /var/cache/conftool/dbconfig/20221121-145052-ladsgroup.json [14:50:58] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [14:51:22] RECOVERY - cassandra-b CQL 10.64.131.15:9042 on aqs1020 is OK: TCP OK - 0.001 second response time on 10.64.131.15 port 9042 https://phabricator.wikimedia.org/T93886 [14:52:12] MatmaRex: $wgMainStash is set to 'memcached-pecl', using MemcachedPeclBagOStuff on 127.0.0.1:11212. 
[14:52:24] MatmaRex: don't ask me what that means, I don't know ;( [14:52:51] (03PS1) 10Ssingh: sites.yaml: add lvs4009 (ulsfo hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/859065 (https://phabricator.wikimedia.org/T317247) [14:54:14] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [14:54:15] !log btullis@cumin1001 END (FAIL) - Cookbook sre.wikireplicas.add-wiki (exit_code=99) [14:55:38] (03PS1) 10Ssingh: lvs4006: set profile::pybal::bgp to no [puppet] - 10https://gerrit.wikimedia.org/r/859086 [14:56:26] (03PS2) 10Giuseppe Lavagetto: kubernetes::mediawiki_runner: allow ssh from the deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/859040 [14:58:29] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10Jclark-ctr) @BTullis can an-tool1010 stay in same row? [14:58:50] PROBLEM - ensure kvm processes are running on cloudvirt1052 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:59:14] (03PS3) 10Giuseppe Lavagetto: kubernetes::mediawiki_runner: allow ssh from the deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/859040 [14:59:19] MatmaRex: I'm up, but meetings, MainStash visibility is MySQL aggregate dashboard in Grafana also -sre chan and Amir1. Parser cache has a dash too. I need shorter summary or level of urgent to act sooner than 5h from now.
Feel free to PM [15:00:50] RECOVERY - ensure kvm processes are running on cloudvirt1052 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:00:51] Krinkle: i think not that urgent [15:01:13] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] site: Move cp nodes to role::cache:text|upload [puppet] - 10https://gerrit.wikimedia.org/r/859015 (https://phabricator.wikimedia.org/T323365) (owner: 10Vgutierrez) [15:01:51] Krinkle: tl;dr it seems we've been writing stuff to main stash unnecessarily, is there a cache hit rate problem on cawiki (a couple days ago) or commonswiki (right now)? is there a dashboard that would answer this question? [15:02:15] Krinkle: and if there isn't such a problem, then there's nothing to do [15:02:40] (i'm away for a minute too) [15:02:49] (03PS4) 10Giuseppe Lavagetto: kubernetes::mediawiki_runner: allow ssh from the deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/859040 [15:03:42] (03PS2) 10Ssingh: cp5030: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/858675 (https://phabricator.wikimedia.org/T322048) [15:03:50] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38368/console" [puppet] - 10https://gerrit.wikimedia.org/r/859040 (owner: 10Giuseppe Lavagetto) [15:05:06] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38369/console" [puppet] - 10https://gerrit.wikimedia.org/r/859040 (owner: 10Giuseppe Lavagetto) [15:05:25] (03PS1) 10Arturo Borrero Gonzalez: cloudvirt1051: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/859087 (https://phabricator.wikimedia.org/T319184) [15:05:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P40234 
and previous config saved to /var/cache/conftool/dbconfig/20221121-150558-ladsgroup.json [15:06:22] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero) [15:06:32] PROBLEM - Host db2173.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:07:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q2:rack/setup/install cephosd100[1-5] - https://phabricator.wikimedia.org/T322760 (10Jclark-ctr) cephosd1001 E1 U3 Port 3 Cableid# 20220225 cephosd1002 E2 U3 Port 3 Cableid# 20220237 cephosd1003 E3 U3 P... [15:09:10] (03CR) 10Giuseppe Lavagetto: [V: 03+1] kubernetes::mediawiki_runner: allow ssh from the deployment servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/859040 (owner: 10Giuseppe Lavagetto) [15:11:21] (03CR) 10Clément Goubert: [C: 03+1] kubernetes::mediawiki_runner: allow ssh from the deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/859040 (owner: 10Giuseppe Lavagetto) [15:12:29] PROBLEM - Host db2174 #page is DOWN: PING CRITICAL - Packet loss = 100% [15:12:42] Amir1: expected? 
[15:12:55] here [15:13:14] please ack the page, I am checking for mw errors [15:13:22] ok doing [15:13:30] I see no connection errors, so either not mw or not pooled [15:13:45] I am wrong [15:13:49] it is pooled, so a real problem [15:13:54] (03CR) 10Ssingh: [C: 03+2] cp5030: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/858675 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [15:14:02] I will depool according to runbook [15:14:51] RECOVERY - Host db2174 #page is UP: PING OK - Packet loss = 0%, RTA = 33.24 ms [15:15:02] !log jynus@cumin1001 dbctl commit (dc=all): 'Depool db2174 - crash?', diff saved to https://phabricator.wikimedia.org/P40235 and previous config saved to /var/cache/conftool/dbconfig/20221121-151501-jynus.json [15:15:09] jynus: ^it recovered [15:15:18] (03CR) 10Vivian Rook: [C: 03+1] cloudvirt1051: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/859087 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [15:15:18] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5030.eqsin.wmnet with OS buster [15:15:21] ^ jelto herron vgutierrez [15:15:26] the depool I meant [15:15:27] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5030.eqsin.wmnet with OS buster [15:15:35] errors should go down [15:15:47] 15:15:15 up 1 min, 1 user, load average: 0.39, 0.17, 0.06 [15:15:59] (in general should be errors that don't reach users, but it is not 100% clean) [15:16:06] !log initiating Cassandra bootstrap, aqs1018-a -- T307802 [15:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:11] T307802: Bootstrap new Cassandra nodes (eqiad) - https://phabricator.wikimedia.org/T307802 [15:16:49] I did
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica [15:16:52] FYI [15:16:56] 6 | Nov-21-2022 | 14:10:55 | Status | Power Supply | Power Supply input lost (AC/DC) [15:16:56] PROBLEM - MariaDB Replica IO: s1 on db2174 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:17:05] I'm on it [15:17:15] depooled already [15:17:23] should I downtime/silence on icinga? [15:17:24] thanks [15:17:29] (03Abandoned) 10Ssingh: cp5031: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/858676 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [15:17:50] I will handle it [15:17:57] handing it to you [15:18:06] RECOVERY - cassandra-a service on aqs1018 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:18:29] (03PS1) 10Vgutierrez: role::cache: Remove text|upload_haproxy roles [puppet] - 10https://gerrit.wikimedia.org/r/859090 (https://phabricator.wikimedia.org/T323365) [15:18:36] RECOVERY - cassandra-a SSL 10.64.32.22:7001 on aqs1018 is OK: SSL OK - Certificate aqs1018-a valid until 2024-11-08 15:06:25 +0000 (expires in 717 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [15:19:12] PROBLEM - MariaDB Replica SQL: s1 on db2174 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:19:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudvirt1051: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/859087 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [15:19:36] FWIW, these are not as terrible as they look, mediawiki automatically avoids connecting to a lagged or broken replica [15:19:40] PROBLEM - mysqld processes on db2174 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:19:49] lots of crashes lately, let me guess, memory issue? [15:19:52] amir1: thanks! do you need anything from my side at the moment? [15:19:56] PROBLEM - MariaDB read only s1 on db2174 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:19:59] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1051.eqiad.wmnet with OS bullseye [15:20:01] jelto: I think so [15:20:09] I can take a look at the memory if needed [15:20:09] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudvirt1051.eqiad.wmnet with O... [15:20:10] *don't [15:20:16] nah [15:21:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P40236 and previous config saved to /var/cache/conftool/dbconfig/20221121-152105-ladsgroup.json [15:21:23] the mw impact I saw it on this dashboard: https://logstash.wikimedia.org/goto/b916891c525c96b31a66c6f8f1132180 [15:21:44] jynus: mw loves to error and warn, it's not user impacting [15:21:46] (just speaking aloud what I did, not for the DBA, who knows it already, but to the benefit of people on call) [15:22:29] the only thing that can possibly be user-facing is that the maxlag will be high and bots slow down [15:22:43] but that's not the case with codfw replicas [15:22:44] Amir1: I know the load balancer does its job in theory- just it impacts on the fly queries at least [15:22:53] *on the fly [15:23:02] PROBLEM - Check systemd state on ms-be1052 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:23:29] and I am not 100% sure all obscure patterns fail gracefully :-D [15:23:30] PROBLEM - Check systemd state on dse-k8s-worker1002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:23:51] but yeah, I agree most queries will just switch to other server [15:24:36] jynus: mw internally marks it as lagged and never connects to it. Sure in the one minute, some requests might fail but that's negligible (=error budget) [15:24:39] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38370/console" [puppet] - 10https://gerrit.wikimedia.org/r/859090 (https://phabricator.wikimedia.org/T323365) (owner: 10Vgutierrez) [15:24:46] anyway, it's either network cable or nic, let me see [15:25:39] my guess is that very small orange spike: https://grafana.wikimedia.org/goto/oz83GAOVz?orgId=1 [15:25:45] Amir1: did you see my paste above from ipmi-sel? there was a power supply event today too [15:25:54] amir1: uptime of 1 minutes sounds more like memory or power issue instead of network cable or nic ? 
[15:25:54] (03CR) 10Vgutierrez: role::cache: Remove text|upload_haproxy roles [puppet] - 10https://gerrit.wikimedia.org/r/859090 (https://phabricator.wikimedia.org/T323365) (owner: 10Vgutierrez) [15:26:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2174.codfw.wmnet with reason: hw issues [15:26:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2174.codfw.wmnet with reason: hw issues [15:27:03] (ProbeDown) firing: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:27:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q2:rack/setup/install cephosd100[1-5] - https://phabricator.wikimedia.org/T322760 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [15:27:36] jelto: yeah it's likely, I'm checking, the last bit in kern.log was about network [15:27:39] that confused me [15:28:23] yup, the uptime says it's probably power issue [15:28:35] let me do the honors bringing it back online [15:29:58] (KubernetesAPILatency) firing: (12) High Kubernetes API latency (LIST authorizationpolicies) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:30:17] db2173 is also down it seems, I don't know why [15:30:17] RECOVERY - Host db2173.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.64 ms [15:30:30] jelto: herron I brought it back, now it's catching up [15:30:44] Amir1: ack thx [15:32:03] (ProbeDown) resolved: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:32:28] Amir1: management console of db2174 says "11/21/2022 14:10:55: The power input for power supply 1 is lost." [15:33:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:33:46] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1051.eqiad.wmnet with reason: host reimage [15:34:26] (03CR) 10Ssingh: [C: 03+1] role::cache: Remove text|upload_haproxy roles [puppet] - 10https://gerrit.wikimedia.org/r/859090 (https://phabricator.wikimedia.org/T323365) (owner: 10Vgutierrez) [15:35:16] jelto: thanks, do you think I should file a task for dc ops? 
[15:36:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T323214)', diff saved to https://phabricator.wikimedia.org/P40237 and previous config saved to /var/cache/conftool/dbconfig/20221121-153611-ladsgroup.json [15:36:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1100.eqiad.wmnet with reason: Maintenance [15:36:18] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [15:36:56] RECOVERY - MariaDB read only s1 on db2174 is OK: Version 10.4.25-MariaDB-log, Uptime 429s, read_only: True, event_scheduler: True, 139.54 QPS, connection latency: 0.003943s, query latency: 0.000431s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:36:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1100.eqiad.wmnet with reason: Maintenance [15:37:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T323214)', diff saved to https://phabricator.wikimedia.org/P40238 and previous config saved to /var/cache/conftool/dbconfig/20221121-153705-ladsgroup.json [15:37:10] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1051.eqiad.wmnet with reason: host reimage [15:37:18] (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:37:19] Amir1: yes let's open a short task for awareness. Do you want to do that? 
[15:38:11] looks like db217[34] are both in D8 as well [15:38:11] D8: Add basic .arclint that will handle pep8 and pylint checks - https://phabricator.wikimedia.org/D8 [15:39:13] (03PS2) 10Vgutierrez: role::cache: Remove text|upload_haproxy roles [puppet] - 10https://gerrit.wikimedia.org/r/859090 (https://phabricator.wikimedia.org/T323365) [15:40:08] (03CR) 10Ssingh: [C: 03+1] role::cache: Remove text|upload_haproxy roles [puppet] - 10https://gerrit.wikimedia.org/r/859090 (https://phabricator.wikimedia.org/T323365) (owner: 10Vgutierrez) [15:41:26] jelto: I'll do it, thanks [15:41:54] PROBLEM - Check whether ferm is active by checking the default input chain on dse-k8s-worker1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:42:10] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5030.eqsin.wmnet with reason: host reimage [15:43:14] RECOVERY - mysqld processes on db2174 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:44:33] Amir1: great thanks! [15:45:42] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5030.eqsin.wmnet with reason: host reimage [15:46:08] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] kubernetes::mediawiki_runner: allow ssh from the deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/859040 (owner: 10Giuseppe Lavagetto) [15:48:13] jouncebot: nowandnext [15:48:13] No deployments scheduled for the next 0 hour(s) and 41 minute(s) [15:48:13] In 0 hour(s) and 41 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221121T1630) [15:48:40] ok for me to run a short maintenance script? 
(cc Amir1 and jelto, since it sounds like something was going on recently) [15:50:20] RECOVERY - Check systemd state on ms-be1052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:01] 10SRE-OnFire, 10Gerrit, 10serviceops-collab, 10Release-Engineering-Team (GitLab III: GitLab in LA 🪃), and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10Jelto) ## LVM/disk setup >>! In T323262#8403593, @Dzahn wrote: > > I looked at tickets like T313250, T24... [15:52:00] Lucas_WMDE: I guess that should be fine. Amir1 what do you think? [15:52:21] yup go ahead [15:52:26] ok, thanks! [15:52:58] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [15:54:23] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/changePropertyDataType.php wikidatawiki --property-id P11136 --new-data-type string # T323470 [15:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:28] T323470: Change Property "shortened URL formatter (P11136)" data type from external identifier to string - https://phabricator.wikimedia.org/T323470 [15:54:41] * Lucas_WMDE done [15:55:04] RECOVERY - MariaDB Replica IO: s1 on db2174 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:56:03] Amir1: is there anything more needed to resolve the incident? afaik the instance is back online as a slave and no immediate action is needed anymore? [15:56:33] jelto: it should automatically be resolved given that the alert should be recovered [15:56:53] maybe it takes a bit?
[15:57:19] (03PS1) 10Bking: elastic: Increase small cluster heap memory from 8G to 10G [puppet] - 10https://gerrit.wikimedia.org/r/859094 [15:57:32] whoa that's a lot of databases https://netbox.wikimedia.org/dcim/racks/74/ [15:57:56] thankfully only db2173 and db2174 seem to be down: https://orchestrator.wikimedia.org/web/clusters [15:58:05] (otherwise, you'll see more black dots) [15:58:27] Amir1: the alert is in open and acknowledged state. But I can also wait a bit longer :) [15:58:44] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/859094 (owner: 10Bking) [15:58:53] it's 45m, I guess this specific one doesn't resolve itself [15:59:00] resolved it [15:59:16] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38371/console" [puppet] - 10https://gerrit.wikimedia.org/r/859090 (https://phabricator.wikimedia.org/T323365) (owner: 10Vgutierrez) [15:59:26] (can be that we are working on batching of alerts, and caused by it) [15:59:28] (03PS2) 10Bking: cloudelastic: Increase small cluster heap memory from 8G to 10G [puppet] - 10https://gerrit.wikimedia.org/r/859094 [15:59:37] 10SRE, 10ops-eqiad, 10decommission-hardware, 10serviceops-radar: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (10Clement_Goubert) [15:59:56] PROBLEM - IPMI Sensor Status on an-worker1148 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:59:57] I'll go bring db2173 back online [15:59:58] RECOVERY - MariaDB Replica SQL: s1 on db2174 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:00:11] haha [16:00:14] whatever [16:00:19] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] role::cache: Remove text|upload_haproxy roles [puppet] - 
10https://gerrit.wikimedia.org/r/859090 (https://phabricator.wikimedia.org/T323365) (owner: 10Vgutierrez) [16:00:23] Amir1: it recovered :) [16:01:58] RECOVERY - Check systemd state on dse-k8s-worker1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:03:16] how did you bring db2173 online? T322988 [16:03:16] T322988: db2173 HW errors - https://phabricator.wikimedia.org/T322988 [16:04:02] do you mean db2174? [16:04:11] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1051.eqiad.wmnet with OS bullseye [16:04:20] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudvirt1051.eqiad.wmnet with OS bu... [16:04:45] hmm, db2173 is in the same rack, I thought they have the same issue but it's actually not [16:05:17] 10SRE, 10Traffic, 10Patch-For-Review: Rename role::cache::(text|upload)_haproxy to role::cache::(text|upload) - https://phabricator.wikimedia.org/T323365 (10Vgutierrez) 05Open→03Resolved [16:05:19] what I meant is that host manuel couldn't switch it on at all [16:05:24] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez) [16:05:26] PROBLEM - Check systemd state on mx1001 is CRITICAL: CRITICAL - degraded: The following units failed: generate_otrs_aliases.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:05:29] and it is pending servicing by vendor [16:06:05] if you did put it up that's good, just surprised [16:06:36] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Citoid [16:08:04] PROBLEM - ensure kvm processes are running on cloudvirt1051 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:08:10] (03PS2) 10Ssingh: lvs4006: set profile::pybal::bgp to no [puppet] - 10https://gerrit.wikimedia.org/r/859086 (https://phabricator.wikimedia.org/T317247) [16:08:22] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [16:08:50] (03CR) 10Vgutierrez: [C: 03+1] Transferer: Enable PBKDF2 usage with 310000 iterations (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/859047 (https://phabricator.wikimedia.org/T323485) (owner: 10Jcrespo) [16:10:02] RECOVERY - ensure kvm processes are running on cloudvirt1051 is OK: PROCS OK: 2 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:10:17] (03CR) 10Vgutierrez: [C: 03+1] lvs4006: set profile::pybal::bgp to no [puppet] - 10https://gerrit.wikimedia.org/r/859086 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [16:10:39] (03PS1) 10Arturo Borrero Gonzalez: cloudvirt1050: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/859095 (https://phabricator.wikimedia.org/T319184) [16:12:50] RECOVERY - Check whether ferm is active by checking the default input chain on dse-k8s-worker1002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:12:58] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [16:17:02] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5030.eqsin.wmnet with 
OS buster [16:17:09] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5030.eqsin.wmnet with OS buster completed: - cp5030 (**PASS**) -... [16:18:45] (03PS2) 10Eevans: sessionstore: bump container version to v1.0.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/857711 (https://phabricator.wikimedia.org/T253244) [16:18:59] (03CR) 10Eevans: [C: 03+1] sessionstore: bump container version to v1.0.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/857711 (https://phabricator.wikimedia.org/T253244) (owner: 10Eevans) [16:19:16] (03CR) 10Elukey: "Found something in the helmfile that feels not in the right place to me, lemme know your thoughts! I am going to review the rbac requireme" [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [16:19:59] (03PS2) 10Jcrespo: Transferer: Enable PBKDF2 usage with 310000 iterations [software/transferpy] - 10https://gerrit.wikimedia.org/r/859047 (https://phabricator.wikimedia.org/T323485) [16:21:02] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:21:13] (03CR) 10Jcrespo: "Question (inline)" [software/transferpy] - 10https://gerrit.wikimedia.org/r/859047 (https://phabricator.wikimedia.org/T323485) (owner: 10Jcrespo) [16:21:53] (03PS1) 10Filippo Giunchedi: confd: create /var/run/confd-template [puppet] - 10https://gerrit.wikimedia.org/r/859102 (https://phabricator.wikimedia.org/T319163) [16:23:40] (03CR) 10Krinkle: [C: 03+1] webperf: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/858605 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [16:26:36] (03CR) 10Filippo Giunchedi: 
[V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38372/console" [puppet] - 10https://gerrit.wikimedia.org/r/859102 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [16:27:16] (03PS2) 10Filippo Giunchedi: confd: create /var/run/confd-template [puppet] - 10https://gerrit.wikimedia.org/r/859102 (https://phabricator.wikimedia.org/T321678) [16:30:05] jan_drewniak: #bothumor My software never has bugs. It just develops random features. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221121T1630). [16:30:23] (03CR) 10Vivian Rook: [C: 03+1] cloudvirt1050: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/859095 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [16:35:15] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host lvs4009.mgmt.ulsfo.wmnet with reboot policy FORCED [16:36:33] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 118 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:37:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T323214)', diff saved to https://phabricator.wikimedia.org/P40239 and previous config saved to /var/cache/conftool/dbconfig/20221121-163733-ladsgroup.json [16:37:39] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:37:40] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [16:38:12] 10SRE-OnFire, 10Discovery-Search, 
10Observability-Alerting, 10Sustainability (Incident Followup): Improve Search team alerting for missing masters - https://phabricator.wikimedia.org/T313095 (10Gehel) [16:38:19] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lvs4009.mgmt.ulsfo.wmnet with reboot policy FORCED [16:39:04] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host lvs4010.mgmt.ulsfo.wmnet with reboot policy FORCED [16:43:38] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lvs4010.mgmt.ulsfo.wmnet with reboot policy FORCED [16:45:27] 10ops-codfw, 10DBA: db1174 lost power - https://phabricator.wikimedia.org/T323512 (10Ladsgroup) [16:45:34] made T323512 [16:45:34] T323512: db1174 lost power - https://phabricator.wikimedia.org/T323512 [16:46:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T323214)', diff saved to https://phabricator.wikimedia.org/P40240 and previous config saved to /var/cache/conftool/dbconfig/20221121-164620-ladsgroup.json [16:46:27] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [16:47:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [16:47:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [16:47:57] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:48:11] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859104 (https://phabricator.wikimedia.org/T128546) [16:49:55] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859104 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) 
[16:49:57] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero) [16:50:49] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859104 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:52:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P40241 and previous config saved to /var/cache/conftool/dbconfig/20221121-165240-ladsgroup.json [16:53:28] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: (1) VM request for cloud-cumin1001 - https://phabricator.wikimedia.org/T323516 (10Volans) [16:53:47] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: (1) VM request for cloud-cumin1001 - https://phabricator.wikimedia.org/T323516 (10Volans) p:05Triage→03Medium a:03Volans [16:54:01] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: (1) VM request for cloud-cumin1001 - https://phabricator.wikimedia.org/T323516 (10Volans) [16:54:03] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10Volans) [16:54:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [16:54:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [16:56:33] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:859104| Bumping portals to master (T128546)]] (duration: 03m 36s) [16:56:40] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:57:30] 10ops-codfw, 10DBA: db1174 lost power - 
https://phabricator.wikimedia.org/T323512 (10jcrespo) First timeout matches that log: ` Service Unknown[2022-11-21 15:11:00] SERVICE ALERT: db2174;Check for large files in client bucket;UNKNOWN;SOFT;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds. ` However,... [16:58:51] (03PS1) 10Hnowlan: thumbor: fix metrics prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/859106 (https://phabricator.wikimedia.org/T233196) [16:59:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [16:59:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [17:00:12] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:859104| Bumping portals to master (T128546)]] (duration: 03m 38s) [17:00:25] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host lvs4009.mgmt.ulsfo.wmnet with reboot policy FORCED [17:01:23] RECOVERY - Check systemd state on mx1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P40242 and previous config saved to /var/cache/conftool/dbconfig/20221121-170127-ladsgroup.json [17:01:50] 10SRE, 10Infrastructure-Foundations, 10vm-requests: codfw: (1) VM request for cumin-cloud2001 - https://phabricator.wikimedia.org/T323518 (10Volans) [17:02:16] 10SRE, 10Infrastructure-Foundations, 10vm-requests: codfw: (1) VM request for cumin-cloud2001 - https://phabricator.wikimedia.org/T323518 (10Volans) p:05Triage→03Medium a:03Volans [17:02:47] 10SRE, 10Infrastructure-Foundations, 10vm-requests: codfw: (1) VM request for cumin-cloud2001 - https://phabricator.wikimedia.org/T323518 (10Volans) [17:02:50] 10SRE-tools, 10Infrastructure-Foundations, 
10cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10Volans) [17:03:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [17:03:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [17:03:35] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10Volans) [17:03:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [17:03:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [17:03:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T322618)', diff saved to https://phabricator.wikimedia.org/P40243 and previous config saved to /var/cache/conftool/dbconfig/20221121-170357-ladsgroup.json [17:04:03] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [17:04:33] (03PS1) 10Papaul: Add kafka-jumbo101[0-5] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/859107 (https://phabricator.wikimedia.org/T306939) [17:04:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance [17:04:50] 10ops-codfw, 10DBA: db2174 lost power - https://phabricator.wikimedia.org/T323512 (10Papaul) [17:04:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance [17:05:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [17:05:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [17:05:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T322618)', diff saved to https://phabricator.wikimedia.org/P40244 and previous config saved to /var/cache/conftool/dbconfig/20221121-170529-ladsgroup.json [17:05:46] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host lvs4010.mgmt.ulsfo.wmnet with reboot policy FORCED [17:06:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T322618)', diff saved to https://phabricator.wikimedia.org/P40245 and previous config saved to /var/cache/conftool/dbconfig/20221121-170608-ladsgroup.json [17:06:10] 10ops-codfw, 10DBA: db2174 lost power - https://phabricator.wikimedia.org/T323512 (10Papaul) i checked all looks good on the server. @Ladsgroup can you confirm that all is good us on your end in this server ? Thanks [17:06:55] (03PS1) 10Bartosz Dziewoński: Fix no-JS Special:Notifications only displaying one notification per day [extensions/Echo] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/859071 (https://phabricator.wikimedia.org/T323491) [17:07:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P40246 and previous config saved to /var/cache/conftool/dbconfig/20221121-170746-ladsgroup.json [17:08:14] (03CR) 10Papaul: [C: 03+2] Add kafka-jumbo101[0-5] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/859107 (https://phabricator.wikimedia.org/T306939) (owner: 10Papaul) [17:09:03] (03PS2) 10Papaul: Add kafka-jumbo101[0-5] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/859107 (https://phabricator.wikimedia.org/T306939) [17:11:43] (03CR) 10Dzahn: vrts: add error checking (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/858716 (https://phabricator.wikimedia.org/T317059) (owner: 10AOkoth) [17:14:20] !log 
robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs4009.mgmt.ulsfo.wmnet with reboot policy FORCED [17:15:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [17:16:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [17:16:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2105 (T322618)', diff saved to https://phabricator.wikimedia.org/P40247 and previous config saved to /var/cache/conftool/dbconfig/20221121-171615-ladsgroup.json [17:16:20] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [17:16:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P40248 and previous config saved to /var/cache/conftool/dbconfig/20221121-171635-ladsgroup.json [17:16:37] 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Grant ssh access to analytics-admins to dcausse and gmodena - https://phabricator.wikimedia.org/T323280 (10Ottomata) [17:16:52] 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Grant ssh access to analytics-admins to dcausse and gmodena - https://phabricator.wikimedia.org/T323280 (10Ottomata) Done, I removed irrelevant parts, if that is okay. 
[17:17:58] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs4010.mgmt.ulsfo.wmnet with reboot policy FORCED [17:18:30] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs4009'] [17:19:42] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs4010'] [17:19:56] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['lvs4010'] [17:19:58] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['lvs4009'] [17:21:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P40249 and previous config saved to /var/cache/conftool/dbconfig/20221121-172114-ladsgroup.json [17:22:33] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10RobH) [17:22:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T323214)', diff saved to https://phabricator.wikimedia.org/P40250 and previous config saved to /var/cache/conftool/dbconfig/20221121-172253-ladsgroup.json [17:22:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1110.eqiad.wmnet with reason: Maintenance [17:22:59] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [17:23:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1110.eqiad.wmnet with reason: Maintenance [17:23:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T323214)', diff saved to https://phabricator.wikimedia.org/P40251 and previous config saved to /var/cache/conftool/dbconfig/20221121-172314-ladsgroup.json [17:26:49] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P40252 and previous config saved to /var/cache/conftool/dbconfig/20221121-172648-ladsgroup.json [17:27:49] (03PS1) 10Filippo Giunchedi: pontoon: write out puppet/pki CA certs [puppet] - 10https://gerrit.wikimedia.org/r/859112 (https://phabricator.wikimedia.org/T319163) [17:28:30] (03CR) 10CI reject: [V: 04-1] pontoon: write out puppet/pki CA certs [puppet] - 10https://gerrit.wikimedia.org/r/859112 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [17:28:53] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 112 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:30:33] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10RobH) [17:30:53] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 21 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:31:19] (03CR) 10Brennen Bearnes: [C: 03+1] phabricator: remove hardcoded ports, use parameters in my.cnf for admins [puppet] - 10https://gerrit.wikimedia.org/r/858656 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [17:31:21] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-jumbo1010.eqiad.wmnet with OS bullseye [17:31:39] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host 
kafka-jumbo1010.eqiad.wmnet with OS bullseye [17:31:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T323214)', diff saved to https://phabricator.wikimedia.org/P40253 and previous config saved to /var/cache/conftool/dbconfig/20221121-173141-ladsgroup.json [17:31:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [17:31:49] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [17:31:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [17:32:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T323214)', diff saved to https://phabricator.wikimedia.org/P40254 and previous config saved to /var/cache/conftool/dbconfig/20221121-173203-ladsgroup.json [17:34:06] (03PS2) 10Filippo Giunchedi: pontoon: write out puppet/pki CA certs [puppet] - 10https://gerrit.wikimedia.org/r/859112 (https://phabricator.wikimedia.org/T319163) [17:36:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P40255 and previous config saved to /var/cache/conftool/dbconfig/20221121-173621-ladsgroup.json [17:38:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P40256 and previous config saved to /var/cache/conftool/dbconfig/20221121-173800-ladsgroup.json [17:41:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P40257 and previous config saved to /var/cache/conftool/dbconfig/20221121-174153-ladsgroup.json [17:42:26] 10SRE, 10serviceops: Add `supervised` option to redis configuration - https://phabricator.wikimedia.org/T212102 (10jijiki) [17:43:57] 
10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10Papaul) @BTullis thanks for the update looks like we have an issue with the partman recipe can you please take a look and let me know thanks ` ────────────────────┤ [!] Partit... [17:45:14] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Encoding issues when handling unicode characters in filenames - https://phabricator.wikimedia.org/T323114 (10hnowlan) This appears to be fixed. The issue relates to us calling Tornado's `set_header` with a string that contains non-ascii cha... [17:45:35] (03PS1) 10Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859114 [17:48:53] (03CR) 10CI reject: [V: 04-1] cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859114 (owner: 10Arturo Borrero Gonzalez) [17:51:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T322618)', diff saved to https://phabricator.wikimedia.org/P40258 and previous config saved to /var/cache/conftool/dbconfig/20221121-175127-ladsgroup.json [17:51:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [17:51:38] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [17:51:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [17:51:47] 10SRE, 10observability, 10serviceops, 10Patch-For-Review, and 2 others: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (10jijiki) 05Open→03Resolved We haven't had any issues caused due to high memcached traffic for quite a long time. Our measures (gutter pool, o... 
[17:51:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T322618)', diff saved to https://phabricator.wikimedia.org/P40259 and previous config saved to /var/cache/conftool/dbconfig/20221121-175149-ladsgroup.json [17:53:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T322618)', diff saved to https://phabricator.wikimedia.org/P40260 and previous config saved to /var/cache/conftool/dbconfig/20221121-175306-ladsgroup.json [17:53:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2125.codfw.wmnet with reason: Maintenance [17:53:14] 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Grant ssh access to analytics-admins to dcausse and gmodena - https://phabricator.wikimedia.org/T323280 (10jcrespo) >>! In T323280#8410218, @Ottomata wrote: > Done, I removed irrelevant parts, if that is okay. 👍 Sorry to be pedantic about this, it is not me... [17:53:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2125.codfw.wmnet with reason: Maintenance [17:53:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T322618)', diff saved to https://phabricator.wikimedia.org/P40261 and previous config saved to /var/cache/conftool/dbconfig/20221121-175328-ladsgroup.json [17:54:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T322618)', diff saved to https://phabricator.wikimedia.org/P40262 and previous config saved to /var/cache/conftool/dbconfig/20221121-175359-ladsgroup.json [17:55:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T322618)', diff saved to https://phabricator.wikimedia.org/P40263 and previous config saved to /var/cache/conftool/dbconfig/20221121-175548-ladsgroup.json [17:56:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P40264 and previous config 
saved to /var/cache/conftool/dbconfig/20221121-175658-ladsgroup.json [17:59:28] !log bking@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply config changes - bking@cumin1001 - T319020 [17:59:30] 10SRE, 10ops-codfw, 10DBA: db2174 lost power - https://phabricator.wikimedia.org/T323512 (10jcrespo) @Papaul, I wonder if we could do a "simple" test of checking the power supply redundancy by "pulling the plug" (literally or just pushing the on/off button) to check the power redundancy is working as it is e... [17:59:37] T319020: Reset to upstream java GC options and remove redundant JVM options - https://phabricator.wikimedia.org/T319020 [18:00:05] ryankemper: My dear minions, it's time we take the moon! Just kidding. Time for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221121T1800). [18:00:16] 10SRE, 10ops-codfw, 10DBA: db2174 lost power - https://phabricator.wikimedia.org/T323512 (10jcrespo) p:05Triage→03High [18:00:19] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply config changes - bking@cumin1001 - T319020 [18:01:17] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 152 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:02:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [18:02:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [18:05:17] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: 
All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:05:46] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with reboot policy FORCED [18:06:46] (03CR) 10Dzahn: [V: 03+1 C: 03+1] phabricator: remove hardcoded ports, use parameters in my.cnf for admins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/858656 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [18:07:07] (03CR) 10Dzahn: [V: 03+1 C: 03+2] phabricator: remove hardcoded ports, use parameters in my.cnf for admins [puppet] - 10https://gerrit.wikimedia.org/r/858656 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [18:09:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P40265 and previous config saved to /var/cache/conftool/dbconfig/20221121-180906-ladsgroup.json [18:09:23] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "noop confirmed on all 3 servers" [puppet] - 10https://gerrit.wikimedia.org/r/858656 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [18:10:10] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Encoding issues when handling unicode characters in filenames - https://phabricator.wikimedia.org/T323114 (10Vlad.shapik) >>! In T323114#8410298, @hnowlan wrote: > This appears to be fixed. The issue relates to us calling Tornado's `set_hea... 
[18:10:18] (03CR) 10Dzahn: [C: 03+2] dumps: remove phab1001 from rsync clients [puppet] - 10https://gerrit.wikimedia.org/r/858662 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [18:10:20] (03PS3) 10Herron: slo_dashboards: move to one SLO/SLI per dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/849131 (https://phabricator.wikimedia.org/T320749) [18:10:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [18:10:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [18:10:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:10:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P40266 and previous config saved to /var/cache/conftool/dbconfig/20221121-181054-ladsgroup.json [18:11:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:11:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T322618)', diff saved to https://phabricator.wikimedia.org/P40267 and previous config saved to /var/cache/conftool/dbconfig/20221121-181116-ladsgroup.json [18:11:22] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [18:12:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P40268 and previous config saved to /var/cache/conftool/dbconfig/20221121-181203-ladsgroup.json [18:15:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T322618)', diff saved to 
https://phabricator.wikimedia.org/P40269 and previous config saved to /var/cache/conftool/dbconfig/20221121-181512-ladsgroup.json [18:17:40] (03PS1) 10DDesouza: Deploy Research Incentive survey on swwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859125 (https://phabricator.wikimedia.org/T321252) [18:19:49] PROBLEM - SSH on mw1332.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:20:16] (CirrusSearchHighOldGCFrequency) firing: (3) Elasticsearch instance cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [18:22:18] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with reboot policy FORCED [18:22:50] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with reboot policy FORCED [18:22:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [18:23:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [18:23:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2105 (T322618)', diff saved to https://phabricator.wikimedia.org/P40270 and previous config saved to /var/cache/conftool/dbconfig/20221121-182306-ladsgroup.json [18:23:12] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [18:24:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P40271 and previous config saved to 
/var/cache/conftool/dbconfig/20221121-182412-ladsgroup.json [18:26:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P40272 and previous config saved to /var/cache/conftool/dbconfig/20221121-182601-ladsgroup.json [18:27:05] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-jumbo1010.eqiad.wmnet with OS bullseye [18:27:09] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-jumbo1010.eqiad.wmnet with OS bullseye executed with errors: - kafka-jumbo1... [18:29:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q2:rack/setup/install an-coord100[3,4] & an-mariadb100[1,2] - https://phabricator.wikimedia.org/T321119 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson an-coord1003 E1 U36 Port 36 Cableid # 20220001 an... 
[18:30:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q2:rack/setup/install an-coord100[3,4] & an-mariadb100[1,2] - https://phabricator.wikimedia.org/T321119 (10Jclark-ctr) [18:30:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P40273 and previous config saved to /var/cache/conftool/dbconfig/20221121-183019-ladsgroup.json [18:31:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T322618)', diff saved to https://phabricator.wikimedia.org/P40274 and previous config saved to /var/cache/conftool/dbconfig/20221121-183104-ladsgroup.json [18:31:10] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [18:31:17] (03PS1) 10Hashar: Display Zuul status of jobs for a change in Gerrit [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/859127 (https://phabricator.wikimedia.org/T214068) [18:36:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T323214)', diff saved to https://phabricator.wikimedia.org/P40275 and previous config saved to /var/cache/conftool/dbconfig/20221121-183639-ladsgroup.json [18:36:45] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [18:39:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T322618)', diff saved to https://phabricator.wikimedia.org/P40276 and previous config saved to /var/cache/conftool/dbconfig/20221121-183919-ladsgroup.json [18:39:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [18:39:25] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [18:39:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [18:39:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [18:39:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [18:40:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T322618)', diff saved to https://phabricator.wikimedia.org/P40277 and previous config saved to /var/cache/conftool/dbconfig/20221121-183959-ladsgroup.json [18:40:39] 10SRE, 10ops-codfw: Degraded RAID on ganeti2013 - https://phabricator.wikimedia.org/T323222 (10Dzahn) [18:40:42] 10SRE, 10ops-codfw: Broken disk on ganeti2013 - https://phabricator.wikimedia.org/T323220 (10Dzahn) [18:41:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T322618)', diff saved to https://phabricator.wikimedia.org/P40278 and previous config saved to /var/cache/conftool/dbconfig/20221121-184107-ladsgroup.json [18:41:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2126.codfw.wmnet with reason: Maintenance [18:41:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2126.codfw.wmnet with reason: Maintenance [18:41:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance [18:41:33] !log remove dnsdist 1.7.2-1+wmf11u1 from apt.wm.o (bullseye, erroneously imported in main) [18:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance [18:41:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T322618)', diff saved to https://phabricator.wikimedia.org/P40279 and previous 
config saved to /var/cache/conftool/dbconfig/20221121-184155-ladsgroup.json [18:42:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T322618)', diff saved to https://phabricator.wikimedia.org/P40280 and previous config saved to /var/cache/conftool/dbconfig/20221121-184210-ladsgroup.json [18:44:14] !log reprepro -C component/dnsdist include bullseye-wikimedia dnsdist_1.7.2-1+wmf11u1_amd64.changes: T305589 [18:44:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T322618)', diff saved to https://phabricator.wikimedia.org/P40281 and previous config saved to /var/cache/conftool/dbconfig/20221121-184414-ladsgroup.json [18:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:20] T305589: Upgrading Wikidough and durum VMs to bullseye - https://phabricator.wikimedia.org/T305589 [18:44:34] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10Ottomata) @Papaul is it possible the ssds and hdds are reversed, as they were in https://phabricator.wikimedia.org/T314160#8166665 ? [18:45:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P40282 and previous config saved to /var/cache/conftool/dbconfig/20221121-184525-ladsgroup.json [18:46:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P40283 and previous config saved to /var/cache/conftool/dbconfig/20221121-184610-ladsgroup.json [18:47:05] (03CR) 10Dzahn: [C: 03+2] "query tested - reduces number of results from 30 to 10, works" [puppet] - 10https://gerrit.wikimedia.org/r/858989 (https://phabricator.wikimedia.org/T323466) (owner: 10Aklapper) [18:47:39] (03CR) 10Ottomata: "Uhh, I just found this in my draft comments in gerrit. 
I see the patch has been abandoned, so feel free to ignore. Just wanted to post th" [puppet] - 10https://gerrit.wikimedia.org/r/757800 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [18:48:09] (03CR) 10Dzahn: [C: 03+2] phabricator weekly changes email: Reorder output for column changes [puppet] - 10https://gerrit.wikimedia.org/r/858990 (owner: 10Aklapper) [18:48:14] (03PS2) 10Dzahn: phabricator weekly changes email: Reorder output for column changes [puppet] - 10https://gerrit.wikimedia.org/r/858990 (owner: 10Aklapper) [18:50:24] (03CR) 10Dzahn: [C: 04-1] "parameter 'stats_hosts' expects an Array value, got String" [puppet] - 10https://gerrit.wikimedia.org/r/852260 (owner: 10Dzahn) [18:51:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P40284 and previous config saved to /var/cache/conftool/dbconfig/20221121-185145-ladsgroup.json [18:52:27] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:53:06] (03CR) 10Dzahn: [C: 03+2] "query tested - it's empty currently but there is a result if the time period becomes longer" [puppet] - 10https://gerrit.wikimedia.org/r/859002 (https://phabricator.wikimedia.org/T323477) (owner: 10Aklapper) [18:53:17] (03PS2) 10Dzahn: phabricator weekly changes email: List portal changes [puppet] - 10https://gerrit.wikimedia.org/r/859002 (https://phabricator.wikimedia.org/T323477) (owner: 10Aklapper) [18:53:52] (03PS1) 10BCornwall: cp5031: Set cp role via site.pp and related config [puppet] - 10https://gerrit.wikimedia.org/r/859128 (https://phabricator.wikimedia.org/T322048) [18:55:01] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with reboot policy FORCED [18:55:56] (03CR) 10Ssingh: [C: 03+1] cp5031: Set cp role via site.pp and related config [puppet] - 
10https://gerrit.wikimedia.org/r/859128 (https://phabricator.wikimedia.org/T322048) (owner: 10BCornwall) [18:57:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P40285 and previous config saved to /var/cache/conftool/dbconfig/20221121-185716-ladsgroup.json [18:57:51] (03CR) 10Ottomata: WIP flink image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [18:59:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P40286 and previous config saved to /var/cache/conftool/dbconfig/20221121-185920-ladsgroup.json [19:00:16] (03CR) 10BCornwall: [C: 03+2] cp5031: Set cp role via site.pp and related config [puppet] - 10https://gerrit.wikimedia.org/r/859128 (https://phabricator.wikimedia.org/T322048) (owner: 10BCornwall) [19:00:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T322618)', diff saved to https://phabricator.wikimedia.org/P40287 and previous config saved to /var/cache/conftool/dbconfig/20221121-190032-ladsgroup.json [19:00:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [19:00:38] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [19:00:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [19:01:14] (03CR) 10Dzahn: "arr, needs rebasing but I think I went in the right order." 
[puppet] - 10https://gerrit.wikimedia.org/r/858994 (https://phabricator.wikimedia.org/T323471) (owner: 10Aklapper) [19:01:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P40288 and previous config saved to /var/cache/conftool/dbconfig/20221121-190117-ladsgroup.json [19:02:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [19:03:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [19:03:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T322618)', diff saved to https://phabricator.wikimedia.org/P40289 and previous config saved to /var/cache/conftool/dbconfig/20221121-190306-ladsgroup.json [19:04:16] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp5031.eqsin.wmnet with OS buster [19:04:24] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp5031.eqsin.wmnet with OS buster [19:05:21] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/859094 (owner: 10Bking) [19:05:45] (03PS4) 10Dzahn: phabricator weekly changes email: List dashboard changes [puppet] - 10https://gerrit.wikimedia.org/r/858994 (https://phabricator.wikimedia.org/T323471) (owner: 10Aklapper) [19:06:03] (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:06:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db2123', diff saved to https://phabricator.wikimedia.org/P40290 and previous config saved to /var/cache/conftool/dbconfig/20221121-190652-ladsgroup.json [19:07:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T322618)', diff saved to https://phabricator.wikimedia.org/P40291 and previous config saved to /var/cache/conftool/dbconfig/20221121-190702-ladsgroup.json [19:07:08] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [19:07:12] (03CR) 10Jgiannelos: api-gateway: expose restbase /api/ endpoint (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/852165 (https://phabricator.wikimedia.org/T322152) (owner: 10Hnowlan) [19:08:33] (03PS5) 10Aklapper: phabricator weekly changes email: List dashboard changes [puppet] - 10https://gerrit.wikimedia.org/r/858994 (https://phabricator.wikimedia.org/T323471) [19:08:47] (03CR) 10Dzahn: [C: 03+2] "query tested. returns currently 1 dashboard (linked in ticket)" [puppet] - 10https://gerrit.wikimedia.org/r/858994 (https://phabricator.wikimedia.org/T323471) (owner: 10Aklapper) [19:09:10] (03CR) 10Aklapper: "Argh, ignore patchset 5" [puppet] - 10https://gerrit.wikimedia.org/r/858994 (https://phabricator.wikimedia.org/T323471) (owner: 10Aklapper) [19:10:43] (03PS6) 10Aklapper: phabricator weekly changes email: List dashboard changes [puppet] - 10https://gerrit.wikimedia.org/r/858994 (https://phabricator.wikimedia.org/T323471) [19:12:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P40292 and previous config saved to /var/cache/conftool/dbconfig/20221121-191223-ladsgroup.json [19:13:13] (03CR) 10Dzahn: [C: 03+2] "ah you fixed the ticket link :) ack" [puppet] - 10https://gerrit.wikimedia.org/r/858994 (https://phabricator.wikimedia.org/T323471) (owner: 10Aklapper) [19:13:35] PROBLEM - Citoid LVS eqiad on 
citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [19:14:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P40293 and previous config saved to /var/cache/conftool/dbconfig/20221121-191427-ladsgroup.json [19:15:49] (03CR) 10Dzahn: [C: 03+2] Use default mail relay for miscweb* hosts [puppet] - 10https://gerrit.wikimedia.org/r/858297 (owner: 10Muehlenhoff) [19:16:03] (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:16:22] (03CR) 10Dzahn: [C: 03+2] "deployed!" [puppet] - 10https://gerrit.wikimedia.org/r/858994 (https://phabricator.wikimedia.org/T323471) (owner: 10Aklapper) [19:16:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T322618)', diff saved to https://phabricator.wikimedia.org/P40294 and previous config saved to /var/cache/conftool/dbconfig/20221121-191624-ladsgroup.json [19:16:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance [19:16:30] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [19:16:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance [19:16:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T322618)', diff saved to https://phabricator.wikimedia.org/P40295 and previous config saved to /var/cache/conftool/dbconfig/20221121-191656-ladsgroup.json [19:17:39] RECOVERY - Citoid LVS eqiad on 
citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [19:17:46] (03CR) 10Aklapper: "Merci merci! <3" [puppet] - 10https://gerrit.wikimedia.org/r/858994 (https://phabricator.wikimedia.org/T323471) (owner: 10Aklapper) [19:20:43] RECOVERY - SSH on mw1332.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:21:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T323214)', diff saved to https://phabricator.wikimedia.org/P40296 and previous config saved to /var/cache/conftool/dbconfig/20221121-192158-ladsgroup.json [19:22:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance [19:22:05] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [19:22:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P40297 and previous config saved to /var/cache/conftool/dbconfig/20221121-192210-ladsgroup.json [19:22:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance [19:22:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [19:22:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [19:22:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T323214)', diff saved to https://phabricator.wikimedia.org/P40298 and previous config saved to /var/cache/conftool/dbconfig/20221121-192246-ladsgroup.json [19:22:55] PROBLEM - Check systemd state on ml-serve2007 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:24:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T322618)', diff saved to https://phabricator.wikimedia.org/P40299 and previous config saved to /var/cache/conftool/dbconfig/20221121-192446-ladsgroup.json [19:24:57] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [19:27:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T322618)', diff saved to https://phabricator.wikimedia.org/P40300 and previous config saved to /var/cache/conftool/dbconfig/20221121-192729-ladsgroup.json [19:27:55] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [19:28:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T323214)', diff saved to https://phabricator.wikimedia.org/P40301 and previous config saved to /var/cache/conftool/dbconfig/20221121-192818-ladsgroup.json [19:28:24] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [19:29:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T322618)', diff saved to https://phabricator.wikimedia.org/P40302 and previous config saved to /var/cache/conftool/dbconfig/20221121-192933-ladsgroup.json [19:29:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2138.codfw.wmnet with reason: Maintenance [19:29:47] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [19:30:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2138.codfw.wmnet with reason: Maintenance 
[19:30:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T322618)', diff saved to https://phabricator.wikimedia.org/P40303 and previous config saved to /var/cache/conftool/dbconfig/20221121-193006-ladsgroup.json [19:30:12] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [19:30:13] (KubernetesAPILatency) firing: (12) High Kubernetes API latency (LIST authorizationpolicies) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:31:18] (ProbeDown) firing: (3) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:31:33] (ProbeDown) resolved: (3) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:31:38] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5031.eqsin.wmnet with reason: host reimage [19:32:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T322618)', diff saved to https://phabricator.wikimedia.org/P40304 and previous config saved to /var/cache/conftool/dbconfig/20221121-193225-ladsgroup.json [19:34:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [19:34:37] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply config changes - bking@cumin1001 - T319020 [19:34:42] T319020: Reset to upstream java GC options and remove redundant JVM options - 
https://phabricator.wikimedia.org/T319020 [19:34:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [19:34:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [19:34:57] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5031.eqsin.wmnet with reason: host reimage [19:35:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [19:35:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T322618)', diff saved to https://phabricator.wikimedia.org/P40305 and previous config saved to /var/cache/conftool/dbconfig/20221121-193512-ladsgroup.json [19:35:18] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [19:37:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P40306 and previous config saved to /var/cache/conftool/dbconfig/20221121-193717-ladsgroup.json [19:37:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T322618)', diff saved to https://phabricator.wikimedia.org/P40307 and previous config saved to /var/cache/conftool/dbconfig/20221121-193722-ladsgroup.json [19:39:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P40308 and previous config saved to /var/cache/conftool/dbconfig/20221121-193953-ladsgroup.json [19:43:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P40309 and previous config saved to 
/var/cache/conftool/dbconfig/20221121-194324-ladsgroup.json [19:43:42] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [19:44:08] PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve2007 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:44:54] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [19:46:01] (03PS2) 10BCornwall: node: Exclude trafficserver promfile mtime check [alerts] - 10https://gerrit.wikimedia.org/r/858658 [19:46:44] (03CR) 10BCornwall: "I also updated the regex to combine the \.prom suffix to the matches for deduplication." [alerts] - 10https://gerrit.wikimedia.org/r/858658 (owner: 10BCornwall) [19:47:10] (03CR) 10Gergő Tisza: "Thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/858556 (owner: 10Jbond) [19:47:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P40310 and previous config saved to /var/cache/conftool/dbconfig/20221121-194731-ladsgroup.json [19:48:17] (03CR) 10CI reject: [V: 04-1] node: Exclude trafficserver promfile mtime check [alerts] - 10https://gerrit.wikimedia.org/r/858658 (owner: 10BCornwall) [19:48:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:52:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T322618)', diff saved to https://phabricator.wikimedia.org/P40311 and previous config saved to /var/cache/conftool/dbconfig/20221121-195223-ladsgroup.json [19:52:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [19:52:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P40312 and previous config saved to /var/cache/conftool/dbconfig/20221121-195229-ladsgroup.json [19:52:30] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [19:52:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [19:52:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T322618)', diff saved to https://phabricator.wikimedia.org/P40313 and previous config saved to /var/cache/conftool/dbconfig/20221121-195244-ladsgroup.json [19:55:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P40314 and previous 
config saved to /var/cache/conftool/dbconfig/20221121-195459-ladsgroup.json [19:56:32] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 120 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:58:14] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:58:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P40315 and previous config saved to /var/cache/conftool/dbconfig/20221121-195831-ladsgroup.json [20:01:46] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:02:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P40316 and previous config saved to /var/cache/conftool/dbconfig/20221121-200238-ladsgroup.json [20:03:18] RECOVERY - Check systemd state on ml-serve2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:06:05] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5031.eqsin.wmnet with OS buster [20:06:13] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp5031.eqsin.wmnet with OS buster completed: - cp5031 (**PASS**) -... 
[20:07:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P40317 and previous config saved to /var/cache/conftool/dbconfig/20221121-200735-ladsgroup.json [20:10:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T322618)', diff saved to https://phabricator.wikimedia.org/P40318 and previous config saved to /var/cache/conftool/dbconfig/20221121-201006-ladsgroup.json [20:10:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance [20:10:13] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [20:10:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance [20:11:11] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ssingh) [20:13:18] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10BCornwall) [20:13:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T323214)', diff saved to https://phabricator.wikimedia.org/P40319 and previous config saved to /var/cache/conftool/dbconfig/20221121-201338-ladsgroup.json [20:13:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance [20:13:45] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [20:13:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance [20:13:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T323214)', diff 
saved to https://phabricator.wikimedia.org/P40320 and previous config saved to /var/cache/conftool/dbconfig/20221121-201359-ladsgroup.json [20:14:00] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ssingh) [20:15:06] RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve2007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:16:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance [20:16:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance [20:16:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T322618)', diff saved to https://phabricator.wikimedia.org/P40321 and previous config saved to /var/cache/conftool/dbconfig/20221121-201648-ladsgroup.json [20:16:53] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [20:17:03] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@48c230a]: transfer_to_es: Allow first run of wait_for_incoming_links [20:17:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T322618)', diff saved to https://phabricator.wikimedia.org/P40322 and previous config saved to /var/cache/conftool/dbconfig/20221121-201747-ladsgroup.json [20:17:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2148.codfw.wmnet with reason: Maintenance [20:18:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2148.codfw.wmnet with reason: Maintenance [20:18:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T322618)', diff saved to https://phabricator.wikimedia.org/P40323 and previous 
config saved to /var/cache/conftool/dbconfig/20221121-201809-ladsgroup.json [20:18:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T322618)', diff saved to https://phabricator.wikimedia.org/P40324 and previous config saved to /var/cache/conftool/dbconfig/20221121-201842-ladsgroup.json [20:19:17] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@48c230a]: transfer_to_es: Allow first run of wait_for_incoming_links (duration: 02m 14s) [20:20:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T322618)', diff saved to https://phabricator.wikimedia.org/P40325 and previous config saved to /var/cache/conftool/dbconfig/20221121-202027-ladsgroup.json [20:22:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T322618)', diff saved to https://phabricator.wikimedia.org/P40326 and previous config saved to /var/cache/conftool/dbconfig/20221121-202242-ladsgroup.json [20:22:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [20:22:48] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [20:22:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [20:23:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T322618)', diff saved to https://phabricator.wikimedia.org/P40327 and previous config saved to /var/cache/conftool/dbconfig/20221121-202303-ladsgroup.json [20:24:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T322618)', diff saved to https://phabricator.wikimedia.org/P40328 and previous config saved to /var/cache/conftool/dbconfig/20221121-202449-ladsgroup.json [20:25:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 
(T322618)', diff saved to https://phabricator.wikimedia.org/P40329 and previous config saved to /var/cache/conftool/dbconfig/20221121-202513-ladsgroup.json [20:29:42] (CR) Dzahn: [V: +1 C: +2] phabricator: remove phab1001 as src_host from migration class [puppet] - https://gerrit.wikimedia.org/r/858420 (https://phabricator.wikimedia.org/T323418) (owner: Dzahn) [20:29:48] (PS3) Dzahn: phabricator: remove phab1001 as src_host from migration class [puppet] - https://gerrit.wikimedia.org/r/858420 (https://phabricator.wikimedia.org/T323418) [20:32:26] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [20:33:24] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [20:33:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P40330 and previous config saved to /var/cache/conftool/dbconfig/20221121-203349-ladsgroup.json [20:35:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T323214)', diff saved to https://phabricator.wikimedia.org/P40331 and previous config saved to /var/cache/conftool/dbconfig/20221121-203513-ladsgroup.json [20:35:19] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [20:35:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P40332 and previous config saved to /var/cache/conftool/dbconfig/20221121-203534-ladsgroup.json [20:38:40] (CR) Dzahn: "noop. 
this is just used when bootstrapping new servers" [puppet] - https://gerrit.wikimedia.org/r/858420 (https://phabricator.wikimedia.org/T323418) (owner: Dzahn) [20:39:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P40333 and previous config saved to /var/cache/conftool/dbconfig/20221121-203956-ladsgroup.json [20:40:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P40334 and previous config saved to /var/cache/conftool/dbconfig/20221121-204020-ladsgroup.json [20:48:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P40335 and previous config saved to /var/cache/conftool/dbconfig/20221121-204855-ladsgroup.json [20:50:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P40336 and previous config saved to /var/cache/conftool/dbconfig/20221121-205019-ladsgroup.json [20:50:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P40337 and previous config saved to /var/cache/conftool/dbconfig/20221121-205041-ladsgroup.json [20:55:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P40338 and previous config saved to /var/cache/conftool/dbconfig/20221121-205502-ladsgroup.json [20:55:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P40339 and previous config saved to /var/cache/conftool/dbconfig/20221121-205526-ladsgroup.json [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: gettimeofday() says it's time for UTC late backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221121T2100) [21:00:04] MatmaRex and danisztls: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:15] o/ [21:00:26] hi [21:00:39] * TheresNoTime can deploy! [21:01:18] danisztls: going to start with yours :) [21:01:39] :) [21:01:43] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/859125 (https://phabricator.wikimedia.org/T321252) (owner: DDesouza) [21:02:32] (Merged) jenkins-bot: Deploy Research Incentive survey on swwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/859125 (https://phabricator.wikimedia.org/T321252) (owner: DDesouza) [21:02:47] !log samtar@deploy1002 Started scap: Backport for [[gerrit:859125|Deploy Research Incentive survey on swwiki (T321252)]] [21:02:53] T321252: Deploy Research Incentive Survey on Swahili Wikipedia - https://phabricator.wikimedia.org/T321252 [21:03:08] !log samtar@deploy1002 samtar and dani: Backport for [[gerrit:859125|Deploy Research Incentive survey on swwiki (T321252)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [21:03:10] danisztls: live on mwdebug, can you test? 
[21:03:16] TheresNoTime: yes [21:03:57] TheresNoTime: all is good [21:04:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T322618)', diff saved to https://phabricator.wikimedia.org/P40340 and previous config saved to /var/cache/conftool/dbconfig/20221121-210402-ladsgroup.json [21:04:03] syncin' [21:04:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [21:04:08] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [21:04:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [21:04:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T322618)', diff saved to https://phabricator.wikimedia.org/P40341 and previous config saved to /var/cache/conftool/dbconfig/20221121-210434-ladsgroup.json [21:05:00] (03PS2) 10DDesouza: Deploy Research Incentive survey on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851714 (https://phabricator.wikimedia.org/T321930) [21:05:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P40342 and previous config saved to /var/cache/conftool/dbconfig/20221121-210527-ladsgroup.json [21:05:42] (03PS3) 10DDesouza: Deploy Research Incentive survey on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851714 (https://phabricator.wikimedia.org/T321930) [21:05:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T322618)', diff saved to https://phabricator.wikimedia.org/P40343 and previous config saved to /var/cache/conftool/dbconfig/20221121-210547-ladsgroup.json [21:05:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2170.codfw.wmnet with reason: Maintenance [21:05:56] (03CR) 10Samtar: [C: 03+2] 
"start deploying" [extensions/Echo] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/859071 (https://phabricator.wikimedia.org/T323491) (owner: 10Bartosz Dziewoński) [21:06:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2170.codfw.wmnet with reason: Maintenance [21:06:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3312 (T322618)', diff saved to https://phabricator.wikimedia.org/P40344 and previous config saved to /var/cache/conftool/dbconfig/20221121-210609-ladsgroup.json [21:08:20] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:859125|Deploy Research Incentive survey on swwiki (T321252)]] (duration: 05m 32s) [21:08:25] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [21:08:25] T321252: Deploy Research Incentive Survey on Swahili Wikipedia - https://phabricator.wikimedia.org/T321252 [21:08:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T322618)', diff saved to https://phabricator.wikimedia.org/P40345 and previous config saved to /var/cache/conftool/dbconfig/20221121-210828-ladsgroup.json [21:08:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T322618)', diff saved to https://phabricator.wikimedia.org/P40346 and previous config saved to /var/cache/conftool/dbconfig/20221121-210828-ladsgroup.json [21:09:17] danisztls: that's live :) [21:09:29] (nb. just looking at mediawiki-errors a mo....) 
[21:09:47] SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (Papaul) @ottomata this SSD looks like the first disk /dev/sda below is what I have ` Virtual Disk 238: RAID1, 446.625GB, Ready Virtual Disk 239: RAID10, 21.829TB, Ready ` [21:10:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T322618)', diff saved to https://phabricator.wikimedia.org/P40347 and previous config saved to /var/cache/conftool/dbconfig/20221121-211008-ladsgroup.json [21:10:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance [21:10:15] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [21:10:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance [21:10:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance [21:10:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance [21:10:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T322618)', diff saved to https://phabricator.wikimedia.org/P40348 and previous config saved to /var/cache/conftool/dbconfig/20221121-211033-ladsgroup.json [21:10:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [21:10:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [21:11:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T322618)', diff saved to https://phabricator.wikimedia.org/P40349 and previous config saved to 
/var/cache/conftool/dbconfig/20221121-211105-ladsgroup.json [21:13:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T322618)', diff saved to https://phabricator.wikimedia.org/P40350 and previous config saved to /var/cache/conftool/dbconfig/20221121-211316-ladsgroup.json [21:13:36] MatmaRex: moving to 859071 now, it's almost merged [21:18:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T322618)', diff saved to https://phabricator.wikimedia.org/P40351 and previous config saved to /var/cache/conftool/dbconfig/20221121-211823-ladsgroup.json [21:18:29] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [21:19:08] (03Merged) 10jenkins-bot: Fix no-JS Special:Notifications only displaying one notification per day [extensions/Echo] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/859071 (https://phabricator.wikimedia.org/T323491) (owner: 10Bartosz Dziewoński) [21:19:20] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/Echo] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/859071 (https://phabricator.wikimedia.org/T323491) (owner: 10Bartosz Dziewoński) [21:19:33] !log samtar@deploy1002 Started scap: Backport for [[gerrit:859071|Fix no-JS Special:Notifications only displaying one notification per day (T323491)]] [21:19:38] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@00e5387]: incoming_links: Rename wiki to wikiid [21:19:39] T323491: No-JavaScript version of Special:Notifications only displays one notification per day - https://phabricator.wikimedia.org/T323491 [21:19:45] (03PS1) 10Stang: logos: Fix missing parts in validate() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859138 [21:19:54] !log samtar@deploy1002 samtar and matmarex: Backport for [[gerrit:859071|Fix no-JS Special:Notifications only displaying one notification per 
day (T323491)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [21:19:59] MatmaRex: live on mwdebug, can you test? [21:20:12] yeah [21:20:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T323214)', diff saved to https://phabricator.wikimedia.org/P40352 and previous config saved to /var/cache/conftool/dbconfig/20221121-212033-ladsgroup.json [21:20:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance [21:20:39] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [21:20:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance [21:20:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T323214)', diff saved to https://phabricator.wikimedia.org/P40353 and previous config saved to /var/cache/conftool/dbconfig/20221121-212055-ladsgroup.json [21:20:56] TheresNoTime: seems good [21:21:02] cool, syncin' [21:21:29] (03PS2) 10Stang: logos: Fix missing parts in validate() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859138 [21:21:51] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@00e5387]: incoming_links: Rename wiki to wikiid (duration: 02m 12s) [21:23:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P40354 and previous config saved to /var/cache/conftool/dbconfig/20221121-212334-ladsgroup.json [21:23:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P40355 and previous config saved to /var/cache/conftool/dbconfig/20221121-212335-ladsgroup.json [21:24:13] Hi TheresNoTime, I add two more no-op 
patches [21:24:15] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dbprov1004.mgmt.eqiad.wmnet with reboot policy FORCED [21:24:29] cirno: okay :D [21:25:19] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:859071|Fix no-JS Special:Notifications only displaying one notification per day (T323491)]] (duration: 05m 45s) [21:25:24] T323491: No-JavaScript version of Special:Notifications only displays one notification per day - https://phabricator.wikimedia.org/T323491 [21:25:26] MatmaRex: live :) [21:25:27] (I thought they don't need a whole scap, a simple +2 is all the thing [21:25:30] thanks TheresNoTime [21:25:49] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858715 (owner: 10Stang) [21:26:29] (03Merged) 10jenkins-bot: Fix typo in tests/LoggingTest.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858715 (owner: 10Stang) [21:26:46] !log samtar@deploy1002 Started scap: Backport for [[gerrit:858715|Fix typo in tests/LoggingTest.php]] [21:27:06] !log samtar@deploy1002 samtar and stang: Backport for [[gerrit:858715|Fix typo in tests/LoggingTest.php]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [21:27:29] cirno: hm, I suppose you're right for tests o.o [21:28:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P40356 and previous config saved to /var/cache/conftool/dbconfig/20221121-212822-ladsgroup.json [21:29:16] (03CR) 10Samtar: [C: 03+2] logos: Fix missing parts in validate() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859138 (owner: 10Stang) [21:29:41] cirno: well one got scap'd, the other I've just +2'd [21:29:58] (03Merged) 10jenkins-bot: logos: Fix missing parts in validate() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859138 (owner: 10Stang) 
[21:30:11] ok [21:31:20] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:858715|Fix typo in tests/LoggingTest.php]] (duration: 04m 33s) [21:31:40] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov1004.mgmt.eqiad.wmnet with reboot policy FORCED [21:31:46] * TheresNoTime will be around for a bit if there's any other patches [21:33:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P40357 and previous config saved to /var/cache/conftool/dbconfig/20221121-213330-ladsgroup.json [21:35:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dbprov1004.mgmt.eqiad.wmnet with reboot policy FORCED [21:37:25] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov1004.mgmt.eqiad.wmnet with reboot policy FORCED [21:38:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P40358 and previous config saved to /var/cache/conftool/dbconfig/20221121-213841-ladsgroup.json [21:38:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P40359 and previous config saved to /var/cache/conftool/dbconfig/20221121-213841-ladsgroup.json [21:42:11] !log close UTC late backport window [21:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P40360 and previous config saved to /var/cache/conftool/dbconfig/20221121-214329-ladsgroup.json [21:48:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P40361 and previous config saved to 
/var/cache/conftool/dbconfig/20221121-214836-ladsgroup.json [21:53:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T322618)', diff saved to https://phabricator.wikimedia.org/P40362 and previous config saved to /var/cache/conftool/dbconfig/20221121-215347-ladsgroup.json [21:53:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T322618)', diff saved to https://phabricator.wikimedia.org/P40363 and previous config saved to /var/cache/conftool/dbconfig/20221121-215348-ladsgroup.json [21:53:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2175.codfw.wmnet with reason: Maintenance [21:53:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [21:53:53] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [21:54:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2175.codfw.wmnet with reason: Maintenance [21:54:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [21:54:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2175 (T322618)', diff saved to https://phabricator.wikimedia.org/P40364 and previous config saved to /var/cache/conftool/dbconfig/20221121-215409-ladsgroup.json [21:54:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T322618)', diff saved to https://phabricator.wikimedia.org/P40365 and previous config saved to /var/cache/conftool/dbconfig/20221121-215409-ladsgroup.json [21:56:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T322618)', diff saved to https://phabricator.wikimedia.org/P40366 and previous config saved to /var/cache/conftool/dbconfig/20221121-215627-ladsgroup.json [21:57:10] (03PS14) 
10Dzahn: dumps/distribution: add more data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852260 [21:58:00] (03CR) 10Dzahn: "I am changing the types for these 3 lists of hosts to "String". An array of Stdlib::Host seemed right because these are clearly lists of h" [puppet] - 10https://gerrit.wikimedia.org/r/852260 (owner: 10Dzahn) [21:58:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T322618)', diff saved to https://phabricator.wikimedia.org/P40367 and previous config saved to /var/cache/conftool/dbconfig/20221121-215803-ladsgroup.json [21:58:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T322618)', diff saved to https://phabricator.wikimedia.org/P40368 and previous config saved to /var/cache/conftool/dbconfig/20221121-215835-ladsgroup.json [21:58:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [21:58:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [21:58:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T322618)', diff saved to https://phabricator.wikimedia.org/P40369 and previous config saved to /var/cache/conftool/dbconfig/20221121-215857-ladsgroup.json [21:59:03] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [22:00:04] Reedy, sbassett, Maryum, and manfredi: How many deployers does it take to do Weekly Security deployment window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221121T2200). 
[22:01:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T322618)', diff saved to https://phabricator.wikimedia.org/P40370 and previous config saved to /var/cache/conftool/dbconfig/20221121-220107-ladsgroup.json [22:03:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T322618)', diff saved to https://phabricator.wikimedia.org/P40371 and previous config saved to /var/cache/conftool/dbconfig/20221121-220343-ladsgroup.json [22:03:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance [22:04:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance [22:04:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T322618)', diff saved to https://phabricator.wikimedia.org/P40372 and previous config saved to /var/cache/conftool/dbconfig/20221121-220415-ladsgroup.json [22:04:21] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [22:08:40] (NodeTextfileStale) firing: Stale textfile for cp3060:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:09:39] (NodeTextfileStale) firing: (3) Stale textfile for cp2038:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:11:06] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:11:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to 
https://phabricator.wikimedia.org/P40373 and previous config saved to /var/cache/conftool/dbconfig/20221121-221134-ladsgroup.json [22:11:52] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:12:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T322618)', diff saved to https://phabricator.wikimedia.org/P40374 and previous config saved to /var/cache/conftool/dbconfig/20221121-221205-ladsgroup.json [22:12:10] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [22:13:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P40375 and previous config saved to /var/cache/conftool/dbconfig/20221121-221310-ladsgroup.json [22:13:40] (NodeTextfileStale) firing: (11) Stale textfile for cp1077:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:14:39] (NodeTextfileStale) firing: (14) Stale textfile for cp1084:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:15:00] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48975 bytes in 0.156 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:15:46] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.425 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:16:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to 
https://phabricator.wikimedia.org/P40376 and previous config saved to /var/cache/conftool/dbconfig/20221121-221614-ladsgroup.json [22:18:40] (NodeTextfileStale) firing: (16) Stale textfile for cp1077:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:19:39] (NodeTextfileStale) firing: (42) Stale textfile for cp1076:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:20:16] (CirrusSearchHighOldGCFrequency) firing: (3) Elasticsearch instance cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [22:21:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T323214)', diff saved to https://phabricator.wikimedia.org/P40377 and previous config saved to /var/cache/conftool/dbconfig/20221121-222118-ladsgroup.json [22:21:22] (03PS4) 10Ottomata: [WIP] flink and flink-kubernetes-operator image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) [22:21:25] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [22:21:27] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on phab1001.eqiad.wmnet with reason: T280597 [22:21:33] T280597: move phabricator to new hardware generation - https://phabricator.wikimedia.org/T280597 [22:21:42] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
phab1001.eqiad.wmnet with reason: T280597 [22:21:55] !log downtiming and disabling phab1001 in preparation for migration to phab1004 (T280597) [22:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:39] !log stopping apache on phabricator machine - maintenance [22:23:40] (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:39] (NodeTextfileStale) firing: (48) Stale textfile for cp1076:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:25:52] (03CR) 10Dzahn: [V: 03+1 C: 03+2] hieradata: switch active Phabricator server to phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/858397 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [22:25:58] (03PS2) 10Dzahn: hieradata: switch active Phabricator server to phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/858397 (https://phabricator.wikimedia.org/T280597) [22:26:07] (03CR) 10Dzahn: [V: 03+2] hieradata: switch active Phabricator server to phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/858397 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [22:26:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to and previous config saved to /var/cache/conftool/dbconfig/20221121-222640-ladsgroup.json [22:27:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to and previous config saved to /var/cache/conftool/dbconfig/20221121-222711-ladsgroup.json [22:28:17] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to and previous config saved to /var/cache/conftool/dbconfig/20221121-222816-ladsgroup.json [22:28:59] (03CR) 10Dzahn: [C: 03+2] phabricator: switch from phab1001 to phab1004, discovery and SPF [dns] - 10https://gerrit.wikimedia.org/r/858409 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [22:29:56] phab down https://downforeveryoneorjustme.com/phabricator.wikimedia.org [22:30:12] AndyRussG: see scrollback; planned maintenance. [22:30:19] (03CR) 10Ryan Kemper: [C: 03+1] cloudelastic: Increase small cluster heap memory from 8G to 10G [puppet] - 10https://gerrit.wikimedia.org/r/859094 (owner: 10Bking) [22:30:21] ah thx sorry brennen :) [22:30:42] (03CR) 10Bking: [C: 03+2] cloudelastic: Increase small cluster heap memory from 8G to 10G [puppet] - 10https://gerrit.wikimedia.org/r/859094 (owner: 10Bking) [22:31:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to and previous config saved to /var/cache/conftool/dbconfig/20221121-223121-ladsgroup.json [22:31:25] (03PS3) 10Dzahn: phabricator: switch from phab1001 to phab1004, discovery and SPF [dns] - 10https://gerrit.wikimedia.org/r/858409 (https://phabricator.wikimedia.org/T280597) [22:31:52] (03CR) 10Dzahn: [C: 03+2] phabricator: switch from phab1001 to phab1004, discovery and SPF [dns] - 10https://gerrit.wikimedia.org/r/858409 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [22:33:15] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - bking@cumin1001 - T319020 [22:36:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to and previous config saved to /var/cache/conftool/dbconfig/20221121-223625-ladsgroup.json [22:37:42] * RhinosF1 sees phab down is expected [22:38:49] !log 
brennen@deploy1002 Started deploy [phabricator/deployment@f68dc24]: deploy config changes for phab1004 switch [22:39:46] !log brennen@deploy1002 Finished deploy [phabricator/deployment@f68dc24]: deploy config changes for phab1004 switch (duration: 00m 57s) [22:41:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T322618)', diff saved to https://phabricator.wikimedia.org/P40378 and previous config saved to /var/cache/conftool/dbconfig/20221121-224146-ladsgroup.json [22:41:54] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [22:42:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P40379 and previous config saved to /var/cache/conftool/dbconfig/20221121-224218-ladsgroup.json [22:43:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T322618)', diff saved to https://phabricator.wikimedia.org/P40380 and previous config saved to /var/cache/conftool/dbconfig/20221121-224322-ladsgroup.json [22:43:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1189.eqiad.wmnet with reason: Maintenance [22:43:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1189.eqiad.wmnet with reason: Maintenance [22:43:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T322618)', diff saved to https://phabricator.wikimedia.org/P40381 and previous config saved to /var/cache/conftool/dbconfig/20221121-224355-ladsgroup.json [22:46:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T322618)', diff saved to https://phabricator.wikimedia.org/P40382 and previous config saved to /var/cache/conftool/dbconfig/20221121-224627-ladsgroup.json [22:46:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1188.eqiad.wmnet 
with reason: Maintenance [22:46:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1188.eqiad.wmnet with reason: Maintenance [22:46:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1188 (T322618)', diff saved to https://phabricator.wikimedia.org/P40383 and previous config saved to /var/cache/conftool/dbconfig/20221121-224648-ladsgroup.json [22:46:56] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [22:47:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T322618)', diff saved to https://phabricator.wikimedia.org/P40384 and previous config saved to /var/cache/conftool/dbconfig/20221121-224749-ladsgroup.json [22:48:40] (03PS1) 10Jdlrobson: Update TOC to use PinnableHeader [skins/Vector] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/859076 (https://phabricator.wikimedia.org/T317897) [22:51:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T322618)', diff saved to https://phabricator.wikimedia.org/P40385 and previous config saved to /var/cache/conftool/dbconfig/20221121-225059-ladsgroup.json [22:51:01] (03CR) 10CI reject: [V: 04-1] Update TOC to use PinnableHeader [skins/Vector] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/859076 (https://phabricator.wikimedia.org/T317897) (owner: 10Jdlrobson) [22:51:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P40386 and previous config saved to /var/cache/conftool/dbconfig/20221121-225131-ladsgroup.json [22:55:04] (03PS1) 10Dzahn: phabricator: set mysql master port for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/859145 [22:57:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T322618)', diff saved to https://phabricator.wikimedia.org/P40387 and previous config saved 
to /var/cache/conftool/dbconfig/20221121-225724-ladsgroup.json [22:57:32] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [22:58:10] PROBLEM - Check systemd state on phab1004 is CRITICAL: CRITICAL - degraded: The following units failed: phd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:59:18] ^ resolved [23:00:10] RECOVERY - Check systemd state on phab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:02:33] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - bking@cumin1001 - T319020 [23:02:39] T319020: Reset to upstream java GC options and remove redundant JVM options - https://phabricator.wikimedia.org/T319020 [23:02:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P40388 and previous config saved to /var/cache/conftool/dbconfig/20221121-230256-ladsgroup.json [23:06:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P40389 and previous config saved to /var/cache/conftool/dbconfig/20221121-230606-ladsgroup.json [23:06:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T323214)', diff saved to https://phabricator.wikimedia.org/P40390 and previous config saved to /var/cache/conftool/dbconfig/20221121-230638-ladsgroup.json [23:06:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [23:06:44] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [23:06:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [23:07:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T323214)', diff saved to https://phabricator.wikimedia.org/P40391 and previous config saved to /var/cache/conftool/dbconfig/20221121-230659-ladsgroup.json [23:13:12] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:13:19] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 89, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:17:21] (03PS1) 10Dzahn: phabricator: enable vcs on phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/859147 [23:18:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P40392 and previous config saved to /var/cache/conftool/dbconfig/20221121-231803-ladsgroup.json [23:18:26] (03PS2) 10Dzahn: phabricator: enable vcs on phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/859147 [23:21:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P40393 and previous config saved to /var/cache/conftool/dbconfig/20221121-232112-ladsgroup.json [23:21:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T323214)', diff saved to https://phabricator.wikimedia.org/P40394 and previous config saved to /var/cache/conftool/dbconfig/20221121-232119-ladsgroup.json [23:21:25] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [23:30:01] (CirrusSearchHighOldGCFrequency) firing: (3) Elasticsearch instance 
cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [23:30:14] (KubernetesAPILatency) firing: (12) High Kubernetes API latency (LIST authorizationpolicies) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:33:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T322618)', diff saved to https://phabricator.wikimedia.org/P40395 and previous config saved to /var/cache/conftool/dbconfig/20221121-233309-ladsgroup.json [23:33:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1198.eqiad.wmnet with reason: Maintenance [23:33:16] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [23:33:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1198.eqiad.wmnet with reason: Maintenance [23:33:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1198 (T322618)', diff saved to https://phabricator.wikimedia.org/P40396 and previous config saved to /var/cache/conftool/dbconfig/20221121-233331-ladsgroup.json [23:35:01] (CirrusSearchHighOldGCFrequency) resolved: (3) Elasticsearch instance cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [23:36:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T322618)', diff saved to https://phabricator.wikimedia.org/P40397 and previous config saved to 
/var/cache/conftool/dbconfig/20221121-233619-ladsgroup.json [23:36:21] (03PS3) 10Dzahn: phabricator: enable vcs on phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/859147 (https://phabricator.wikimedia.org/T280597) [23:36:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1197.eqiad.wmnet with reason: Maintenance [23:36:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P40398 and previous config saved to /var/cache/conftool/dbconfig/20221121-233625-ladsgroup.json [23:36:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1197.eqiad.wmnet with reason: Maintenance [23:36:39] (03CR) 10Dzahn: [C: 03+2] phabricator: enable vcs on phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/859147 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [23:36:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1197 (T322618)', diff saved to https://phabricator.wikimedia.org/P40399 and previous config saved to /var/cache/conftool/dbconfig/20221121-233640-ladsgroup.json [23:37:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T322618)', diff saved to https://phabricator.wikimedia.org/P40400 and previous config saved to /var/cache/conftool/dbconfig/20221121-233726-ladsgroup.json [23:38:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T322618)', diff saved to https://phabricator.wikimedia.org/P40401 and previous config saved to /var/cache/conftool/dbconfig/20221121-233851-ladsgroup.json [23:38:57] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [23:51:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P40402 and previous config saved to 
/var/cache/conftool/dbconfig/20221121-235132-ladsgroup.json [23:52:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P40403 and previous config saved to /var/cache/conftool/dbconfig/20221121-235232-ladsgroup.json [23:53:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P40404 and previous config saved to /var/cache/conftool/dbconfig/20221121-235357-ladsgroup.json