[00:00:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:02:36] <icinga-wm>	 RECOVERY - confd service on an-worker1145 is OK: OK - confd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:05:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:07:14] <icinga-wm>	 PROBLEM - confd service on an-worker1145 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:19:32] <icinga-wm>	 PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:30:18] <icinga-wm>	 RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:35:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:38:04] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/935879
[00:38:10] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/935879 (owner: TrainBranchBot)
[00:43:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:48:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:54:51] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/935879 (owner: TrainBranchBot)
[00:55:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:00:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:00:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:01:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:05:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:06:11] <logmsgbot>	 !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply
[01:06:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:06:24] <logmsgbot>	 !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply
[01:11:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:16:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:20:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:25:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:29:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:42:58] <icinga-wm>	 PROBLEM - puppet last run on an-worker1145 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:43:21] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:00:06] <icinga-wm>	 PROBLEM - SSH on bast6002 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:01:36] <icinga-wm>	 RECOVERY - SSH on bast6002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:08:21] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:28:21] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:29:19] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:37:15] <wikibugs>	 SRE, Observability-Metrics: Tooling for end-of-quarter SLO reporting - https://phabricator.wikimedia.org/T290924 (lmata)
[02:41:32] <wikibugs>	 (PS1) RLazarus: opentelemetry-collector: Vendor 0.62.2 [deployment-charts] - https://gerrit.wikimedia.org/r/936388 (https://phabricator.wikimedia.org/T324117)
[02:41:34] <wikibugs>	 (PS1) RLazarus: opentelemetry-collector: Fix image and entry point [deployment-charts] - https://gerrit.wikimedia.org/r/936389 (https://phabricator.wikimedia.org/T320564)
[02:49:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:53:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:16:12] <wikibugs>	 (Restored) Anzx: Enable tabs for non logged in users on knwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/932284 (https://phabricator.wikimedia.org/T340276) (owner: Anzx)
[03:18:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:22:27] <wikibugs>	 (PS2) Anzx: Enable tabs for non logged in users on knwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/932284 (https://phabricator.wikimedia.org/T340276)
[03:24:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:44:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:58:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:08:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:24:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:34:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:38:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:13:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:37:16] <wikibugs>	 ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T341437 (phaultfinder)
[05:37:18] <wikibugs>	 ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T341438 (phaultfinder)
[05:42:15] <wikibugs>	 ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T341438 (phaultfinder)
[05:42:17] <wikibugs>	 ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T341438 (phaultfinder)
[05:43:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:48:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:06:05] <wikibugs>	 (PS1) KartikMistry: Update cxserver to 2023-07-06-065912-production [deployment-charts] - https://gerrit.wikimedia.org/r/936391 (https://phabricator.wikimedia.org/T340989)
[06:10:17] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mw-debug-repl: improve UX (2 comments) [puppet] - https://gerrit.wikimedia.org/r/936280 (https://phabricator.wikimedia.org/T341197) (owner: Giuseppe Lavagetto)
[06:16:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:21:12] <wikibugs>	 (CR) Elukey: [C: -1] "need some work" [puppet] - https://gerrit.wikimedia.org/r/936304 (https://phabricator.wikimedia.org/T338357) (owner: Elukey)
[06:21:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:22:17] <wikibugs>	 (PS5) MdsShakil: Deploy action blocks on bnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/934614 (https://phabricator.wikimedia.org/T340904)
[06:26:58] <wikibugs>	 (PS1) Giuseppe Lavagetto: mediawiki::repl: allow execution from everyone [puppet] - https://gerrit.wikimedia.org/r/936394 (https://phabricator.wikimedia.org/T341197)
[06:30:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:33:21] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:37:08] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mediawiki::repl: allow execution from everyone [puppet] - https://gerrit.wikimedia.org/r/936394 (https://phabricator.wikimedia.org/T341197) (owner: Giuseppe Lavagetto)
[06:41:59] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/836775 (owner: Muehlenhoff)
[06:43:20] <godog>	 !log add 100G to prometheus/k8s in codfw
[06:43:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1026.eqiad.wmnet
[06:50:24] <wikibugs>	 SRE, Infrastructure-Foundations, fundraising-tech-ops: Create a new group dns-admins - https://phabricator.wikimedia.org/T341440 (MoritzMuehlenhoff)
[06:50:46] <wikibugs>	 SRE, Infrastructure-Foundations, fundraising-tech-ops: Create a new group dns-admins - https://phabricator.wikimedia.org/T341440 (MoritzMuehlenhoff) p:Triage→Medium a:MoritzMuehlenhoff
[06:55:03] <wikibugs>	 (PS2) KartikMistry: Update cxserver to 2023-07-10-065135-production [deployment-charts] - https://gerrit.wikimedia.org/r/936391 (https://phabricator.wikimedia.org/T337719)
[06:55:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1026.eqiad.wmnet
[06:56:30] <wikibugs>	 (PS2) JMeybohm: Add AppArmor configuration for the deployed function-evaluator service. [deployment-charts] - https://gerrit.wikimedia.org/r/936316 (https://phabricator.wikimedia.org/T326785) (owner: Cory Massaro)
[06:57:19] <wikibugs>	 (CR) CI reject: [V: -1] Add AppArmor configuration for the deployed function-evaluator service. [deployment-charts] - https://gerrit.wikimedia.org/r/936316 (https://phabricator.wikimedia.org/T326785) (owner: Cory Massaro)
[06:58:04] <icinga-wm>	 PROBLEM - Host aux-k8s-etcd1002 is DOWN: PING CRITICAL - Packet loss = 100%
[07:00:06] <jouncebot>	 Amir1, Urbanecm, and taavi: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230710T0700).
[07:00:06] <jouncebot>	 Func and MdsShakil: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:10] <MdsShakil>	 Hi :)
[07:00:13] <Func>	 o/
[07:00:26] <icinga-wm>	 RECOVERY - Host aux-k8s-etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms
[07:01:06] <wikibugs>	 (CR) JMeybohm: "As for the actual profile: That needs to be shipped via puppet IIRC - I don't think that has been implemented yet." [deployment-charts] - https://gerrit.wikimedia.org/r/936316 (https://phabricator.wikimedia.org/T326785) (owner: Cory Massaro)
[07:01:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1026.eqiad.wmnet
[07:01:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1026.eqiad.wmnet
[07:02:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1027.eqiad.wmnet
[07:02:57] <wikibugs>	 (PS1) Vgutierrez: trafficserver: add gateway routing script, route device-analytics on cp2037 [puppet] - https://gerrit.wikimedia.org/r/936509 (https://phabricator.wikimedia.org/T320967)
[07:04:50] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:05:00] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:05:00] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42357/console" [puppet] - https://gerrit.wikimedia.org/r/936509 (https://phabricator.wikimedia.org/T320967) (owner: Vgutierrez)
[07:05:06] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:05:16] <icinga-wm>	 PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:05:40] <icinga-wm>	 PROBLEM - BFD status on cr3-knams is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:05:46] <icinga-wm>	 PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 2/4 UP : OSPFv3: 2/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:05:58] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:09:38] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:09:44] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:09:54] <icinga-wm>	 RECOVERY - BFD status on cr2-eqdfw is OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:10:20] <icinga-wm>	 RECOVERY - BFD status on cr3-knams is OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:10:26] <icinga-wm>	 RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:10:38] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:11:10] <MdsShakil>	 Zzzzzzzz
[07:15:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:15:38] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:18:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:19:13] <wikibugs>	 SRE, Traffic, envoy, serviceops, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (JMeybohm)
[07:20:16] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: sync
[07:20:27] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: sync
[07:21:08] <hashar>	 !log deploy1002: removed empty untracked directory from MediaWiki staging area: `rmdir /srv/mediawiki-staging/wmf-config/scap/log/ && rmdir /srv/mediawiki-staging/wmf-config/scap/` | T341292
[07:21:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:11] <stashbot>	 T341292: scap backport should remove code for removed submodules - https://phabricator.wikimedia.org/T341292
[07:21:15] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
[07:21:24] <hashar>	 not synced cause they are empty directories not holding any code
[07:21:37] <hashar>	 left over from a 2016 deploy of some sort
[07:22:02] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
[07:22:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1027.eqiad.wmnet
[07:22:38] <wikibugs>	 SRE, Content-Transform-Team-WIP, Mobile-Content-Service, RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (Jgiannelos) I think for wikiwand we only allow requests based on referer should we add or replace the rule with the user agent?
[07:23:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:26:35] <Func>	 hashar: Hi, could you help with the backport window?
[07:27:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:27:30] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "Looks good, I'll merge" [puppet] - https://gerrit.wikimedia.org/r/930653 (https://phabricator.wikimedia.org/T339251) (owner: Hashar)
[07:27:46] <icinga-wm>	 PROBLEM - SSH on bast6002 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:28:55] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (MoritzMuehlenhoff)
[07:29:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1027.eqiad.wmnet
[07:29:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1027.eqiad.wmnet
[07:29:20] <icinga-wm>	 RECOVERY - SSH on bast6002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:29:49] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync
[07:30:14] <moritzm>	 !log installing libgstreamer-plugins-base1.0-0 security updates
[07:30:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:36] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync
[07:30:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1029.eqiad.wmnet
[07:32:23] <hashar>	 Func: yes!
[07:32:34] <hashar>	 jouncebot: now
[07:32:34] <jouncebot>	 For the next 0 hour(s) and 27 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230710T0700)
[07:32:50] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm!" [deployment-charts] - https://gerrit.wikimedia.org/r/930886 (https://phabricator.wikimedia.org/T300171) (owner: Dzahn)
[07:32:56] <hashar>	 I guess nobody is running it, so I will
[07:33:05] <Func>	 thanks
[07:33:15] <hashar>	 sorry Func and MdsShakil, I usually don't run the backport window and thus hadn't thought about checking the patches this morning
[07:33:40] * hashar grabs coffee number N+1
[07:33:58] <icinga-wm>	 PROBLEM - SSH on bast6002 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:35:32] <icinga-wm>	 RECOVERY - SSH on bast6002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:35:40] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by hashar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/935876 (https://phabricator.wikimedia.org/T341407) (owner: Func)
[07:35:46] <hashar>	 Func: doing it :)
[07:36:10] <wikibugs>	 (PS1) Elukey: services: allow kafka batches in EventGate's main producer [deployment-charts] - https://gerrit.wikimedia.org/r/936515 (https://phabricator.wikimedia.org/T338357)
[07:36:24] <wikibugs>	 (Merged) jenkins-bot: thwiki: Update logos from commons [mediawiki-config] - https://gerrit.wikimedia.org/r/935876 (https://phabricator.wikimedia.org/T341407) (owner: Func)
[07:36:26] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +2 C: +2] Add bookworm to the local build configurations [docker-images/production-images] - https://gerrit.wikimedia.org/r/935693 (https://phabricator.wikimedia.org/T341115) (owner: Giuseppe Lavagetto)
[07:36:51] <logmsgbot>	 !log hashar@deploy1002 Started scap: Backport for [[gerrit:935876|thwiki: Update logos from commons (T341407)]]
[07:36:54] <stashbot>	 T341407: Update th.wikipedia.org logo - https://phabricator.wikimedia.org/T341407
[07:37:21] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +2 C: +2] images: convert use of seed_image into use of image_tag [docker-images/production-images] - https://gerrit.wikimedia.org/r/935694 (https://phabricator.wikimedia.org/T341115) (owner: Giuseppe Lavagetto)
[07:38:48] <wikibugs>	 (CR) Elukey: "The 5ms setting is the default for node-rdkafka, basically what's suggested by upstream. I should improve things on the kafka main eqiad s" [deployment-charts] - https://gerrit.wikimedia.org/r/936515 (https://phabricator.wikimedia.org/T338357) (owner: Elukey)
[07:39:40] <wikibugs>	 (CR) Elukey: [C: +1] istio: convert use of seed_image into use of image_tag [docker-images/production-images] - https://gerrit.wikimedia.org/r/935695 (https://phabricator.wikimedia.org/T341115) (owner: Giuseppe Lavagetto)
[07:39:54] <wikibugs>	 (PS1) Urbanecm: Growth: Increase mentorship percentage to 25% on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/936639 (https://phabricator.wikimedia.org/T341399)
[07:40:42] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +2 C: +2] istio: convert use of seed_image into use of image_tag [docker-images/production-images] - https://gerrit.wikimedia.org/r/935695 (https://phabricator.wikimedia.org/T341115) (owner: Giuseppe Lavagetto)
[07:41:41] <hashar>	 well it is pushing a 5 GBytes docker image at 5MB/s so that is taking a bit of time
[07:41:43] <wikibugs>	 (CR) JMeybohm: [C: +1] cert-manager: convert use of seed_image to image_tag [docker-images/production-images] - https://gerrit.wikimedia.org/r/935696 (https://phabricator.wikimedia.org/T341115) (owner: Giuseppe Lavagetto)
[07:42:51] <wikibugs>	 (PS1) Muehlenhoff: Add library hints for gst-plugins-base1.0 [puppet] - https://gerrit.wikimedia.org/r/936649
[07:45:57] <logmsgbot>	 !log hashar@deploy1002 func and hashar: Backport for [[gerrit:935876|thwiki: Update logos from commons (T341407)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[07:46:00] <stashbot>	 T341407: Update th.wikipedia.org logo - https://phabricator.wikimedia.org/T341407
[07:46:19] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add library hints for gst-plugins-base1.0 [puppet] - https://gerrit.wikimedia.org/r/936649 (owner: Muehlenhoff)
[07:46:57] <Func>	 hashar: confirmed fixed
[07:47:12] <hashar>	 Func: thank you for the confirmation! :]
[07:47:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1029.eqiad.wmnet
[07:53:38] <hashar>	 MdsShakil: I am deploying your change for "Deploy action blocks on bnwiki"
[07:54:00] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by hashar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/934614 (https://phabricator.wikimedia.org/T340904) (owner: MdsShakil)
[07:54:02] <MdsShakil>	 I am around :)
[07:54:11] <wikibugs>	 (CR) Jaime Nuche: [C: +1] contint: replace Apache 2.2 access control syntax for Jenkins proxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/932440 (https://phabricator.wikimedia.org/T338071) (owner: Dzahn)
[07:54:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1029.eqiad.wmnet
[07:54:43] <wikibugs>	 (Merged) jenkins-bot: Deploy action blocks on bnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/934614 (https://phabricator.wikimedia.org/T340904) (owner: MdsShakil)
[07:54:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1029.eqiad.wmnet
[07:54:46] <hashar>	 ah great
[07:55:05] <hashar>	 well I have some issue with the deployment tool unfortunately
[07:55:16] <hashar>	 it thinks the previous change is still being deployed :]
[07:56:20] <hashar>	 Func: I forgot scap was waiting for the test on mwdebug, so I am now rolling the thai logo update to everything
[07:56:46] <hashar>	 I have too many windows
[07:58:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1030.eqiad.wmnet
[08:00:42] <moritzm>	 !log installing flask security updates on bullseye
[08:00:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:23] <logmsgbot>	 !log hashar@deploy1002 Finished scap: Backport for [[gerrit:935876|thwiki: Update logos from commons (T341407)]] (duration: 25m 32s)
[08:02:27] <stashbot>	 T341407: Update th.wikipedia.org logo - https://phabricator.wikimedia.org/T341407
[08:02:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:02:44] <hashar>	 MdsShakil: finally doing your change :)
[08:02:59] <logmsgbot>	 !log hashar@deploy1002 Started scap: Backport for [[gerrit:934614|Deploy action blocks on bnwiki (T340904)]]
[08:03:02] <stashbot>	 T340904: Deploy action blocks on bnwiki - https://phabricator.wikimedia.org/T340904
[08:03:35] <hashar>	 MdsShakil: which I guess can be deployed entirely, or do you want to test it?
[08:03:45] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Spicerack: Add a cookbook to safely deploy puppet changes - https://phabricator.wikimedia.org/T341442 (Joe)
[08:04:15] <MdsShakil>	 hashar: Your preference :)
[08:04:21] <logmsgbot>	 !log hashar@deploy1002 hashar and mdsshakil: Backport for [[gerrit:934614|Deploy action blocks on bnwiki (T340904)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[08:04:25] <moritzm>	 !log installing c-ares security updates on buster
[08:04:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:51] <hashar>	 MdsShakil: it is on mwdebug servers if you wanna test :]
[08:05:02] <hashar>	 given I don't know anything about that feature
[08:05:14] <MdsShakil>	 Looks good 
[08:05:23] <hashar>	 let's go!
[08:06:55] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf LDAP group for Urbanecm - https://phabricator.wikimedia.org/T341443 (Urbanecm_WMF)
[08:07:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:07:45] <wikibugs>	 (PS1) Clément Goubert: Revert "trafficserver: Send testwiki traffic to mw-on-k8s" [puppet] - https://gerrit.wikimedia.org/r/936417
[08:07:59] <wikibugs>	 (CR) CI reject: [V: -1] Revert "trafficserver: Send testwiki traffic to mw-on-k8s" [puppet] - https://gerrit.wikimedia.org/r/936417 (owner: Clément Goubert)
[08:08:04] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf LDAP group for Urbanecm - https://phabricator.wikimedia.org/T341443 (Urbanecm_WMF)
[08:09:15] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf LDAP group for Urbanecm - https://phabricator.wikimedia.org/T341443 (RhinosF1) I'm pretty sure to be in 'wmf' a @wikimedia.org email needs to be linked.  Looks like your ldap account is @wikimedia.cz
[08:09:36] <wikibugs>	 (PS2) Clément Goubert: Revert "trafficserver: Send testwiki traffic to mw-on-k8s" [puppet] - https://gerrit.wikimedia.org/r/936417
[08:10:14] <kart_>	 hashar: Let me know when you are done with the backport. I plan to deploy cxserver.
[08:10:24] <hashar>	 it is almost complete
[08:10:37] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Spicerack: Add a cookbook to safely deploy puppet changes - https://phabricator.wikimedia.org/T341442 (Joe) Things that I don't think we have to create such a cookbook:  * programmatic way to merge changes in gerrit. I'm not sure if this could have some...
[08:11:15] <logmsgbot>	 !log hashar@deploy1002 Finished scap: Backport for [[gerrit:934614|Deploy action blocks on bnwiki (T340904)]] (duration: 08m 15s)
[08:11:19] <stashbot>	 T340904: Deploy action blocks on bnwiki - https://phabricator.wikimedia.org/T340904
[08:11:22] <hashar>	 !log UTC morning backport window completed.
[08:11:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:26] <hashar>	 kart_: all yours :-]
[08:15:06] <kart_>	 hashar: Thanks.
[08:15:09] <wikibugs>	 (CR) Hashar: "recheck after https://gerrit.wikimedia.org/r/c/integration/config/+/936296/" [debs/pdns-recursor] - https://gerrit.wikimedia.org/r/936297 (owner: Ssingh)
[08:15:56] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf LDAP group for Urbanecm - https://phabricator.wikimedia.org/T341443 (Urbanecm_WMF) >>! In T341443#9000250, @RhinosF1 wrote: > I'm pretty sure to be in 'wmf' a @wikimedia.org email needs to be linked.  Done.
[08:16:10] <kart_>	 There is an undeployed change "mesh.configuration: Update all charts to 1.3.2" in cxserver (and probably other services also). Is it OK to go ahead with this? _joe_ akosiaris?
[08:16:32] <claime>	 jayme: ^
[08:16:40] <_joe_>	 kart_: jayme is who you want to ask :D
[08:16:51] <jayme>	 kart_: yes please!
[08:17:10] <kart_>	 Cool. Thanks!
[08:17:44] <jayme>	 "should be the last one for some time" 😇
[08:18:15] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] "minor nit, otherwise LGTM" [puppet] - https://gerrit.wikimedia.org/r/936417 (owner: Clément Goubert)
[08:19:14] <_joe_>	 jayme: lol
[08:19:20] <wikibugs>	 (PS3) Clément Goubert: Revert "trafficserver: Send testwiki traffic to mw-on-k8s" [puppet] - https://gerrit.wikimedia.org/r/936417
[08:19:32] <claime>	 "we should be fine and stable now"
[08:19:35] <claime>	 x)
[08:19:54] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] Revert "trafficserver: Send testwiki traffic to mw-on-k8s" [puppet] - https://gerrit.wikimedia.org/r/936417 (owner: Clément Goubert)
[08:20:02] <jayme>	 for this week indeed :-p
[08:20:16] <claime>	 Yes, please keep everything fine and stable this week
[08:20:18] <claime>	 I am on call
[08:20:20] <claime>	 :P
[08:20:36] <_joe_>	 claime: oh then elukey has some surprises for you
[08:20:54] <claime>	 Can he have these surprises tomorrow
[08:20:58] <claime>	 I'm not on call tomorrow
[08:21:00] <claime>	 :D
[08:21:02] <_joe_>	 lol
[08:21:13] <_joe_>	 elukey: please hurry with your changes
[08:21:40] * claime groans at _joe_'s conception of a birthday present
[08:22:33] <wikibugs>	 (CR) Clément Goubert: [C: +2] Revert "trafficserver: Send testwiki traffic to mw-on-k8s" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/936417 (owner: Clément Goubert)
[08:22:57] <wikibugs>	 (CR) KartikMistry: [C: +2] Update cxserver to 2023-07-10-065135-production [deployment-charts] - https://gerrit.wikimedia.org/r/936391 (https://phabricator.wikimedia.org/T337719) (owner: KartikMistry)
[08:23:54] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2023-07-10-065135-production [deployment-charts] - https://gerrit.wikimedia.org/r/936391 (https://phabricator.wikimedia.org/T337719) (owner: KartikMistry)
[08:24:36] <claime>	 !log Running puppet on cp-text hosts - T337489
[08:24:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:39] <stashbot>	 T337489: Run QTE test suite on testwiki on kubernetes - https://phabricator.wikimedia.org/T337489
[08:25:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1030.eqiad.wmnet
[08:26:12] <icinga-wm>	 PROBLEM - Check systemd state on kubestagemaster2002 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:26:39] <elukey>	 claime: o/
[08:26:59] <claime>	 elukey: \o
[08:27:03] <wikibugs>	 (PS1) Btullis: Use an internal schema registry for datahub on staging [deployment-charts] - https://gerrit.wikimedia.org/r/936651 (https://phabricator.wikimedia.org/T329514)
[08:27:14] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[08:27:19] <elukey>	 claime: to be fair you nerd-sniped me into the task, so if I produce code reviews during your on-call shift it is only karma :)
[08:27:35] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[08:27:50] <claime>	 elukey: the changeprop task ?
[08:28:02] <elukey>	 claime: correct yes, you have a code review for eventgate :)
[08:28:16] <claime>	 elukey: https://www.youtube.com/watch?v=hd1ciPnTGKg
[08:29:21] <elukey>	 lol
[08:31:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1030.eqiad.wmnet
[08:31:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1030.eqiad.wmnet
[08:32:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1031.eqiad.wmnet
[08:32:56] <wikibugs>	 (CR) Clément Goubert: [C: +2] Fix CirrusSearchJobQueueLagTooHigh to use histograms [alerts] - https://gerrit.wikimedia.org/r/936070 (owner: Alexandros Kosiaris)
[08:33:42] <icinga-wm>	 PROBLEM - Check systemd state on kubestagemaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:40:25] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[08:41:07] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[08:44:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.aqs.roll-restart-reboot rolling restart_daemons on A:aqs-codfw
[08:45:36] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[08:46:10] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[08:47:42] <kart_>	 !log Updated cxserver to 2023-07-10-065135-production (T337719, T340989)
[08:47:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:46] <stashbot>	 T340989: MinT not working for Bhojpuri in Content & Section Translation - https://phabricator.wikimedia.org/T340989
[08:47:47] <stashbot>	 T337719: CX: Replace calls to the deprecated mobile content REST API - https://phabricator.wikimedia.org/T337719
[08:48:11] <moritzm>	 !log installing libxpm security updates
[08:48:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:13] <Lucas_WMDE>	 I’ll deploy a security patch if that’s alright with everyone
[08:51:21] <Lucas_WMDE>	 going ahead
[08:54:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1031.eqiad.wmnet
[08:55:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.aqs.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:aqs-codfw
[08:55:51] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Clement_Goubert)
[08:56:46] <icinga-wm>	 PROBLEM - Host aux-k8s-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:57:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.aqs.roll-restart-reboot rolling restart_daemons on A:aqs-eqiad
[08:57:52] <icinga-wm>	 PROBLEM - Host kubetcd1005 is DOWN: PING CRITICAL - Packet loss = 100%
[08:58:02] <logmsgbot>	 !log lucaswerkmeister-wmde: Deployed security patch for T340220
[08:59:09] <wikibugs>	 (PS2) Btullis: Use an internal schema registry for datahub on staging [deployment-charts] - https://gerrit.wikimedia.org/r/936651 (https://phabricator.wikimedia.org/T329514)
[08:59:10] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] cloudlb: eqiad: bootstrap hiera data [puppet] - https://gerrit.wikimedia.org/r/936022 (https://phabricator.wikimedia.org/T341200) (owner: Arturo Borrero Gonzalez)
[08:59:13] * Lucas_WMDE done
[08:59:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:00:30] <icinga-wm>	 RECOVERY - Host aux-k8s-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms
[09:00:34] <icinga-wm>	 RECOVERY - Host kubetcd1005 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms
[09:00:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1031.eqiad.wmnet
[09:00:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1031.eqiad.wmnet
[09:01:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1032.eqiad.wmnet
[09:04:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:04:39] <moritzm>	 !log rebalance ganeti clusters in esams/ulsfo/eqsin following reboots
[09:04:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:04] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudlb1001.eqiad.wmnet with OS bullseye
[09:06:17] <wikibugs>	 SRE, ops-eqiad, Patch-For-Review, User-aborrero, cloud-services-team (FY2022/2023-Q4): rename cloudswift1001 as cloudlb1001 - https://phabricator.wikimedia.org/T341200 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudlb1001.eqiad.wmnet...
[09:07:44] <moritzm>	 !log installing cups security updates (libs only)
[09:07:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:46] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm thanks" [puppet] - https://gerrit.wikimedia.org/r/936372 (owner: Majavah)
[09:08:05] <wikibugs>	 (CR) Jbond: [C: +2] mailmap: expand mailmap [puppet] - https://gerrit.wikimedia.org/r/936372 (owner: Majavah)
[09:08:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.aqs.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:aqs-eqiad
[09:10:23] <wikibugs>	 SRE, DC-Ops, Infrastructure-Foundations, netops: Management LAN in eqsin offline due to failure of mr1-eqsin - https://phabricator.wikimedia.org/T341447 (cmooney) p:Triage→High
[09:11:56] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/836775 (owner: Muehlenhoff)
[09:12:01] <moritzm>	 !log restarting mw canaries to pick up libxpm security update
[09:12:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:03] <wikibugs>	 (CR) Btullis: [C: +2] Use an internal schema registry for datahub on staging [deployment-charts] - https://gerrit.wikimedia.org/r/936651 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[09:14:11] <wikibugs>	 (Merged) jenkins-bot: Use an internal schema registry for datahub on staging [deployment-charts] - https://gerrit.wikimedia.org/r/936651 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[09:14:12] <vgutierrez>	 !log depool cp2037 (debugging ATS cacheability issues) - T320967
[09:14:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:16] <stashbot>	 T320967: [AQS 2.0] New Service Request device_analytics - https://phabricator.wikimedia.org/T320967
[09:14:41] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +2] trafficserver: add gateway routing script, route device-analytics on cp2037 [puppet] - https://gerrit.wikimedia.org/r/936509 (https://phabricator.wikimedia.org/T320967) (owner: Vgutierrez)
[09:15:52] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/934371 (owner: Muehlenhoff)
[09:16:31] <wikibugs>	 (PS2) Elukey: profile::kafka: update prometheus config [puppet] - https://gerrit.wikimedia.org/r/936304 (https://phabricator.wikimedia.org/T338357)
[09:17:43] <wikibugs>	 (CR) Elukey: "Tested on kafka-test1006, the metrics are displayed correctly:" [puppet] - https://gerrit.wikimedia.org/r/936304 (https://phabricator.wikimedia.org/T338357) (owner: Elukey)
[09:20:36] <wikibugs>	 SRE, DC-Ops, Infrastructure-Foundations, Traffic, netops: Management LAN in eqsin offline due to failure of mr1-eqsin - https://phabricator.wikimedia.org/T341447 (Volans) Adding #traffic for awareness.
[09:20:41] <wikibugs>	 SRE, DC-Ops, Infrastructure-Foundations, Traffic, netops: Management LAN in eqsin offline due to failure of mr1-eqsin - https://phabricator.wikimedia.org/T341447 (Volans)
[09:22:13] <wikibugs>	 (CR) Jbond: "@Kieth, feel free to merge this yourself if you are happy or we can do it together when you are online" [puppet] - https://gerrit.wikimedia.org/r/930187 (https://phabricator.wikimedia.org/T326657) (owner: Jbond)
[09:23:03] <moritzm>	 !log rebalance ganeti group codfw/A after reboots
[09:23:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1032.eqiad.wmnet
[09:25:09] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.dns.netbox
[09:25:11] <wikibugs>	 (CR) Clément Goubert: [C: +1] services: allow kafka batches in EventGate's main producer [deployment-charts] - https://gerrit.wikimedia.org/r/936515 (https://phabricator.wikimedia.org/T338357) (owner: Elukey)
[09:25:50] <icinga-wm>	 PROBLEM - Host dse-k8s-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100%
[09:25:51] <wikibugs>	 (CR) Muehlenhoff: [C: +2] mirrors: Add monitoring for mirrors [puppet] - https://gerrit.wikimedia.org/r/836775 (owner: Muehlenhoff)
[09:26:40] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/934371 (owner: Muehlenhoff)
[09:26:42] <icinga-wm>	 PROBLEM - Host aux-k8s-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100%
[09:27:00] <icinga-wm>	 PROBLEM - Host kubestagetcd1004 is DOWN: PING CRITICAL - Packet loss = 100%
[09:28:03] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudlb1001 - aborrero@cumin1001"
[09:29:34] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudlb1001 - aborrero@cumin1001"
[09:29:34] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:30:36] <icinga-wm>	 RECOVERY - Host kubestagetcd1004 is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms
[09:30:56] <icinga-wm>	 RECOVERY - Host aux-k8s-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms
[09:31:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1032.eqiad.wmnet
[09:31:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1032.eqiad.wmnet
[09:31:26] <icinga-wm>	 RECOVERY - Host dse-k8s-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms
[09:31:42] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.dns.netbox
[09:33:43] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudlb1002 - aborrero@cumin1001"
[09:33:46] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "Very good job overall! Your tests don't pass because you need to provide a list of kafka brokers to your tests for deployments, that's don" [deployment-charts] - https://gerrit.wikimedia.org/r/935771 (owner: Kamila Součková)
[09:33:56] <icinga-wm>	 PROBLEM - purged service on cp2037 is CRITICAL: CRITICAL - Expecting active but unit purged is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:34:11] <wikibugs>	 (PS1) Btullis: Enable the kafka-setup job for datahub in staging [deployment-charts] - https://gerrit.wikimedia.org/r/936656 (https://phabricator.wikimedia.org/T329514)
[09:34:28] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/936304 (https://phabricator.wikimedia.org/T338357) (owner: Elukey)
[09:34:51] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf LDAP group for Urbanecm - https://phabricator.wikimedia.org/T341443 (Urbanecm_WMF)
[09:35:04] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudlb1002 - aborrero@cumin1001"
[09:35:04] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:35:08] <wikibugs>	 (CR) Elukey: [C: +2] profile::kafka: update prometheus config [puppet] - https://gerrit.wikimedia.org/r/936304 (https://phabricator.wikimedia.org/T338357) (owner: Elukey)
[09:35:28] <icinga-wm>	 RECOVERY - purged service on cp2037 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:35:35] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudlb1001.eqiad.wmnet with reason: host reimage
[09:37:17] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Core, Puppet (Puppet 7.0): Puppet Profiler - https://phabricator.wikimedia.org/T341448 (jbond)
[09:37:26] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Core, Puppet (Puppet 7.0): Puppet Profiler - https://phabricator.wikimedia.org/T341448 (jbond) p:Triage→Medium
[09:38:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1033.eqiad.wmnet
[09:38:42] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudlb1001.eqiad.wmnet with reason: host reimage
[09:39:16] <icinga-wm>	 ACKNOWLEDGEMENT - Host mr1-eqsin.oob IPv6 is DOWN: CRITICAL - Destination Unreachable (2403:b100:3001:9::2) Cathal Mooney mr1-eqsin down - The acknowledgement expires at: 2023-07-12 09:38:54.
[09:39:26] <wikibugs>	 SRE, Observability-Alerting, Traffic, collaboration-services, serviceops-radar: Timeouts when talking to phabricator API - https://phabricator.wikimedia.org/T341039 (fgiunchedi) Open→Resolved a:fgiunchedi >>! In T341039#8995349, @Aklapper wrote: > Hmm. The problem //could// be rel...
[09:39:28] <icinga-wm>	 ACKNOWLEDGEMENT - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100% Cathal Mooney mr1-eqsin down - The acknowledgement expires at: 2023-07-12 09:39:17.
[09:39:32] <moritzm>	 !log rebalance ganeti group codfw/B after reboots
[09:39:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:46] <icinga-wm>	 ACKNOWLEDGEMENT - Juniper alarms on asw1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.132.128.4 Cathal Mooney mr1-eqsin down - The acknowledgement expires at: 2023-07-12 09:39:33. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[09:39:46] <icinga-wm>	 ACKNOWLEDGEMENT - Host asw1-eqsin is DOWN: PING CRITICAL - Packet loss = 100% Cathal Mooney mr1-eqsin down - The acknowledgement expires at: 2023-07-12 09:39:33.
[09:40:02] <icinga-wm>	 ACKNOWLEDGEMENT - BFD status on cr2-eqsin is CRITICAL: Down: 1 Cathal Mooney mr1-eqsin down - The acknowledgement expires at: 2023-07-12 09:39:52. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:40:12] <icinga-wm>	 ACKNOWLEDGEMENT - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP Cathal Mooney mr1-eqsin down - The acknowledgement expires at: 2023-07-12 09:40:03. https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:40:24] <icinga-wm>	 ACKNOWLEDGEMENT - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP Cathal Mooney mr1-eqsin down - The acknowledgement expires at: 2023-07-12 09:40:13. https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:41:07] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: codfw1dev: open radosgw API to the internet [puppet] - https://gerrit.wikimedia.org/r/936657 (https://phabricator.wikimedia.org/T341380)
[09:41:14] <icinga-wm>	 ACKNOWLEDGEMENT - ps1-604-eqsin-infeed-load-tower-B-single-phase on ps1-604-eqsin is CRITICAL: CRITICAL - Plugin timed out while executing system call Cathal Mooney mr1-eqsin down - The acknowledgement expires at: 2023-07-12 09:41:04. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:41:14] <icinga-wm>	 ACKNOWLEDGEMENT - ps1-604-eqsin-infeed-load-tower-A-single-phase on ps1-604-eqsin is CRITICAL: CRITICAL - Plugin timed out while executing system call Cathal Mooney mr1-eqsin down - The acknowledgement expires at: 2023-07-12 09:41:04. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:41:27] <icinga-wm>	 ACKNOWLEDGEMENT - ps1-603-eqsin-infeed-load-tower-B-single-phase on ps1-603-eqsin is CRITICAL: CRITICAL - Plugin timed out while executing system call Cathal Mooney mr1-eqsin down - The acknowledgement expires at: 2023-07-12 09:41:16. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:41:27] <icinga-wm>	 ACKNOWLEDGEMENT - ps1-603-eqsin-infeed-load-tower-A-single-phase on ps1-603-eqsin is CRITICAL: CRITICAL - Plugin timed out while executing system call Cathal Mooney mr1-eqsin down - The acknowledgement expires at: 2023-07-12 09:41:16. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:41:48] <wikibugs>	 (CR) Muehlenhoff: [C: +2] profile::java: Remove support for Stretch [puppet] - https://gerrit.wikimedia.org/r/934371 (owner: Muehlenhoff)
[09:41:52] <icinga-wm>	 ACKNOWLEDGEMENT - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 Cathal Mooney mr1-eqsin down - The acknowledgement expires at: 2023-07-12 09:41:42. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:41:52] <icinga-wm>	 ACKNOWLEDGEMENT - Juniper alarms on mr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 103.102.166.128 Cathal Mooney mr1-eqsin down - The acknowledgement expires at: 2023-07-12 09:41:42. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[09:44:50] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.dns.wipe-cache cloudlb1002.private.eqiad.wikimedia.cloud on all recursors
[09:44:53] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudlb1002.private.eqiad.wikimedia.cloud on all recursors
[09:45:50] <wikibugs>	 (CR) Btullis: [C: +2] Enable the kafka-setup job for datahub in staging [deployment-charts] - https://gerrit.wikimedia.org/r/936656 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[09:46:47] <wikibugs>	 (Merged) jenkins-bot: Enable the kafka-setup job for datahub in staging [deployment-charts] - https://gerrit.wikimedia.org/r/936656 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[09:49:24] <wikibugs>	 (PS4) Arturo Borrero Gonzalez: templates/56.15.185.in-addr.arpa: delegate 185.15.56.0/25 to designate @ eqiad1 [dns] - https://gerrit.wikimedia.org/r/936257 (https://phabricator.wikimedia.org/T341338)
[09:49:41] <wikibugs>	 (CR) David Caro: [C: +2] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/936373 (https://phabricator.wikimedia.org/T325466) (owner: Majavah)
[09:50:31] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp2037.codfw.wmnet with reason: vgutierrez debugging
[09:50:32] <wikibugs>	 (CR) David Caro: [C: +2] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/936376 (https://phabricator.wikimedia.org/T325466) (owner: Majavah)
[09:50:44] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp2037.codfw.wmnet with reason: vgutierrez debugging
[09:52:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:52:51] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1033.eqiad.wmnet
[09:52:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1033.eqiad.wmnet
[09:53:00] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[09:53:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:53:21] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1033.eqiad.wmnet
[09:56:09] <claime>	 Checking parsoid latency
[09:56:31] <claime>	 Because it's getting like 300 rps so it shouldn't really be overloaded...
[09:57:14] <wikibugs>	 (PS1) Btullis: Bump the datahub top-level chart [deployment-charts] - https://gerrit.wikimedia.org/r/936658 (https://phabricator.wikimedia.org/T329514)
[09:58:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:59:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:59:30] <claime>	 It's hovering right around the threshold and flapping
[10:00:06] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230710T1000)
[10:01:59] <claime>	 It actually started ramping up during the night and hasn't really come down
[10:02:28] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - aborrero@cumin1001"
[10:03:11] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - aborrero@cumin1001"
[10:03:12] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudlb1001.eqiad.wmnet with OS bullseye
[10:03:19] <wikibugs>	 SRE, ops-eqiad, User-aborrero, cloud-services-team (FY2022/2023-Q4): rename cloudswift1001 as cloudlb1001 - https://phabricator.wikimedia.org/T341200 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudlb1001.eqiad.wmnet with OS bullseye completed...
[10:05:32] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudlb1002.eqiad.wmnet with OS bullseye
[10:11:36] <logmsgbot>	 !log aborrero@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudlb1002.eqiad.wmnet with OS bullseye
[10:11:39] <wikibugs>	 (PS1) Majavah: P:openstack: move magnum fw rules to haproxy profile [puppet] - https://gerrit.wikimedia.org/r/936663
[10:11:41] <wikibugs>	 (PS1) Majavah: P:openstack: open eqiad1 magnum api to the public [puppet] - https://gerrit.wikimedia.org/r/936664
[10:11:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1033.eqiad.wmnet
[10:12:27] <claime>	 !log repooling parse1012.eqiad.wmnet
[10:12:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:36] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=parsoid,name=parse1012.*
[10:13:07] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudlb1002.eqiad.wmnet with OS bullseye
[10:13:32] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on parse1012 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[10:14:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1033.eqiad.wmnet
[10:19:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:21:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1033.eqiad.wmnet
[10:21:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1033.eqiad.wmnet
[10:21:46] <wikibugs>	 (PS5) D3r1ck01: wmf-config: Remove wgContentTranslationDefaultParsoidClient cleanup [mediawiki-config] - https://gerrit.wikimedia.org/r/930798
[10:21:48] <wikibugs>	 (PS1) JMeybohm: k8s::apiserver: Implement kube-apiserver reload [puppet] - https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826)
[10:22:12] <wikibugs>	 (CR) CI reject: [V: -1] k8s::apiserver: Implement kube-apiserver reload [puppet] - https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[10:23:09] <wikibugs>	 (CR) Btullis: [C: +2] Bump the datahub top-level chart [deployment-charts] - https://gerrit.wikimedia.org/r/936658 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[10:23:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:23:53] <wikibugs>	 (Merged) jenkins-bot: Bump the datahub top-level chart [deployment-charts] - https://gerrit.wikimedia.org/r/936658 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[10:25:28] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudlb1002.eqiad.wmnet with reason: host reimage
[10:26:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1034.eqiad.wmnet
[10:26:21] <wikibugs>	 (PS2) JMeybohm: k8s::apiserver: Implement kube-apiserver reload [puppet] - https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826)
[10:27:10] <icinga-wm>	 PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:28:23] <icinga-wm>	 RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:28:39] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudlb1002.eqiad.wmnet with reason: host reimage
[10:31:07] <wikibugs>	 (PS1) Vgutierrez: Revert "trafficserver: add gateway routing script, route device-analytics on cp2037" [puppet] - https://gerrit.wikimedia.org/r/936422
[10:33:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:33:21] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:34:37] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[10:35:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:38:54] <wikibugs>	 (CR) Vgutierrez: [C: +2] Revert "trafficserver: add gateway routing script, route device-analytics on cp2037" [puppet] - https://gerrit.wikimedia.org/r/936422 (owner: Vgutierrez)
[10:40:26] <wikibugs>	 (CR) Elukey: [C: +2] services: allow kafka batches in EventGate's main producer [deployment-charts] - https://gerrit.wikimedia.org/r/936515 (https://phabricator.wikimedia.org/T338357) (owner: Elukey)
[10:42:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:43:25] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.remove-downtime for cp2037.codfw.wmnet
[10:43:25] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp2037.codfw.wmnet
[10:43:48] <wikibugs>	 (PS1) Hashar: Review access change [dns] (refs/meta/config) - https://gerrit.wikimedia.org/r/936423
[10:44:35] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: sync
[10:44:44] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: sync
[10:45:12] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[10:45:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:45:44] <wikibugs>	 (PS2) Hashar: Grant permission to ldap/dns-admins [dns] (refs/meta/config) - https://gerrit.wikimedia.org/r/936423 (https://phabricator.wikimedia.org/T341440)
[10:46:22] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good, thanks!" [dns] (refs/meta/config) - https://gerrit.wikimedia.org/r/936423 (https://phabricator.wikimedia.org/T341440) (owner: Hashar)
[10:46:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1034.eqiad.wmnet
[10:46:47] <wikibugs>	 (CR) Hashar: [V: +2 C: +2] Grant permission to ldap/dns-admins [dns] (refs/meta/config) - https://gerrit.wikimedia.org/r/936423 (https://phabricator.wikimedia.org/T341440) (owner: Hashar)
[10:47:02] <_joe_>	 claime: looks like parsoid's latency went down suddenly
[10:47:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:49:06] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmde for Ifrahkhanyaree (Ifrah_WMDE) - https://phabricator.wikimedia.org/T341455 (Ifrahkhanyaree)
[10:49:31] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudlb1002.eqiad.wmnet with OS bullseye
[10:50:14] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: sync
[10:50:33] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmde for Ifrahkhanyaree (Ifrah_WMDE) - https://phabricator.wikimedia.org/T341455 (WMDE-leszek) I confirm Ifrah uses the account mentioned and she's a Product Manager employed at WMDE. Thank you for processing the request.
[10:50:37] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync
[10:51:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:51:37] <icinga-wm>	 PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:53:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1034.eqiad.wmnet
[10:54:15] <wikibugs>	 SRE, Infrastructure-Foundations, fundraising-tech-ops, Patch-For-Review: Create a new group dns-admins - https://phabricator.wikimedia.org/T341440 (hashar) The Gerrit configuration change grants members of dns-admins {nav Code-Review +2} and {nav Submit} which should be all what is needed. Note t...
[10:55:05] <moritzm>	 !log failover ganeti master in eqiad to ganeti1029
[10:55:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:57:29] <icinga-wm>	 RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:57:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:58:19] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti1028 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[10:59:23] <icinga-wm>	 PROBLEM - HTTPS Ganeti RAPI eqiad on ganeti1028 is CRITICAL: connect to address ganeti01.svc.eqiad.wmnet and port 5080: No route to host https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon
[11:00:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:02:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:03:21] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:05:36] <wikibugs>	 (PS1) Btullis: Disable the kafka-setup job in datahub [deployment-charts] - https://gerrit.wikimedia.org/r/936670 (https://phabricator.wikimedia.org/T329514)
[11:05:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6003.drmrs.wmnet
[11:09:35] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_eqiad_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:09:44] <wikibugs>	 (CR) Btullis: [C: +2] Disable the kafka-setup job in datahub [deployment-charts] - https://gerrit.wikimedia.org/r/936670 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[11:10:30] <wikibugs>	 (Merged) jenkins-bot: Disable the kafka-setup job in datahub [deployment-charts] - https://gerrit.wikimedia.org/r/936670 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[11:10:49] <wikibugs>	 (PS3) JMeybohm: k8s::apiserver: Implement kube-apiserver reload [puppet] - https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826)
[11:11:05] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti6003.drmrs.wmnet
[11:11:43] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[11:12:21] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_eqiad_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:13:16] <wikibugs>	 (PS4) JMeybohm: k8s::apiserver: Implement kube-apiserver reload [puppet] - https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826)
[11:14:27] <icinga-wm>	 PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:14:32] <moritzm>	 !log remove unused VM netflow6002 T330884
[11:14:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:35] <stashbot>	 T330884: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884
[11:14:53] <wikibugs>	 SRE, Continuous-Integration-Infrastructure: Puppet package_builder module should have the apt cache auto cleaned - https://phabricator.wikimedia.org/T339251 (hashar) Should be good now. I have previously removed all caches from the CI instances so it is unlikely we can check the result of this change the...
[11:15:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 28 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: WIP
[11:15:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 28 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: WIP
[11:16:01] <icinga-wm>	 RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:16:13] <wikibugs>	 (PS5) JMeybohm: k8s::apiserver: Implement kube-apiserver reload [puppet] - https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826)
[11:22:33] <wikibugs>	 (PS1) Muehlenhoff: Add dns-admins to list of sensitive groups [puppet] - https://gerrit.wikimedia.org/r/936674 (https://phabricator.wikimedia.org/T341440)
[11:22:49] <wikibugs>	 (PS2) Vivian Rook: P:openstack: open eqiad1 magnum api to the public [puppet] - https://gerrit.wikimedia.org/r/936664 (https://phabricator.wikimedia.org/T341459) (owner: Majavah)
[11:23:20] <wikibugs>	 (PS1) Btullis: Use plaintext port 8080 for local schema registry in datahub [deployment-charts] - https://gerrit.wikimedia.org/r/936675 (https://phabricator.wikimedia.org/T329514)
[11:23:24] <wikibugs>	 (PS2) Vivian Rook: P:openstack: move magnum fw rules to haproxy profile [puppet] - https://gerrit.wikimedia.org/r/936663 (https://phabricator.wikimedia.org/T341459) (owner: Majavah)
[11:23:35] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[11:23:38] <wikibugs>	 (CR) Vivian Rook: [C: +1] P:openstack: move magnum fw rules to haproxy profile [puppet] - https://gerrit.wikimedia.org/r/936663 (https://phabricator.wikimedia.org/T341459) (owner: Majavah)
[11:23:44] <wikibugs>	 (CR) Vivian Rook: [C: +1] P:openstack: open eqiad1 magnum api to the public [puppet] - https://gerrit.wikimedia.org/r/936664 (https://phabricator.wikimedia.org/T341459) (owner: Majavah)
[11:24:38] <wikibugs>	 SRE, Infrastructure-Foundations, fundraising-tech-ops, Patch-For-Review: Create a new group dns-admins - https://phabricator.wikimedia.org/T341440 (MoritzMuehlenhoff) The new group has been documented under https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups#Primary_groups
[11:24:58] <wikibugs>	 (CR) Btullis: [C: +2] Use plaintext port 8080 for local schema registry in datahub [deployment-charts] - https://gerrit.wikimedia.org/r/936675 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[11:25:45] <wikibugs>	 (CR) CI reject: [V: -1] P:openstack: open eqiad1 magnum api to the public [puppet] - https://gerrit.wikimedia.org/r/936664 (https://phabricator.wikimedia.org/T341459) (owner: Majavah)
[11:26:02] <wikibugs>	 (CR) CI reject: [V: -1] P:openstack: move magnum fw rules to haproxy profile [puppet] - https://gerrit.wikimedia.org/r/936663 (https://phabricator.wikimedia.org/T341459) (owner: Majavah)
[11:26:04] <wikibugs>	 (Merged) jenkins-bot: Use plaintext port 8080 for local schema registry in datahub [deployment-charts] - https://gerrit.wikimedia.org/r/936675 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[11:26:57] <icinga-wm>	 PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:27:33] <wikibugs>	 SRE, Infrastructure-Foundations, fundraising-tech-ops, Patch-For-Review: Create a new group dns-admins - https://phabricator.wikimedia.org/T341440 (MoritzMuehlenhoff) @Jgreen and @Dwisehaupt I have removed you from the cn=ops LDAP group and added you to cn=dns-admins (which has the permissions to...
[11:28:41] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[11:28:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6003.drmrs.wmnet
[11:29:07] <wikibugs>	 (PS3) Majavah: P:openstack: move magnum fw rules to haproxy profile [puppet] - https://gerrit.wikimedia.org/r/936663 (https://phabricator.wikimedia.org/T341459)
[11:29:09] <wikibugs>	 (PS3) Majavah: P:openstack: open eqiad1 magnum api to the public [puppet] - https://gerrit.wikimedia.org/r/936664 (https://phabricator.wikimedia.org/T341459)
[11:29:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6003.drmrs.wmnet
[11:30:30] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[11:31:44] <icinga-wm>	 RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:34:02] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[11:35:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6003.drmrs.wmnet
[11:36:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6003.drmrs.wmnet
[11:36:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:38:53] <wikibugs>	 (CR) Vivian Rook: [C: +1] P:openstack: open eqiad1 magnum api to the public [puppet] - https://gerrit.wikimedia.org/r/936664 (https://phabricator.wikimedia.org/T341459) (owner: Majavah)
[11:38:58] <wikibugs>	 (CR) Vivian Rook: [C: +1] P:openstack: move magnum fw rules to haproxy profile [puppet] - https://gerrit.wikimedia.org/r/936663 (https://phabricator.wikimedia.org/T341459) (owner: Majavah)
[11:39:26] <wikibugs>	 (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/935882
[11:41:00] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:42:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6004.drmrs.wmnet
[11:49:21] <wikibugs>	 (PS1) Jgreen: Remove payments-listener-old.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/936686 (https://phabricator.wikimedia.org/T340128)
[11:49:59] <claime>	 _joe_: It went down right after I re-added parse1012, then went back up
[11:50:04] <claime>	 It's being really spiky
[11:51:31] <wikibugs>	 (PS1) Jbond: config-master: drop ssh-fingerprints.txt  file [puppet] - https://gerrit.wikimedia.org/r/936691 (https://phabricator.wikimedia.org/T340947)
[11:51:33] <wikibugs>	 (PS1) Jbond: ssh :switch to using exported resources [puppet] - https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947)
[11:52:10] <wikibugs>	 (CR) CI reject: [V: -1] ssh :switch to using exported resources [puppet] - https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) (owner: Jbond)
[11:52:12] <vgutierrez>	 !log repool cp2037 (debugging finished) - T320967
[11:52:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:15] <stashbot>	 T320967: [AQS 2.0] New Service Request device_analytics - https://phabricator.wikimedia.org/T320967
[11:53:56] <icinga-wm>	 PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:54:04] <wikibugs>	 (03CR) 10Jgreen: [C: 03+2] Remove payments-listener-old.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/936686 (https://phabricator.wikimedia.org/T340128) (owner: 10Jgreen)
[11:54:14] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] config-master: drop ssh-fingerprints.txt  file [puppet] - 10https://gerrit.wikimedia.org/r/936691 (https://phabricator.wikimedia.org/T340947) (owner: 10Jbond)
[11:55:31] <moritzm>	 !log installing avahi security updates
[11:55:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:55:44] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:55:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6004.drmrs.wmnet
[11:56:30] <wikibugs>	 (03PS6) 10JMeybohm: k8s::apiserver: Implement kube-apiserver reload [puppet] - 10https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826)
[11:57:10] <wikibugs>	 (03PS2) 10Jbond: ssh :switch to using exported resources [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947)
[11:57:40] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ssh :switch to using exported resources [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) (owner: 10Jbond)
[11:57:59] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10Patch-For-Review: Create a new group dns-admins - https://phabricator.wikimedia.org/T341440 (10Jgreen) >>! In T341440#9000902, @MoritzMuehlenhoff wrote: > @Jgreen and @Dwisehaupt I have removed you from the cn=ops LDAP group and added you to cn...
[11:58:36] <wikibugs>	 10SRE, 10Patch-For-Review, 10Tracking-Neverending: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (10Vgutierrez)
[11:58:53] <wikibugs>	 10SRE: Cronspam from acmechief-test1001 - https://phabricator.wikimedia.org/T295770 (10Vgutierrez) 05Open→03Resolved Instance hasn't produced cronspam since Nov 2021
[12:01:36] <icinga-wm>	 PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:02:12] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10Patch-For-Review: Create a new group dns-admins - https://phabricator.wikimedia.org/T341440 (10MoritzMuehlenhoff) Hmmh, won't you need additional sudo privileges to run dnsauth-update? Or did you trigger this indirectly via the sre.dns.netbox c...
[12:02:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6004.drmrs.wmnet
[12:02:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6004.drmrs.wmnet
[12:04:55] <moritzm>	 !log failover ganeti masters in drmrs
[12:04:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:06] <wikibugs>	 (03PS7) 10JMeybohm: k8s::apiserver: Implement kube-apiserver reload [puppet] - 10https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826)
[12:08:30] <jinxer-wm>	 (Not accepting/receiving prefixes from anycast BGP peer) firing: (2) Alert for device cr1-eqiad.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[12:09:10] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti6001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[12:09:32] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti6002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[12:09:52] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10Patch-For-Review: Create a new group dns-admins - https://phabricator.wikimedia.org/T341440 (10Jgreen) >>! In T341440#9000960, @MoritzMuehlenhoff wrote: > Hmmh, won't you need additional sudo privileges to run dnsauth-update? Or did you trigger...
[12:10:07] <claime>	 topranks: I assume your email for CRT-009240 is related to the cr1 alerts above?
[12:10:57] <topranks>	 yep the cr1-eqiad and cr1-drmrs alerts 
[12:11:14] <topranks>	 I've noticed the cloudsw one now also, that's probably new host cloudlb but I'll have a look 
[12:12:09] <claime>	 ack
[12:12:26] <icinga-wm>	 RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:16:16] <wikibugs>	 (03PS5) 10Arturo Borrero Gonzalez: templates/56.15.185.in-addr.arpa: delegate 185.15.56.0/25 to designate @ eqiad1 [dns] - 10https://gerrit.wikimedia.org/r/936257 (https://phabricator.wikimedia.org/T341338)
[12:18:00] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Direct 0.5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341078 (10Clement_Goubert) 05In progress→03Resolved
[12:18:12] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert)
[12:18:24] <wikibugs>	 10SRE-Sprint-Week-Sustainability-March2023, 10Phabricator, 10collaboration-services, 10serviceops-radar, and 2 others: Phabricator: Unable to view tasks in DB read-only mode - https://phabricator.wikimedia.org/T313879 (10Aklapper) 05Stalled→03Open >>! In T313879#8531556, @LSobanski wrote: > To be inves...
[12:18:34] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [dns] - 10https://gerrit.wikimedia.org/r/936257 (https://phabricator.wikimedia.org/T341338) (owner: 10Arturo Borrero Gonzalez)
[12:18:38] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] templates/56.15.185.in-addr.arpa: delegate 185.15.56.0/25 to designate @ eqiad1 (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/936257 (https://phabricator.wikimedia.org/T341338) (owner: 10Arturo Borrero Gonzalez)
[12:19:15] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Direct 1% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341463 (10Clement_Goubert)
[12:19:23] <wikibugs>	 10SRE, 10vm-requests: eqiad: 1 VM requested for karapace in support of datahub in staging - https://phabricator.wikimedia.org/T341464 (10BTullis) p:05Triage→03High
[12:19:32] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert)
[12:19:44] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Direct 1% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341463 (10Clement_Goubert) 05Open→03In progress p:05Triage→03High
[12:20:23] <wikibugs>	 (03PS6) 10Arturo Borrero Gonzalez: templates/56.15.185.in-addr.arpa: delegate 185.15.56.0/25 to designate @ eqiad1 [dns] - 10https://gerrit.wikimedia.org/r/936257 (https://phabricator.wikimedia.org/T341338)
[12:20:57] <wikibugs>	 (03PS7) 10Arturo Borrero Gonzalez: templates/56.15.185.in-addr.arpa: delegate 185.15.56.0/25 to designate @ eqiad1 [dns] - 10https://gerrit.wikimedia.org/r/936257 (https://phabricator.wikimedia.org/T341338)
[12:21:09] <wikibugs>	 (03PS1) 10Clément Goubert: mw-on-k8s: Redirect 1% of all traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/936697 (https://phabricator.wikimedia.org/T341078)
[12:21:34] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: templates/56.15.185.in-addr.arpa: delegate 185.15.56.0/25 to designate @ eqiad1 (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/936257 (https://phabricator.wikimedia.org/T341338) (owner: 10Arturo Borrero Gonzalez)
[12:22:07] <wikibugs>	 (03CR) 10Majavah: [C: 03+1] templates/56.15.185.in-addr.arpa: delegate 185.15.56.0/25 to designate @ eqiad1 [dns] - 10https://gerrit.wikimedia.org/r/936257 (https://phabricator.wikimedia.org/T341338) (owner: 10Arturo Borrero Gonzalez)
[12:22:21] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] templates/56.15.185.in-addr.arpa: delegate 185.15.56.0/25 to designate @ eqiad1 [dns] - 10https://gerrit.wikimedia.org/r/936257 (https://phabricator.wikimedia.org/T341338) (owner: 10Arturo Borrero Gonzalez)
[12:22:30] <wikibugs>	 (03PS2) 10Clément Goubert: mw-on-k8s: Redirect 1% of all traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/936697 (https://phabricator.wikimedia.org/T341078)
[12:24:27] <godog>	 btullis: FYI datahub-mae-consumer-main container is spamming a ton of exceptions in logs on kubestage
[12:25:03] <btullis>	 godog: Sorry, will destroy the deployment now.
[12:25:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:25:24] <godog>	 btullis: ack, thank you
[12:26:15] <btullis>	 godog: done.
[12:26:54] <wikibugs>	 (03PS3) 10Jbond: ssh :switch to using exported resources [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947)
[12:27:03] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] mw-on-k8s: Redirect 1% of all traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/936697 (https://phabricator.wikimedia.org/T341078) (owner: 10Clément Goubert)
[12:27:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ssh :switch to using exported resources [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) (owner: 10Jbond)
[12:28:00] <godog>	 cheers
[12:29:43] <wikibugs>	 (03PS3) 10Clément Goubert: mw-on-k8s: Redirect 1% of all traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/936697 (https://phabricator.wikimedia.org/T341463)
[12:30:08] <icinga-wm>	 PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:30:52] <wikibugs>	 (03PS4) 10Jbond: ssh :switch to using exported resources [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947)
[12:31:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ssh :switch to using exported resources [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) (owner: 10Jbond)
[12:32:32] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/936674 (https://phabricator.wikimedia.org/T341440) (owner: 10Muehlenhoff)
[12:33:28] <claime>	 !log Sending 1% of global traffic to mw-on-k8s - T341463
[12:33:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:32] <stashbot>	 T341463: Direct 1% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341463
[12:33:32] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: Redirect 1% of all traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/936697 (https://phabricator.wikimedia.org/T341463) (owner: 10Clément Goubert)
[12:33:47] <wikibugs>	 (03PS8) 10JMeybohm: k8s::apiserver: Implement kube-apiserver reload [puppet] - 10https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826)
[12:34:22] <claime>	 !log Running puppet on cp-text hosts - T341463
[12:34:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:08] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_eqiad_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:37:20] <icinga-wm>	 RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:37:34] <wikibugs>	 (03PS9) 10JMeybohm: k8s::apiserver: Implement kube-apiserver reload [puppet] - 10https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826)
[12:39:18] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42370/console" [puppet] - 10https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm)
[12:41:41] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: BFD flapping from cloudsw1-c8-eqiad (QFX5100) - https://phabricator.wikimedia.org/T341466 (10cmooney) p:05Triage→03Medium
[12:43:26] <wikibugs>	 (03PS5) 10Jbond: ssh :switch to using exported resources [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947)
[12:43:55] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ssh :switch to using exported resources [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) (owner: 10Jbond)
[12:46:38] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_eqiad_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:50:24] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:50:25] <wikibugs>	 (03PS6) 10Jbond: ssh :switch to using exported resources [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947)
[12:50:51] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ssh :switch to using exported resources [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) (owner: 10Jbond)
[12:54:34] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Management LAN in eqsin offline due to failure of mr1-eqsin - https://phabricator.wikimedia.org/T341447 (10RobH) Order Number - 1-228138359365 entered for remote hands to power cycle the device and reply back to the ticket to let us...
[12:54:38] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_eqiad_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:55:22] <wikibugs>	 (03PS7) 10Jbond: ssh :switch to using exported resources [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947)
[12:55:28] <wikibugs>	 10SRE, 10vm-requests: eqiad: 1 VM requested for karapace in support of datahub in staging - https://phabricator.wikimedia.org/T341464 (10MoritzMuehlenhoff) Looks good
[12:58:07] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ssh :switch to using exported resources [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) (owner: 10Jbond)
[12:59:30] <wikibugs>	 (03PS8) 10Jbond: ssh :switch to using exported resources [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947)
[12:59:38] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add dns-admins to list of sensitive groups [puppet] - 10https://gerrit.wikimedia.org/r/936674 (https://phabricator.wikimedia.org/T341440) (owner: 10Muehlenhoff)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230710T1300).
[13:00:05] <jouncebot>	 arlolra: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:35] <Lucas_WMDE>	 I can deploy
[13:00:47] <wikibugs>	 (03PS1) 10Fabfur: hiera: add silent-drop directives for http frontend [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983)
[13:02:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ssh :switch to using exported resources [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) (owner: 10Jbond)
[13:02:42] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): Disable wgParserEnableLegacyMediaDOM on group2 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936322 (https://phabricator.wikimedia.org/T314318) (owner: 10Arlolra)
[13:03:32] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-aborrero: gerrit.w.o is not included in https://config-master.wikimedia.org/known_hosts - https://phabricator.wikimedia.org/T340947 (10jbond) i have a patch out however id like to sort the results before merging this which will be much easier...
[13:03:55] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): Disable wgParserEnableLegacyMediaDOM on group2 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936322 (https://phabricator.wikimedia.org/T314318) (owner: 10Arlolra)
[13:04:21] <wikibugs>	 (03CR) 10Arlolra: Disable wgParserEnableLegacyMediaDOM on group2 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936322 (https://phabricator.wikimedia.org/T314318) (owner: 10Arlolra)
[13:04:49] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Disable wgParserEnableLegacyMediaDOM on group2 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936322 (https://phabricator.wikimedia.org/T314318) (owner: 10Arlolra)
[13:04:53] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): Disable wgParserEnableLegacyMediaDOM on group2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936322 (https://phabricator.wikimedia.org/T314318) (owner: 10Arlolra)
[13:05:00] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936322 (https://phabricator.wikimedia.org/T314318) (owner: 10Arlolra)
[13:05:43] <wikibugs>	 (03Merged) 10jenkins-bot: Disable wgParserEnableLegacyMediaDOM on group2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936322 (https://phabricator.wikimedia.org/T314318) (owner: 10Arlolra)
[13:05:59] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:936322|Disable wgParserEnableLegacyMediaDOM on group2 wikis (T314318)]]
[13:06:03] <stashbot>	 T314318: Disable wgParserEnableLegacyMediaDOM on all wikis - https://phabricator.wikimedia.org/T314318
[13:07:29] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10Patch-For-Review: Create a new group dns-admins - https://phabricator.wikimedia.org/T341440 (10MoritzMuehlenhoff) >>! In T341440#9001013, @Jgreen wrote: >>>! In T341440#9000960, @MoritzMuehlenhoff wrote: >> Hmmh, won't you need additional sudo...
[13:07:36] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and arlolra: Backport for [[gerrit:936322|Disable wgParserEnableLegacyMediaDOM on group2 wikis (T314318)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[13:07:54] <Lucas_WMDE>	 arlolra: can you test on mwdebug?
[13:08:12] <arlolra>	 Yup
[13:09:03] <wikibugs>	 (03CR) 10Ssingh: "Ready for review." [debs/pdns-recursor] - 10https://gerrit.wikimedia.org/r/936297 (owner: 10Ssingh)
[13:10:09] <arlolra>	 Lucas_WMDE: looks good
[13:10:35] <Lucas_WMDE>	 alright, syncing then
[13:10:48] <arlolra>	 Thank you
[13:11:10] <wikibugs>	 10SRE, 10Data-Platform-SRE, 10vm-requests: eqiad: 1 VM requested for karapace in support of datahub in staging - https://phabricator.wikimedia.org/T341464 (10BTullis)
[13:11:57] * claime watches mw-on-k8s 503s on deployment
[13:12:01] <claime>	 Did we solve the problem? :D
[13:12:17] <Lucas_WMDE>	 hm?
[13:12:58] <claime>	 Lucas_WMDE: We used to have 503s when redeploying mw-on-k8s because of an improper shutdown order of containers, and kubernetes being weird
[13:13:06] <Lucas_WMDE>	 ah
[13:13:39] <Lucas_WMDE>	 I guess you have a chance to find out? ^^
[13:13:46] <claime>	 'xactly :D
[13:13:48] <Lucas_WMDE>	 my scap already finished running the helmfile parts fwiw
[13:13:51] <claime>	 ack
[13:13:57] <Lucas_WMDE>	 just reached the php-fpm-restart
[13:14:19] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:14:21] <claime>	 I think we got maybe 1 or 2 on api-ext, none on web
[13:14:33] <claime>	 So far so good
[13:14:43] <Lucas_WMDE>	 cool
[13:15:52] <Lucas_WMDE>	 random question – do we know where all these jsonTruncated messages in logstash come from?
[13:16:14] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti1029 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[13:16:15] <Lucas_WMDE>	 the mediawiki-errors raw events list is currently just jsonTruncated, nothing else (among the 1–50 entries) – not extremely useful
[13:16:16] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.ganeti.makevm for new host karapace1002.eqiad.wmnet
[13:16:17] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.dns.netbox
[13:16:26] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:936322|Disable wgParserEnableLegacyMediaDOM on group2 wikis (T314318)]] (duration: 10m 26s)
[13:16:29] <stashbot>	 T314318: Disable wgParserEnableLegacyMediaDOM on all wikis - https://phabricator.wikimedia.org/T314318
[13:16:42] <wikibugs>	 (03PS2) 10Ilias Sarantopoulos: ores extension: deploy LiftWing usage on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935743 (https://phabricator.wikimedia.org/T319170)
[13:16:47] <Lucas_WMDE>	 arlolra: should be done
[13:16:48] <claime>	 Lucas_WMDE: I think it's because it's sending messages that are too big, godog may know more
[13:16:59] <arlolra>	 Lucas_WMDE: great, thank you
[13:17:05] <Amir1>	 Lucas_WMDE: once done, please ping me, I have a bunch of stuff to deploy
[13:17:11] <Lucas_WMDE>	 Amir1: I’m done, go ahead
[13:17:15] <Amir1>	 oh thanks
[13:17:18] <Lucas_WMDE>	 ^^
[13:17:26] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] ExternalLinks: Make order by and continue only rely on el_id in READ NEW [core] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/935856 (https://phabricator.wikimedia.org/T341000) (owner: 10Ladsgroup)
[13:17:30] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) firing: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[13:17:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PUT secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:17:44] <claime>	 Oh great
[13:17:46] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] ores extension: deploy LiftWing usage on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935743 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos)
[13:17:54] <godog>	 claime Lucas_WMDE ack, I'll check
[13:18:04] <wikibugs>	 10SRE, 10Traffic: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (10ssingh)
[13:18:31] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935743 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos)
[13:18:42] <wikibugs>	 (03Merged) 10jenkins-bot: ores extension: deploy LiftWing usage on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935743 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos)
[13:18:44] <Lucas_WMDE>	 godog: I don’t think it’s particularly new, I just figured I’d ask
[13:18:52] <Lucas_WMDE>	 whether we know what’s sending the too-long messages, that is
[13:18:59] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:935743|ores extension: deploy LiftWing usage on testwiki (T319170)]]
[13:19:02] <stashbot>	 T319170: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170
[13:19:07] <godog>	 Lucas_WMDE: oh ok, yeah I'm not sure right off the bat
[13:19:09] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [debs/pdns-recursor] - 10https://gerrit.wikimedia.org/r/936297 (owner: 10Ssingh)
[13:20:00] <urandom>	 o/
[13:20:22] <logmsgbot>	 !log ladsgroup@deploy1002 isaranto and ladsgroup: Backport for [[gerrit:935743|ores extension: deploy LiftWing usage on testwiki (T319170)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[13:20:24] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:20:26] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM karapace1002.eqiad.wmnet - btullis@cumin1001"
[13:21:10] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM karapace1002.eqiad.wmnet - btullis@cumin1001"
[13:21:10] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:21:10] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.dns.wipe-cache karapace1002.eqiad.wmnet on all recursors
[13:21:13] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) karapace1002.eqiad.wmnet on all recursors
[13:21:19] <hashar>	 Lucas_WMDE: jsonTruncated messages mean logstash received messages from MediaWiki that are too long. An example is logging a large SQL query (like a deadlock when batch inserting a lot of fields)
[13:21:38] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM karapace1002.eqiad.wmnet - btullis@cumin1001"
[13:22:12] <Lucas_WMDE>	 ah ok
[13:22:22] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM karapace1002.eqiad.wmnet - btullis@cumin1001"
[13:22:30] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) resolved: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[13:22:32] <hashar>	 I think there is a dashboard dedicated to them, but one has to look at the truncated raw JSON to find out the source
[13:22:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:23:23] <hashar>	 and I think there is some Grafana board tracking them as well as other logstash ingestion errors.
[13:23:48] <icinga-wm>	 PROBLEM - puppet last run on logstash1025 is CRITICAL: CRITICAL: Puppet has been disabled for 604864 seconds, message: jmm, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:23:59] <jinxer-wm>	 (PuppetDisabled) firing: Puppet disabled on logstash1025:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=logstash&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[13:24:13] <wikibugs>	 (03PS1) 10Ssingh: ntp/eqiad: point to dns1004 [dns] - 10https://gerrit.wikimedia.org/r/936703 (https://phabricator.wikimedia.org/T326685)
[13:26:13] <wikibugs>	 (03CR) 10Fabfur: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/936703 (https://phabricator.wikimedia.org/T326685) (owner: 10Ssingh)
[13:26:32] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] ntp/eqiad: point to dns1004 [dns] - 10https://gerrit.wikimedia.org/r/936703 (https://phabricator.wikimedia.org/T326685) (owner: 10Ssingh)
[13:26:34] <icinga-wm>	 RECOVERY - Host asw1-eqsin is UP: PING OK - Packet loss = 0%, RTA = 243.05 ms
[13:27:00] <icinga-wm>	 RECOVERY - Host mr1-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 239.32 ms
[13:27:02] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host karapace1002.eqiad.wmnet with OS bullseye
[13:27:08] <wikibugs>	 10SRE, 10Data-Platform-SRE, 10vm-requests: eqiad: 1 VM requested for karapace in support of datahub in staging - https://phabricator.wikimedia.org/T341464 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host karapace1002.eqiad.wmnet with OS bullseye
[13:27:08] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync
[13:27:34] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:27:38] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 218.45 ms
[13:27:40] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync
[13:28:02] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:935743|ores extension: deploy LiftWing usage on testwiki (T319170)]] (duration: 09m 02s)
[13:28:05] <stashbot>	 T319170: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170
[13:28:09] <logmsgbot>	 !log bking@cumin1001 conftool action : set/pooled=yes; selector: name=wdqs2020.codfw.wmnet
[13:28:22] <icinga-wm>	 RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:28:40] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_eqiad_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:32:43] <wikibugs>	 (PS1) Btullis: Add a second karapace VM [puppet] - https://gerrit.wikimedia.org/r/936706 (https://phabricator.wikimedia.org/T329514)
[13:33:21] <wikibugs>	 (PS1) Ssingh: dns1005: provision new DNS host in eqiad (hardware refresh) [puppet] - https://gerrit.wikimedia.org/r/936709 (https://phabricator.wikimedia.org/T326685)
[13:33:23] <wikibugs>	 (PS1) Ssingh: dns1006: provision new DNS host in eqiad (hardware refresh) [puppet] - https://gerrit.wikimedia.org/r/936710 (https://phabricator.wikimedia.org/T326685)
[13:33:47] <wikibugs>	 (CR) Btullis: [C: +2] Add a second karapace VM [puppet] - https://gerrit.wikimedia.org/r/936706 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[13:34:24] <wikibugs>	 (Merged) jenkins-bot: ExternalLinks: Make order by and continue only rely on el_id in READ NEW [core] (wmf/1.41.0-wmf.16) - https://gerrit.wikimedia.org/r/935856 (https://phabricator.wikimedia.org/T341000) (owner: Ladsgroup)
[13:35:09] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:935856|ExternalLinks: Make order by and continue only rely on el_id in READ NEW (T341000 T47237)]]
[13:35:15] <stashbot>	 T47237: LinkSearch uses numeric offset paging instead of paging by last entry returned - https://phabricator.wikimedia.org/T47237
[13:36:27] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on karapace1002.eqiad.wmnet with reason: host reimage
[13:36:39] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:935856|ExternalLinks: Make order by and continue only rely on el_id in READ NEW (T341000 T47237)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[13:39:19] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:39:38] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on karapace1002.eqiad.wmnet with reason: host reimage
[13:40:11] <wikibugs>	 (CR) Fabfur: [C: +1] "IP seems correct (checked on NetBox)" [puppet] - https://gerrit.wikimedia.org/r/936709 (https://phabricator.wikimedia.org/T326685) (owner: Ssingh)
[13:40:18] <wikibugs>	 (CR) Fabfur: "IP seems correct (checked on NetBox)" [puppet] - https://gerrit.wikimedia.org/r/936710 (https://phabricator.wikimedia.org/T326685) (owner: Ssingh)
[13:42:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1028.eqiad.wmnet
[13:44:22] <wikibugs>	 (CR) Ssingh: [C: +2] dns1005: provision new DNS host in eqiad (hardware refresh) [puppet] - https://gerrit.wikimedia.org/r/936709 (https://phabricator.wikimedia.org/T326685) (owner: Ssingh)
[13:46:12] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:935856|ExternalLinks: Make order by and continue only rely on el_id in READ NEW (T341000 T47237)]] (duration: 11m 03s)
[13:46:17] <stashbot>	 T47237: LinkSearch uses numeric offset paging instead of paging by last entry returned - https://phabricator.wikimedia.org/T47237
[13:46:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:47:07] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:47:30] <logmsgbot>	 !log sukhe@cumin1001 START - Cookbook sre.hosts.reimage for host dns1005.wikimedia.org with OS bullseye
[13:47:40] <wikibugs>	 SRE, Traffic, Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host dns1005.wikimedia.org with OS bullseye
[13:48:30] <jinxer-wm>	 (Not accepting/receiving prefixes from anycast BGP peer) firing: (3) Alert for device cloudsw1-c8-eqiad.mgmt.eqiad.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[13:48:55] <wikibugs>	 (PS1) Ladsgroup: Set commons to READ_NEW for externallinks migration [mediawiki-config] - https://gerrit.wikimedia.org/r/936716 (https://phabricator.wikimedia.org/T335343)
[13:50:37] <wikibugs>	 (CR) Ladsgroup: [C: +2] Set commons to READ_NEW for externallinks migration [mediawiki-config] - https://gerrit.wikimedia.org/r/936716 (https://phabricator.wikimedia.org/T335343) (owner: Ladsgroup)
[13:51:14] <wikibugs>	 (PS1) Samtar: IS: Enable Phonos on medium projects [mediawiki-config] - https://gerrit.wikimedia.org/r/936717 (https://phabricator.wikimedia.org/T336763)
[13:51:18] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/936716 (https://phabricator.wikimedia.org/T335343) (owner: Ladsgroup)
[13:51:28] <wikibugs>	 (Merged) jenkins-bot: Set commons to READ_NEW for externallinks migration [mediawiki-config] - https://gerrit.wikimedia.org/r/936716 (https://phabricator.wikimedia.org/T335343) (owner: Ladsgroup)
[13:51:41] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:936716|Set commons to READ_NEW for externallinks migration (T335343)]]
[13:51:44] <stashbot>	 T335343: Set externallinks migration stage to read new on beta and production - https://phabricator.wikimedia.org/T335343
[13:52:07] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:52:34] <wikibugs>	 (PS2) Samtar: IS: Enable Phonos on medium projects [mediawiki-config] - https://gerrit.wikimedia.org/r/936717 (https://phabricator.wikimedia.org/T336763)
[13:52:53] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host karapace1002.eqiad.wmnet with OS bullseye
[13:52:53] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host karapace1002.eqiad.wmnet
[13:53:03] <wikibugs>	 SRE, Data-Platform-SRE, vm-requests, Patch-For-Review: eqiad: 1 VM requested for karapace in support of datahub in staging - https://phabricator.wikimedia.org/T341464 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host karapace1002.eqiad.wmnet with OS b...
[13:53:05] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:936716|Set commons to READ_NEW for externallinks migration (T335343)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[13:54:43] <wikibugs>	 SRE, Data-Platform-SRE, vm-requests, Patch-For-Review: eqiad: 1 VM requested for karapace in support of datahub in staging - https://phabricator.wikimedia.org/T341464 (BTullis) Open→Resolved
[13:54:51] <wikibugs>	 (PS1) Gmodena: mw-page-content-change-enrichment partition by (wiki_id, page_id) [deployment-charts] - https://gerrit.wikimedia.org/r/936718 (https://phabricator.wikimedia.org/T338169)
[13:55:27] <wikibugs>	 (PS1) Ssingh: sites.yaml: add new dns host dns1005 (eqiad hardware refresh) [homer/public] - https://gerrit.wikimedia.org/r/936719 (https://phabricator.wikimedia.org/T326685)
[13:55:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1028.eqiad.wmnet
[13:55:29] <wikibugs>	 (PS1) Ssingh: sites.yaml: add new dns host dns1006 (eqiad hardware refresh) [homer/public] - https://gerrit.wikimedia.org/r/936720 (https://phabricator.wikimedia.org/T326685)
[13:58:02] <wikibugs>	 (CR) Fabfur: [C: +1] "LGTM" [homer/public] - https://gerrit.wikimedia.org/r/936720 (https://phabricator.wikimedia.org/T326685) (owner: Ssingh)
[13:58:06] <wikibugs>	 (CR) Fabfur: [C: +1] "LGTM" [homer/public] - https://gerrit.wikimedia.org/r/936719 (https://phabricator.wikimedia.org/T326685) (owner: Ssingh)
[13:58:14] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host contint2002.wikimedia.org
[13:58:26] <wikibugs>	 (CR) Ssingh: [C: +2] Release pdns-recursor 4.8.4-1+wmf11u1. [debs/pdns-recursor] - https://gerrit.wikimedia.org/r/936297 (owner: Ssingh)
[13:59:32] <logmsgbot>	 !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dns1005.wikimedia.org with reason: host reimage
[14:01:03] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:936716|Set commons to READ_NEW for externallinks migration (T335343)]] (duration: 09m 22s)
[14:01:07] <stashbot>	 T335343: Set externallinks migration stage to read new on beta and production - https://phabricator.wikimedia.org/T335343
[14:02:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1028.eqiad.wmnet
[14:02:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1028.eqiad.wmnet
[14:02:54] <logmsgbot>	 !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns1005.wikimedia.org with reason: host reimage
[14:03:58] <icinga-wm>	 PROBLEM - Host an-worker1145 is DOWN: PING CRITICAL - Packet loss = 100%
[14:04:31] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host contint2002.wikimedia.org
[14:05:14] <icinga-wm>	 ACKNOWLEDGEMENT - confd service on an-worker1145 is CRITICAL: CRITICAL - Expecting active but unit confd is activating Btullis Cold booted for T341481 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:05:14] <icinga-wm>	 ACKNOWLEDGEMENT - SSH on an-worker1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Btullis Cold booted for T341481 https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:05:14] <icinga-wm>	 ACKNOWLEDGEMENT - Hadoop DataNode on an-worker1145 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode Btullis Cold booted for T341481 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[14:05:14] <icinga-wm>	 ACKNOWLEDGEMENT - Host an-worker1145 is DOWN: PING CRITICAL - Packet loss = 100% Btullis Cold booted for T341481
[14:05:36] <jinxer-wm>	 (GitLabCIPipelineErrors) firing: GitLab - High pipeline error rate - https://wikitech.wikimedia.org/wiki/GitLab/Runbook - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabCIPipelineErrors
[14:07:26] <icinga-wm>	 RECOVERY - Host an-worker1145 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[14:07:36] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1145 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:38] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1145 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:08:21] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:09:34] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1145 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[14:10:03] <logmsgbot>	 !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[14:10:06] <logmsgbot>	 !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[14:10:36] <jinxer-wm>	 (GitLabCIPipelineErrors) resolved: GitLab - High pipeline error rate - https://wikitech.wikimedia.org/wiki/GitLab/Runbook - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabCIPipelineErrors
[14:13:08] <icinga-wm>	 RECOVERY - confd service on an-worker1145 is OK: OK - confd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:13:26] <wikibugs>	 (PS1) Jelto: gitlab: increase thresholds for GitLab CI alerts [alerts] - https://gerrit.wikimedia.org/r/936722 (https://phabricator.wikimedia.org/T341384)
[14:13:28] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Infrastructure-Foundations, and 2 others: Management LAN in eqsin offline due to failure of mr1-eqsin - https://phabricator.wikimedia.org/T341447 (cmooney) Equinix came back and said they rebooted.  Device is reachable again: ` cmooney@mr1-eqsin> show system uptime  Curren...
[14:13:38] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations: Ripe atlas eqiad reported down in Icinga since 2023-06-27 - https://phabricator.wikimedia.org/T341108 (Jclark-ctr) Open→Resolved
[14:14:19] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:14:48] <icinga-wm>	 RECOVERY - puppet last run on an-worker1145 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[14:15:23] <logmsgbot>	 !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[14:15:27] <logmsgbot>	 !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[14:19:19] <logmsgbot>	 !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[14:19:22] <logmsgbot>	 !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[14:22:37] <logmsgbot>	 !log sukhe@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1001"
[14:22:38] <wikibugs>	 ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T341035 (Jclark-ctr) Open→Resolved Replaced cables , reset idrac
[14:22:40] <logmsgbot>	 !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[14:22:43] <logmsgbot>	 !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[14:23:20] <logmsgbot>	 !log sukhe@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1001"
[14:23:21] <logmsgbot>	 !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns1005.wikimedia.org with OS bullseye
[14:23:31] <wikibugs>	 SRE, Traffic, Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host dns1005.wikimedia.org with OS bullseye completed: - dns1005 (**WARN**)   - Removed from Puppet an...
[14:23:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:26:15] <logmsgbot>	 !log sukhe@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "running manually for dns1005 - sukhe@cumin1001"
[14:27:09] <logmsgbot>	 !log sukhe@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "running manually for dns1005 - sukhe@cumin1001"
[14:27:28] <wikibugs>	 (CR) TChin: [C: +1] mw-page-content-change-enrichment partition by (wiki_id, page_id) [deployment-charts] - https://gerrit.wikimedia.org/r/936718 (https://phabricator.wikimedia.org/T338169) (owner: Gmodena)
[14:28:01] <wikibugs>	 SRE, Observability-Logging, Wikimedia-Logstash, Patch-For-Review, SRE Observability (FY2023/2024-Q1): Logstash SLO excursion on 2023-02-11 - https://phabricator.wikimedia.org/T331461 (lmata)
[14:28:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:28:41] <logmsgbot>	 !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[14:28:46] <wikibugs>	 SRE-OnFire, Observability-Alerting, SRE Observability (FY2023/2024-Q1), Sustainability (Incident Followup): Alert when no data is received from Prometheus in a certain amount of time - https://phabricator.wikimedia.org/T336448 (lmata)
[14:28:47] <logmsgbot>	 !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[14:29:21] <wikibugs>	 (CR) Ssingh: [C: +2] sites.yaml: add new dns host dns1005 (eqiad hardware refresh) [homer/public] - https://gerrit.wikimedia.org/r/936719 (https://phabricator.wikimedia.org/T326685) (owner: Ssingh)
[14:33:18] <fabfur>	 !log add new dns host dns1005
[14:33:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:37] <wikibugs>	 (CR) Gmodena: [C: +2] mw-page-content-change-enrichment partition by (wiki_id, page_id) [deployment-charts] - https://gerrit.wikimedia.org/r/936718 (https://phabricator.wikimedia.org/T338169) (owner: Gmodena)
[14:38:16] <wikibugs>	 (CR) Dzahn: [C: +1] gitlab: increase thresholds for GitLab CI alerts [alerts] - https://gerrit.wikimedia.org/r/936722 (https://phabricator.wikimedia.org/T341384) (owner: Jelto)
[14:38:23] <wikibugs>	 (Merged) jenkins-bot: mw-page-content-change-enrichment partition by (wiki_id, page_id) [deployment-charts] - https://gerrit.wikimedia.org/r/936718 (https://phabricator.wikimedia.org/T338169) (owner: Gmodena)
[14:40:34] <wikibugs>	 (CR) Dzahn: [C: +2] gitlab: increase thresholds for GitLab CI alerts [alerts] - https://gerrit.wikimedia.org/r/936722 (https://phabricator.wikimedia.org/T341384) (owner: Jelto)
[14:45:38] <wikibugs>	 (CR) Ssingh: [C: +2] sites.yaml: add new dns host dns1006 (eqiad hardware refresh) [homer/public] - https://gerrit.wikimedia.org/r/936720 (https://phabricator.wikimedia.org/T326685) (owner: Ssingh)
[14:46:20] <logmsgbot>	 !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[14:46:24] <logmsgbot>	 !log tchin@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[14:46:26] <wikibugs>	 (CR) Ssingh: [C: +2] dns1006: provision new DNS host in eqiad (hardware refresh) [puppet] - https://gerrit.wikimedia.org/r/936710 (https://phabricator.wikimedia.org/T326685) (owner: Ssingh)
[14:47:13] <wikibugs>	 (PS2) Ssingh: dns1006: provision new DNS host in eqiad (hardware refresh) [puppet] - https://gerrit.wikimedia.org/r/936710 (https://phabricator.wikimedia.org/T326685)
[14:48:14] <wikibugs>	 (PS1) RobH: addming new skus [software] - https://gerrit.wikimedia.org/r/936746
[14:48:21] <wikibugs>	 (CR) CI reject: [V: -1] addming new skus [software] - https://gerrit.wikimedia.org/r/936746 (owner: RobH)
[14:48:38] <icinga-wm>	 RECOVERY - puppet last run on logstash1025 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[14:48:48] <wikibugs>	 (PS2) RobH: addming new skus [software] - https://gerrit.wikimedia.org/r/936746
[14:48:59] <jinxer-wm>	 (PuppetDisabled) resolved: Puppet disabled on logstash1025:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=logstash&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[14:49:16] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns1006.wikimedia.org with OS bullseye
[14:49:27] <wikibugs>	 SRE, Traffic, Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns1006.wikimedia.org with OS bullseye
[14:49:31] <wikibugs>	 (CR) RobH: [C: +2] updating R450 skus [software] - https://gerrit.wikimedia.org/r/936313 (owner: RobH)
[14:49:42] <wikibugs>	 (PS3) RobH: addming new skus [software] - https://gerrit.wikimedia.org/r/936746
[14:51:35] <logmsgbot>	 !log tchin@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[14:51:38] <wikibugs>	 (CR) RobH: [C: +2] addming new skus [software] - https://gerrit.wikimedia.org/r/936746 (owner: RobH)
[14:51:42] <logmsgbot>	 !log tchin@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[14:52:07] <wikibugs>	 (PS1) Andrew Bogott: radosgw: set per-user (aka per-project in swift) quotas. [puppet] - https://gerrit.wikimedia.org/r/936747 (https://phabricator.wikimedia.org/T338937)
[14:53:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:54:36] <wikibugs>	 (CR) David Caro: "LGTM, we can re-adjust quotas later if needed" [puppet] - https://gerrit.wikimedia.org/r/936747 (https://phabricator.wikimedia.org/T338937) (owner: Andrew Bogott)
[14:55:33] <wikibugs>	 (PS2) Andrew Bogott: radosgw: set per-user (aka per-project in swift) quotas. [puppet] - https://gerrit.wikimedia.org/r/936747 (https://phabricator.wikimedia.org/T338937)
[14:55:46] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] envoy: Limit the total number of active connections [puppet] - https://gerrit.wikimedia.org/r/935711 (https://phabricator.wikimedia.org/T340955) (owner: JMeybohm)
[14:56:00] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] envoy: Remove tls_minimum_protocol_version [puppet] - https://gerrit.wikimedia.org/r/935683 (https://phabricator.wikimedia.org/T337453) (owner: JMeybohm)
[14:56:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6002.drmrs.wmnet
[14:57:32] <logmsgbot>	 !log tchin@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[14:57:36] <logmsgbot>	 !log tchin@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[14:58:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:59:03] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Infrastructure-Foundations, and 2 others: Management LAN in eqsin offline due to failure of mr1-eqsin - https://phabricator.wikimedia.org/T341447 (cmooney) p:High→Medium Device remains healthy after over an hour.  In terms of what caused the initial problem the log...
[14:59:25] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet (Puppet 7.0): expose_puppet_certs:  Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (jbond)
[15:00:00] <moritzm>	 !log rebalance ganeti group eqiad/A after reboots
[15:00:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:35] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns1006.wikimedia.org with reason: host reimage
[15:02:47] <wikibugs>	 (PS1) Jsn.sherman: log additional events on Special:Diff|MobileDiff [mediawiki-config] - https://gerrit.wikimedia.org/r/936748 (https://phabricator.wikimedia.org/T326212)
[15:04:50] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns1006.wikimedia.org with reason: host reimage
[15:05:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6002.drmrs.wmnet
[15:07:16] <icinga-wm>	 PROBLEM - SSH on bast6002 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:07:28] <sukhe>	 ^ vgutierrez what you were talking about
[15:08:18] <icinga-wm>	 RECOVERY - SSH on bast6002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:11:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6002.drmrs.wmnet
[15:11:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6002.drmrs.wmnet
[15:15:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6001.drmrs.wmnet
[15:16:00] <wikibugs>	 (CR) Andrew Bogott: radosgw: set per-user (aka per-project in swift) quotas. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/936747 (https://phabricator.wikimedia.org/T338937) (owner: Andrew Bogott)
[15:16:19] <wikibugs>	 (CR) Andrew Bogott: [C: +2] radosgw: set per-user (aka per-project in swift) quotas. [puppet] - https://gerrit.wikimedia.org/r/936747 (https://phabricator.wikimedia.org/T338937) (owner: Andrew Bogott)
[15:16:48] <taavi>	 jouncebot: nowandnext
[15:16:48] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 13 minute(s)
[15:16:48] <jouncebot>	 In 0 hour(s) and 13 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230710T1530)
[15:17:25] <wikibugs>	 SRE-swift-storage, Observability-Metrics: Split Thanos components from thanos-fe hosts - https://phabricator.wikimedia.org/T341488 (fgiunchedi)
[15:18:51] <wikibugs>	 (PS1) Majavah: wikitech: Update codfw1dev LDAP server hostname [mediawiki-config] - https://gerrit.wikimedia.org/r/936751
[15:18:53] <wikibugs>	 (PS1) Majavah: Disable UrlShortener on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/936752 (https://phabricator.wikimedia.org/T341470)
[15:19:31] <wikibugs>	 (PS1) Btullis: Configure karapace1001 to use the kafka-jumbo cluster [puppet] - https://gerrit.wikimedia.org/r/936753 (https://phabricator.wikimedia.org/T329514)
[15:19:50] <wikibugs>	 (CR) Vgutierrez: hiera: add silent-drop directives for http frontend (1 comment) [puppet] - https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: Fabfur)
[15:19:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6001.drmrs.wmnet
[15:20:47] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet (Puppet 7.0): expose_puppet_certs:  Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (jbond)
[15:21:16] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet (Puppet 7.0): expose_puppet_certs:  Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (jbond) I have marked of debmonitor as pki is used in production.
[15:21:27] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42376/console" [puppet] - https://gerrit.wikimedia.org/r/936753 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[15:23:41] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[15:23:47] <wikibugs>	 (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/936754 (https://phabricator.wikimedia.org/T128546)
[15:23:49] <wikibugs>	 (CR) Btullis: [V: +1 C: +2] Configure karapace1001 to use the kafka-jumbo cluster [puppet] - https://gerrit.wikimedia.org/r/936753 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[15:25:13] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[15:25:14] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns1006.wikimedia.org with OS bullseye
[15:25:19] <wikibugs>	 SRE, Traffic: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns1006.wikimedia.org with OS bullseye completed: - dns1006 (**PASS**)   - Removed from Puppet and PuppetDB if present...
[15:26:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6001.drmrs.wmnet
[15:26:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6001.drmrs.wmnet
[15:27:34] <wikibugs>	 (PS1) Btullis: Fix error in the motd definition for the karapace hosts [puppet] - https://gerrit.wikimedia.org/r/936755
[15:28:45] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42377/console" [puppet] - https://gerrit.wikimedia.org/r/936755 (owner: Btullis)
[15:29:40] <icinga-wm>	 PROBLEM - Check systemd state on kafkamon1003 is CRITICAL: CRITICAL - degraded: The following units failed: burrow-jumbo-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:30:05] <jouncebot>	 jan_drewniak: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230710T1530).
[15:30:29] <sukhe>	 !log homer "cr*-eqiad*" commit "Gerrit: 936720 add new DNS host dns1006"
[15:30:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:03] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42378/console" [puppet] - 10https://gerrit.wikimedia.org/r/936755 (owner: 10Btullis)
[15:31:22] <wikibugs>	 (03Abandoned) 10Btullis: karapace: switch karapace to use kafka-jumbo1001 [puppet] - 10https://gerrit.wikimedia.org/r/787112 (https://phabricator.wikimedia.org/T301562) (owner: 10Razzi)
[15:32:22] <wikibugs>	 (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936754 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[15:32:46] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Fix error in the motd definition for the karapace hosts [puppet] - 10https://gerrit.wikimedia.org/r/936755 (owner: 10Btullis)
[15:32:56] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics: Split Thanos components from thanos-fe hosts - https://phabricator.wikimedia.org/T341488 (10fgiunchedi) @MatthewVernon @Eevans please let me know what you think of the above proposal. I was imagining the final state to be `thanos-fe` / `thanos-be` running only Swif...
[15:33:35] <wikibugs>	 (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936754 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[15:40:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:41:48] <wikibugs>	 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): rename cloudswift1001 as cloudlb1001 - https://phabricator.wikimedia.org/T341200 (10aborrero)
[15:41:57] <wikibugs>	 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): rename cloudswift1001 as cloudlb1001 - https://phabricator.wikimedia.org/T341200 (10aborrero) 05In progress→03Resolved
[15:42:29] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10User-fgiunchedi: Split Thanos components from thanos-fe hosts - https://phabricator.wikimedia.org/T341488 (10fgiunchedi)
[15:45:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:46:49] <logmsgbot>	 !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:936654| Bumping portals to master (T128546)]] (duration: 06m 31s)
[15:46:54] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[15:50:22] <wikibugs>	 (03PS1) 10Fabfur: hiera: removed dns1002 and dns1003 [puppet] - 10https://gerrit.wikimedia.org/r/936756 (https://phabricator.wikimedia.org/T326685)
[15:51:08] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:53:20] <logmsgbot>	 !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:936654| Bumping portals to master (T128546)]] (duration: 06m 30s)
[15:53:23] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[15:54:34] <wikibugs>	 (03CR) 10Clare Ming: [C: 03+1] "lgtm - thanks for all your work on this \o/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936748 (https://phabricator.wikimedia.org/T326212) (owner: 10Jsn.sherman)
[15:54:50] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] "Let's hold on merging this till we have moved ns0." [puppet] - 10https://gerrit.wikimedia.org/r/936756 (https://phabricator.wikimedia.org/T326685) (owner: 10Fabfur)
[15:55:10] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] "(LGTM otherwise!)" [puppet] - 10https://gerrit.wikimedia.org/r/936756 (https://phabricator.wikimedia.org/T326685) (owner: 10Fabfur)
[15:56:08] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:57:54] <wikibugs>	 (03PS2) 10Majavah: wikitech: Update codfw1dev LDAP server hostname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936751
[15:57:59] <wikibugs>	 (03PS2) 10Majavah: Disable UrlShortener on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936752 (https://phabricator.wikimedia.org/T341470)
[15:58:03] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936751 (owner: 10Majavah)
[15:58:05] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936752 (https://phabricator.wikimedia.org/T341470) (owner: 10Majavah)
[15:58:55] <wikibugs>	 (03Merged) 10jenkins-bot: wikitech: Update codfw1dev LDAP server hostname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936751 (owner: 10Majavah)
[15:59:02] <wikibugs>	 (03Merged) 10jenkins-bot: Disable UrlShortener on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936752 (https://phabricator.wikimedia.org/T341470) (owner: 10Majavah)
[15:59:17] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:936751|wikitech: Update codfw1dev LDAP server hostname]], [[gerrit:936752|Disable UrlShortener on wikitech (T341470)]]
[15:59:21] <stashbot>	 T341470: UrlShortener throws DBConnectionError exception on wikitech - https://phabricator.wikimedia.org/T341470
[16:00:28] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] k8s::proxy: Start kube-proxy after ferm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/915461 (owner: 10Clément Goubert)
[16:00:46] <logmsgbot>	 !log taavi@deploy1002 taavi: Backport for [[gerrit:936751|wikitech: Update codfw1dev LDAP server hostname]], [[gerrit:936752|Disable UrlShortener on wikitech (T341470)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[16:01:23] <wikibugs>	 (03CR) 10Clare Ming: [C: 03+1] log additional events on Special:Diff|MobileDiff (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936748 (https://phabricator.wikimedia.org/T326212) (owner: 10Jsn.sherman)
[16:02:26] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: BFD flapping from cloudsw1-c8-eqiad (QFX5100) - https://phabricator.wikimedia.org/T341466 (10cmooney) 05Open→03Resolved Session to cloudlb1001 is stable after over an hour so think this is good to close now with the fix of using longer timers ` cmooney@cloud...
[16:03:52] <wikibugs>	 (03PS1) 10Fabfur: dns: remove dns1002 and 1003 [homer/public] - 10https://gerrit.wikimedia.org/r/936757 (https://phabricator.wikimedia.org/T326685)
[16:05:20] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 04-2] "this is currently not working" [puppet] - 10https://gerrit.wikimedia.org/r/936281 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond)
[16:05:41] <wikibugs>	 (03PS10) 10JMeybohm: k8s::apiserver: Implement kube-apiserver reload [puppet] - 10https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826)
[16:07:01] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10bd808) Is there any particular reason that the "[ ] Wikitech is ideal to dogfood mw-on-k8s, there are challenges though that we need to over come T292707" step w...
[16:07:05] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:936751|wikitech: Update codfw1dev LDAP server hostname]], [[gerrit:936752|Disable UrlShortener on wikitech (T341470)]] (duration: 07m 47s)
[16:07:09] <stashbot>	 T341470: UrlShortener throws DBConnectionError exception on wikitech - https://phabricator.wikimedia.org/T341470
[16:07:36] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42379/console" [puppet] - 10https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm)
[16:09:31] <wikibugs>	 (03PS11) 10JMeybohm: k8s::apiserver: Implement kube-apiserver reload [puppet] - 10https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826)
[16:11:12] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42380/console" [puppet] - 10https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm)
[16:13:40] <icinga-wm>	 PROBLEM - SSH on bast6002 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:15:12] <icinga-wm>	 RECOVERY - SSH on bast6002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:15:40] <wikibugs>	 (03CR) 10Nskaggs: "Thank you for setting larger quotas. +1 to encouraging people to migrate with a better offering, and part of that is a bigger quota." [puppet] - 10https://gerrit.wikimedia.org/r/936747 (https://phabricator.wikimedia.org/T338937) (owner: 10Andrew Bogott)
[16:15:42] <icinga-wm>	 RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:17] <wikibugs>	 (03PS1) 10Jbond: rsyslog::receiver: update docs and add types [puppet] - 10https://gerrit.wikimedia.org/r/936762 (https://phabricator.wikimedia.org/T340741)
[16:19:20] <wikibugs>	 (03PS1) 10Jbond: rsyslog: update to use pki certificates [puppet] - 10https://gerrit.wikimedia.org/r/936763 (https://phabricator.wikimedia.org/T340741)
[16:21:50] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] rsyslog: update to use pki certificates [puppet] - 10https://gerrit.wikimedia.org/r/936763 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond)
[16:23:57] <wikibugs>	 (03PS1) 10Hnowlan: api-gateway: emit no-cache unless otherwise asked [deployment-charts] - 10https://gerrit.wikimedia.org/r/936765 (https://phabricator.wikimedia.org/T338916)
[16:25:39] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [airflow-dags/search@8fa416b]: T328276: Change articletopic source to the outlink model
[16:25:42] <stashbot>	 T328276: Add outlink topic model predictions to CirrusSearch indices - https://phabricator.wikimedia.org/T328276
[16:25:59] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [airflow-dags/search@8fa416b]: T328276: Change articletopic source to the outlink model (duration: 00m 20s)
[16:26:40] <sukhe>	 !log ns0: set routing-options static route 208.80.154.238/32 next-hop [ 208.80.154.6 208.80.154.153 208.80.154.77 ]
[16:26:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:29:16] <wikibugs>	 (03PS2) 10Hnowlan: api-gateway: emit no-cache unless otherwise asked [deployment-charts] - 10https://gerrit.wikimedia.org/r/936765 (https://phabricator.wikimedia.org/T338916)
[16:31:15] <wikibugs>	 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero)
[16:31:26] <wikibugs>	 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero) p:05Triage→03Medium
[16:35:05] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (FY2022/2023-Q4): Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756 (10Andrew) I'm fine with making things more verbose for now, then we can trim out things that...
[16:39:55] <wikibugs>	 10SRE, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10Jgiannelos) 05Resolved→03Open
[16:41:57] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] dns: remove dns1002 and 1003 [homer/public] - 10https://gerrit.wikimedia.org/r/936757 (https://phabricator.wikimedia.org/T326685) (owner: 10Fabfur)
[16:42:27] <wikibugs>	 10SRE, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10Jgiannelos) From comms with wikiwand:  It seems User-Agent and Api-User-Agent (for client-side requests) are ignored, can you p...
[16:42:40] <wikibugs>	 (03PS1) 10Krinkle: webperf: Set XHGUI_PDO_INITSCHEMA=0 to avoid 'CREATE TABLE' fatal [puppet] - 10https://gerrit.wikimedia.org/r/936767
[16:44:10] <sukhe>	 !log homer "cr*-eqiad*" commit "Gerrit: 936757 remove DNS hosts dns1002 and dns1003"
[16:44:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:45:56] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): spicerack: update spicrack to work with the newer puppet infrastructre - https://phabricator.wikimedia.org/T341496 (10jbond)
[16:47:41] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): spicerack: update spicerack to work with the newer puppet infrastructure - https://phabricator.wikimedia.org/T341496 (10Volans) p:05Triage→03Medium
[16:47:47] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): Cumin: update config to use new puppet7 infrastructre - https://phabricator.wikimedia.org/T341497 (10jbond)
[16:48:05] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): Cumin: update config to use new puppet7 infrastructre - https://phabricator.wikimedia.org/T341497 (10jbond) p:05Triage→03Medium
[16:49:54] <wikibugs>	 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero)
[16:50:11] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] hiera: removed dns1002 and dns1003 [puppet] - 10https://gerrit.wikimedia.org/r/936756 (https://phabricator.wikimedia.org/T326685) (owner: 10Fabfur)
[16:50:50] <wikibugs>	 (03CR) 10Jsn.sherman: log additional events on Special:Diff|MobileDiff (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936748 (https://phabricator.wikimedia.org/T326212) (owner: 10Jsn.sherman)
[16:52:04] <wikibugs>	 10SRE, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10akosiaris) `Wikiwand/0.1 (https://www.wikiwand.com; admin@wikiwand.com)` added to the list of user-agents. Please advise if it...
[16:52:53] <sukhe>	 !log rolling restart of ntp.service on A:dns-rec
[16:52:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230710T1700)
[17:00:05] <jouncebot>	 ryankemper: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230710T1700).
[17:11:07] <wikibugs>	 (03PS2) 10Krinkle: webperf: Set XHGUI_PDO_INITSCHEMA=0 to avoid 'CREATE TABLE' fatal [puppet] - 10https://gerrit.wikimedia.org/r/936767 (https://phabricator.wikimedia.org/T341499)
[17:11:26] <wikibugs>	 (03PS3) 10Krinkle: webperf: Set XHGUI_PDO_INITSCHEMA=0 to avoid 'CREATE TABLE' fatal [puppet] - 10https://gerrit.wikimedia.org/r/936767 (https://phabricator.wikimedia.org/T341499)
[17:11:33] <wikibugs>	 (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/936767 (https://phabricator.wikimedia.org/T341499) (owner: 10Krinkle)
[17:14:24] <wikibugs>	 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T341503 (10phaultfinder)
[17:15:00] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/936329/42383/planet1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/936329 (owner: 10Dzahn)
[17:15:28] <wikibugs>	 (03Abandoned) 10Dzahn: planet: remove buster support [puppet] - 10https://gerrit.wikimedia.org/r/936331 (owner: 10Dzahn)
[17:16:08] <icinga-wm>	 PROBLEM - Check systemd state on doc2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-host-data-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:18:48] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] miscweb: add statictendril release to miscweb staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/930886 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn)
[17:19:36] <wikibugs>	 (03Merged) 10jenkins-bot: miscweb: add statictendril release to miscweb staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/930886 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn)
[17:21:14] <wikibugs>	 10SRE, 10Add-Link, 10GrowthExperiments-NewcomerTasks, 10serviceops, 10Growth-Team (Current Sprint): linkrecommendation kubernetes service is down with HTTP 504: "upstream request timeout" - https://phabricator.wikimedia.org/T340780 (10Urbanecm_WMF)
[17:24:49] <logmsgbot>	 !log dzahn@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply
[17:26:22] <logmsgbot>	 !log dzahn@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[17:32:10] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Goal, 10cloud-services-team (FY2022/2023-Q4): Improve how we run WMCS cookbooks - https://phabricator.wikimedia.org/T319401 (10fnegri)
[17:32:17] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Goal, 10cloud-services-team (FY2022/2023-Q4): Improve how we run WMCS cookbooks - https://phabricator.wikimedia.org/T319401 (10fnegri)
[17:33:26] <wikibugs>	 (03CR) 10Dzahn: "curl against staging cluster looks good: https://phabricator.wikimedia.org/T340182#9002597" [deployment-charts] - 10https://gerrit.wikimedia.org/r/930886 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn)
[17:33:41] <wikibugs>	 (03PS1) 10Jbond: puppet: drop PuppetHosts.get_ca_servers [software/spicerack] - 10https://gerrit.wikimedia.org/r/936774 (https://phabricator.wikimedia.org/T341496)
[17:47:08] <wikibugs>	 (03PS1) 10Dzahn: miscweb: add statictendril to eqiad and codfw k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/936775 (https://phabricator.wikimedia.org/T340182)
[17:47:46] <wikibugs>	 (03PS1) 10Ssingh: common.yaml: remove dns1002 and dns1003 from ntp_peers [homer/public] - 10https://gerrit.wikimedia.org/r/936776 (https://phabricator.wikimedia.org/T326685)
[17:48:59] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] common.yaml: remove dns1002 and dns1003 from ntp_peers [homer/public] - 10https://gerrit.wikimedia.org/r/936776 (https://phabricator.wikimedia.org/T326685) (owner: 10Ssingh)
[17:51:35] <sukhe>	 !log homer "mr*" commit "update ntp_servers (remove dns100[2-3], add dns100[5-6])"
[17:51:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:54:12] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] miscweb: add statictendril to eqiad and codfw k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/936775 (https://phabricator.wikimedia.org/T340182) (owner: 10Dzahn)
[17:55:13] <wikibugs>	 (03Merged) 10jenkins-bot: miscweb: add statictendril to eqiad and codfw k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/936775 (https://phabricator.wikimedia.org/T340182) (owner: 10Dzahn)
[17:55:40] <logmsgbot>	 !log dzahn@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[17:59:30] <wikibugs>	 (03PS2) 10Ladsgroup: sre.mysql.clone: Only encrypt data transfers between DCs [cookbooks] - 10https://gerrit.wikimedia.org/r/936287
[18:00:33] <wikibugs>	 (03CR) 10Ladsgroup: sre.mysql.clone: Only encrypt data transfers between DCs (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/936287 (owner: 10Ladsgroup)
[18:02:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.mysql.clone: Only encrypt data transfers between DCs [cookbooks] - 10https://gerrit.wikimedia.org/r/936287 (owner: 10Ladsgroup)
[18:03:13] <logmsgbot>	 !log dzahn@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[18:06:54] <wikibugs>	 (03CR) 10Michael Große: Beta-Wikidata: Always show mul on desktop Termbox (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935770 (https://phabricator.wikimedia.org/T339104) (owner: 10Michael Große)
[18:10:20] <wikibugs>	 (03PS3) 10Ladsgroup: sre.mysql.clone: Only encrypt data transfers between DCs [cookbooks] - 10https://gerrit.wikimedia.org/r/936287
[18:13:40] <icinga-wm>	 RECOVERY - Check systemd state on doc2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:14:04] <wikibugs>	 10ops-eqiad, 10Traffic: eqiad dns100[1-3] unified decommission task - https://phabricator.wikimedia.org/T341507 (10ssingh)
[18:14:30] <wikibugs>	 (03PS1) 10Jbond: puppet: Add versions method which will return the version of the agnts [software/spicerack] - 10https://gerrit.wikimedia.org/r/936781 (https://phabricator.wikimedia.org/T341496)
[18:14:32] <wikibugs>	 (03PS1) 10Jbond: WIP:puppet: Add support for puppetserver v7 [software/spicerack] - 10https://gerrit.wikimedia.org/r/936782
[18:17:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WIP:puppet: Add support for puppetserver v7 [software/spicerack] - 10https://gerrit.wikimedia.org/r/936782 (owner: 10Jbond)
[18:18:21] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:18:23] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppet: Add versions method which will return the version of the agnts [software/spicerack] - 10https://gerrit.wikimedia.org/r/936781 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond)
[18:25:44] <wikibugs>	 (03CR) 10Fabfur: hiera: add silent-drop directives for http frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur)
[18:26:14] <logmsgbot>	 !log dzahn@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[18:29:06] <logmsgbot>	 !log dzahn@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[18:31:32] <logmsgbot>	 !log dzahn@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[18:32:38] <logmsgbot>	 !log dzahn@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[18:35:49] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: Add --depool-sleep runtime argument when using SRELBBatchRunner class - https://phabricator.wikimedia.org/T339151 (10BCornwall) 05Open→03Stalled a:03BBlack This was under the request of @BBlack - I believe the intention was that this would be "good enough" for t...
[18:37:38] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.decommission for hosts dbproxy1012.eqiad.wmnet
[18:38:49] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts dns[1002-1003].wikimedia.org
[18:40:24] <wikibugs>	 (03PS1) 10Dzahn: trafficserver: switch dbtree/tendril to k8s backend [puppet] - 10https://gerrit.wikimedia.org/r/936785 (https://phabricator.wikimedia.org/T340182)
[18:40:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] trafficserver: switch dbtree/tendril to k8s backend [puppet] - 10https://gerrit.wikimedia.org/r/936785 (https://phabricator.wikimedia.org/T340182) (owner: 10Dzahn)
[18:41:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[18:42:56] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2103 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/935885 (https://phabricator.wikimedia.org/T341511)
[18:43:39] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.dns.netbox
[18:44:47] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 37 hosts with reason: Primary switchover s1 T341511
[18:44:50] <stashbot>	 T341511: Switchover s1 master (db2112 -> db2103) - https://phabricator.wikimedia.org/T341511
[18:45:12] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 37 hosts with reason: Primary switchover s1 T341511
[18:45:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db2103 with weight 0 T341511', diff saved to https://phabricator.wikimedia.org/P49535 and previous config saved to /var/cache/conftool/dbconfig/20230710-184521-ladsgroup.json
[18:46:12] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[18:46:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[18:46:20] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: eqiad dns100[1-3] unified decommission task - https://phabricator.wikimedia.org/T341507 (10ssingh)
[18:47:06] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: eqiad dns100[1-3] unified decommission task - https://phabricator.wikimedia.org/T341507 (10ssingh) The hosts have been decommissioned and are ready for the hardware part.
[18:47:54] <sukhe>	 Amir1: ok to remove dbproxy entries?
[18:47:57] <sukhe>	 -134 1H IN PTR dbproxy1012.eqiad.wmnet.
[18:48:20] <sukhe>	 14:37:38 <+logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.decommission for hosts dbproxy1012.eqiad.wmnet
[18:48:23] <sukhe>	 DNS changes from here
[18:48:52] <mutante>	 18:39 < Amir1> Hey, I'm decommissioning dbproxy10[12-17] and they are mentioned in two helm charts: 
[18:49:16] <sukhe>	 yeah should be fine, plus the cookbook already ran by now!
[18:49:17] <sukhe>	 thanks
[18:49:53] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns[1002-1003].wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[18:49:54] <logmsgbot>	 !log ladsgroup@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[18:49:55] <logmsgbot>	 !log ladsgroup@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts dbproxy1012.eqiad.wmnet
[18:49:58] <mutante>	 unless something is unhappy when names don't resolve at all.. vs host being just unreachable
[18:50:03] <sukhe>	 uh oh
[18:50:18] <sukhe>	 which I guess is expected
[18:50:25] <sukhe>	 the uh oh was for the failure above :)
[18:50:27] <Amir1>	 sorry I missed this
[18:50:31] <sukhe>	 np, resolved
[18:50:46] <sukhe>	 I am going to remove the rest of the dbproxy stuff too :>
[18:50:54] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns[1002-1003].wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[18:50:54] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:50:55] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dns[1002-1003].wikimedia.org
[18:51:00] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.decommission for hosts dbproxy1012.eqiad.wmnet
[18:51:03] <wikibugs>	 10SRE, 10Traffic: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `dns[1002-1003].wikimedia.org` - dns1002.wikimedia.org (**WARN**)   - Downtimed host on Icinga/Alertmanager   - Found ph...
[18:51:38] <sukhe>	 Amir1: did I break your cookbook?
[18:51:41] <sukhe>	 sorry if I did
[18:51:44] <sukhe>	 what was the error you got?
[18:54:41] <wikibugs>	 10SRE, 10Traffic: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (10ssingh)
[18:55:08] <wikibugs>	 10SRE, 10Traffic: Reduce toil in provisioning and decommissioning of DNS/NTP servers by automating generation of resolv.conf and NTP peers - https://phabricator.wikimedia.org/T340479 (10ssingh)
[18:55:29] <wikibugs>	 10SRE, 10Traffic: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (10ssingh) 05In progress→03Resolved Traffic has commissioned these boxes. Many thanks to dc-ops!
[18:55:49] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.dns.netbox
[18:56:04] <sukhe>	 !log finished commissioning new DNS hosts in eqiad: dns100[4-6]. decommissioned dns100[1-3].
[18:56:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:57:01] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:57:01] <wikibugs>	 (03PS1) 10Ssingh: templates: dummy commit to test new DNS boxes [dns] - 10https://gerrit.wikimedia.org/r/936787
[18:57:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts dbproxy1012.eqiad.wmnet
[18:58:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[18:59:30] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] templates: dummy commit to test new DNS boxes [dns] - 10https://gerrit.wikimedia.org/r/936787 (owner: 10Ssingh)
[18:59:42] <sukhe>	 !log running authdns-update 
[18:59:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:03:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[19:09:28] <wikibugs>	 (PS2) Dzahn: trafficserver: switch dbtree/tendril to k8s backend [puppet] - https://gerrit.wikimedia.org/r/936785 (https://phabricator.wikimedia.org/T340182)
[19:10:13] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] mariadb: Promote db2103 to s1 master [puppet] - https://gerrit.wikimedia.org/r/935885 (https://phabricator.wikimedia.org/T341511) (owner: Gerrit maintenance bot)
[19:11:00] <wikibugs>	 (CR) Dzahn: [C: +2] trafficserver: switch dbtree/tendril to k8s backend [puppet] - https://gerrit.wikimedia.org/r/936785 (https://phabricator.wikimedia.org/T340182) (owner: Dzahn)
[19:12:06] <Amir1>	 !log Starting s1 codfw failover from db2112 to db2103 - T341511
[19:12:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:12:10] <stashbot>	 T341511: Switchover s1 master (db2112 -> db2103) - https://phabricator.wikimedia.org/T341511
[19:13:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db2103 to s1 primary T341511', diff saved to https://phabricator.wikimedia.org/P49536 and previous config saved to /var/cache/conftool/dbconfig/20230710-191259-ladsgroup.json
[19:15:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db2112 T341511', diff saved to https://phabricator.wikimedia.org/P49537 and previous config saved to /var/cache/conftool/dbconfig/20230710-191511-ladsgroup.json
[19:17:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: Maintenance
[19:17:24] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: Maintenance
[19:18:34] <wikibugs>	 (PS1) Ladsgroup: ExternalLinks: Make oneWildcard avoid adding wildcard to domain [core] (wmf/1.41.0-wmf.16) - https://gerrit.wikimedia.org/r/936733 (https://phabricator.wikimedia.org/T326251)
[19:20:41] <wikibugs>	 ops-eqiad, DBA, decommission-hardware: decommission dbproxy1012.eqiad.wmnet - https://phabricator.wikimedia.org/T341510 (Ladsgroup) a:Ladsgroup→wiki_willy
[19:21:05] <wikibugs>	 ops-eqiad, DBA, decommission-hardware: decommission dbproxy1012.eqiad.wmnet - https://phabricator.wikimedia.org/T341510 (Ladsgroup) The cookbook was a bit messy but it should be done now
[19:21:13] <wikibugs>	 (Abandoned) Dzahn: miscweb: add release statictendril to eqiad and codfw [deployment-charts] - https://gerrit.wikimedia.org/r/930887 (https://phabricator.wikimedia.org/T300171) (owner: Dzahn)
[19:21:57] <wikibugs>	 (PS2) Dzahn: miscweb: remove static_tendril classes and files [puppet] - https://gerrit.wikimedia.org/r/932337 (https://phabricator.wikimedia.org/T300171)
[19:23:47] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2112.codfw.wmnet with reason: Maintenance
[19:23:47] <wikibugs>	 ops-eqiad, DBA, decommission-hardware: decommission dbproxy1012.eqiad.wmnet - https://phabricator.wikimedia.org/T341510 (wiki_willy) a:wiki_willy→Jclark-ctr
[19:23:49] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2112.codfw.wmnet with reason: Maintenance
[19:33:56] <wikibugs>	 (CR) Hashar: [C: +1] ci/zuul: set contint2002 as the active ci::manager_host [puppet] - https://gerrit.wikimedia.org/r/935919 (https://phabricator.wikimedia.org/T324659) (owner: Jelto)
[19:35:20] <wikibugs>	 (PS2) Samtar: log additional events on Special:Diff|MobileDiff [mediawiki-config] - https://gerrit.wikimedia.org/r/936748 (https://phabricator.wikimedia.org/T326212) (owner: Jsn.sherman)
[19:40:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2112 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P49538 and previous config saved to /var/cache/conftool/dbconfig/20230710-194022-ladsgroup.json
[19:48:30] <jinxer-wm>	 (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[19:52:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[19:52:47] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[19:55:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2112 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P49540 and previous config saved to /var/cache/conftool/dbconfig/20230710-195527-ladsgroup.json
[19:59:57] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1124.eqiad.wmnet with reason: Reboot
[20:00:06] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: My dear minions, it's time we take the moon! Just kidding. Time for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230710T2000).
[20:00:06] <jouncebot>	 JSherman: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:10] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1124.eqiad.wmnet with reason: Reboot
[20:00:19] * TheresNoTime can deploy
[20:00:34] <wikibugs>	 SRE-swift-storage, Observability-Metrics, User-fgiunchedi: Split Thanos components from thanos-fe hosts - https://phabricator.wikimedia.org/T341488 (Eevans) >>! In T341488#9001995, @fgiunchedi wrote: > @MatthewVernon @Eevans please let me know what you think of the above proposal. I was imagining the...
[20:01:11] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/936748 (https://phabricator.wikimedia.org/T326212) (owner: Jsn.sherman)
[20:01:57] <wikibugs>	 (Merged) jenkins-bot: log additional events on Special:Diff|MobileDiff [mediawiki-config] - https://gerrit.wikimedia.org/r/936748 (https://phabricator.wikimedia.org/T326212) (owner: Jsn.sherman)
[20:02:13] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:936748|log additional events on Special:Diff|MobileDiff (T326212)]]
[20:02:18] <stashbot>	 T326212: Improve data logging on Special:Diff and Special:MobileDiff - https://phabricator.wikimedia.org/T326212
[20:03:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[20:03:35] <logmsgbot>	 !log samtar@deploy1002 samtar and jsn: Backport for [[gerrit:936748|log additional events on Special:Diff|MobileDiff (T326212)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[20:03:47] <TheresNoTime>	 JSherman: can you test this change on mwdebug?
[20:03:59] <JSherman>	 wilco
[20:08:38] <JSherman>	 TheresNoTime: So I'm navigating diffs with the debug extension, and then checking https://stream.wikimedia.org/v1/stream/mediawiki.special_diff_interactions but I'm not seeing anything. Maybe I don't know how to access production events?
[20:09:52] <TheresNoTime>	 I'm not seeing any events on https://logstash.wikimedia.org/app/dashboards#/view/mwdebug1002?_g=h@8daf61d&_a=h@7f0701a, are you sure you're using a mwdebug server via https://wikitech.wikimedia.org/wiki/WikimediaDebug ?
[20:09:54] <JSherman>	 oh, helps to use the right url: https://stream.wikimedia.org/v2/stream/mediawiki.special_diff_interactions but I'm getting stream not found
[20:10:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2112 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P49541 and previous config saved to /var/cache/conftool/dbconfig/20230710-201031-ladsgroup.json
[20:11:23] <RhinosF1>	 JSherman: how does stream pick it up?
[20:11:43] <RhinosF1>	 I'm sure this isn't the first time new events haven't been noticed from debug
[20:12:21] <wikibugs>	 (CR) Dzahn: [C: +2] miscweb: remove static_tendril classes and files [puppet] - https://gerrit.wikimedia.org/r/932337 (https://phabricator.wikimedia.org/T300171) (owner: Dzahn)
[20:13:32] <wikibugs>	 (PS4) Krinkle: webperf: Set XHGUI_PDO_INITSCHEMA=0 to avoid 'CREATE TABLE' fatal [puppet] - https://gerrit.wikimedia.org/r/936767
[20:13:52] <wikibugs>	 (PS5) Krinkle: webperf: Set XHGUI_PDO_INITSCHEMA=false to avoid 'CREATE TABLE' fatal [puppet] - https://gerrit.wikimedia.org/r/936767
[20:14:00] <JSherman>	 RhinosF1: that is a good question that I don't know the answer to. I'm realizing that I may be coming into this too naively. When I deployed this to beta, I was just able to curl https://stream-beta.wmflabs.org/v2/stream/mediawiki.special_diff_interactions and get log events
[20:14:33] <mutante>	 !log miscweb1003/miscweb2003 - rm -rf /srv/org/wikimedia/static-tendril
[20:14:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:15:08] <RhinosF1>	 JSherman: is there anyone online or can we verify it won't break anything else if we were to sync and wait a few minutes
[20:15:20] <RhinosF1>	 Assuming TheresNoTime is comfortable
[20:16:00] <TheresNoTime>	 JSherman: I'm not seeing any obvious errors, what's the risks of syncing? I do note that https://stream-beta.wmflabs.org/?doc#/streams lists the stream, whereas https://stream.wikimedia.org/?doc#/streams does not
[20:16:46] <JSherman>	 yeah, with beta, the stream wasn't created/available until there were events in the topic
[20:17:15] <icinga-wm>	 PROBLEM - Host parse1012 is DOWN: PING CRITICAL - Packet loss = 100%
[20:17:19] <TheresNoTime>	 okay, makes sense — I'm happy to sync this and revert if needed
[20:18:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[20:18:02] <JSherman>	 TheresNoTime: I appreciate that; I'm ready to test
[20:18:11] <TheresNoTime>	 syncing
[20:18:15] <icinga-wm>	 RECOVERY - Host parse1012 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[20:19:18] <TheresNoTime>	 !log syncing https://gerrit.wikimedia.org/r/c/936748 untested (T326212) for test after sync
[20:19:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:22] <stashbot>	 T326212: Improve data logging on Special:Diff and Special:MobileDiff - https://phabricator.wikimedia.org/T326212
[20:23:53] <inflatador>	 !log bking@wdqs1006 Restart wdqs-blazegraph to hopefully clear the free allocators alerts
[20:23:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:56] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:936748|log additional events on Special:Diff|MobileDiff (T326212)]] (duration: 21m 42s)
[20:24:14] <TheresNoTime>	 JSherman: okay, please test
[20:25:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2112 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P49544 and previous config saved to /var/cache/conftool/dbconfig/20230710-202536-ladsgroup.json
[20:26:10] <JSherman>	 TheresNoTime: hmm, still not seeing anything, though it's my understanding that there can be some lag
[20:26:41] <TheresNoTime>	 okay, I'll keep an eye for errors but let's leave it 15 minutes?
[20:27:07] <JSherman>	 sounds good; I'll be clicking around and curling in the meantime.
[20:28:25] <wikibugs>	 (PS1) Btullis: Configure datahub staging to use the new karapace instance [deployment-charts] - https://gerrit.wikimedia.org/r/936791 (https://phabricator.wikimedia.org/T329514)
[20:31:40] <JSherman>	 TheresNoTime: It looks like my instrument isn't posting, though I can see readers instrument is posting just fine.
[20:31:59] <TheresNoTime>	 JSherman: hm, would you like to revert?
[20:32:25] <JSherman>	 yeah, let's do that; I'll go back and try to sort out why that's happening.
[20:32:51] <wikibugs>	 (PS1) Samtar: Revert "log additional events on Special:Diff|MobileDiff" [mediawiki-config] - https://gerrit.wikimedia.org/r/936735
[20:33:01] <wikibugs>	 (CR) Btullis: [C: +2] Configure datahub staging to use the new karapace instance [deployment-charts] - https://gerrit.wikimedia.org/r/936791 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[20:33:41] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/936735 (owner: Samtar)
[20:33:45] <wikibugs>	 (Merged) jenkins-bot: Configure datahub staging to use the new karapace instance [deployment-charts] - https://gerrit.wikimedia.org/r/936791 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[20:34:28] <wikibugs>	 (Merged) jenkins-bot: Revert "log additional events on Special:Diff|MobileDiff" [mediawiki-config] - https://gerrit.wikimedia.org/r/936735 (owner: Samtar)
[20:34:46] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:936735|Revert "log additional events on Special:Diff|MobileDiff"]]
[20:36:11] <logmsgbot>	 !log samtar@deploy1002 samtar: Backport for [[gerrit:936735|Revert "log additional events on Special:Diff|MobileDiff"]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[20:36:24] <TheresNoTime>	 (syncing, forgot to bypass that)
[20:37:37] <JSherman>	 TheresNoTime: thanks for your deployment & reversion efforts!
[20:37:48] <TheresNoTime>	 No worries, sorry it didn't work out! :D
[20:39:39] <wikibugs>	 (CR) BCornwall: [V: +1 C: +2] fifo_log_demux: Fix systemd unit file [puppet] - https://gerrit.wikimedia.org/r/927989 (https://phabricator.wikimedia.org/T284555) (owner: Vgutierrez)
[20:40:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:40:21] <wikibugs>	 (CR) Jforrester: [DNM][WIP] Initial configuration for Wikifunctions.org (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945) (owner: Jforrester)
[20:40:27] <wikibugs>	 (PS3) Jforrester: Follow-up ca3aa70754: Drop 30x30px Notifications icons, unused for 7 years [mediawiki-config] - https://gerrit.wikimedia.org/r/934630 (https://phabricator.wikimedia.org/T147219)
[20:40:29] <wikibugs>	 (PS5) Jforrester: Add wikifunctions.org to wgCentralNoticeContentSecurityPolicy [mediawiki-config] - https://gerrit.wikimedia.org/r/771622 (https://phabricator.wikimedia.org/T275945)
[20:40:31] <wikibugs>	 (PS5) Jforrester: [DNM] Add wikifunctions.org to prod wgLocalVirtualHosts [mediawiki-config] - https://gerrit.wikimedia.org/r/771623 (https://phabricator.wikimedia.org/T275945)
[20:40:33] <wikibugs>	 (PS4) Jforrester: [DNM][WIP] Initial configuration for Wikifunctions.org [mediawiki-config] - https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945)
[20:41:00] <RhinosF1>	 Thanks for trying TheresNoTime
[20:41:11] <TheresNoTime>	 ^^
[20:41:28] <wikibugs>	 (PS1) Btullis: Configure the test datahub jobs to use the staging schema registry [puppet] - https://gerrit.wikimedia.org/r/936792 (https://phabricator.wikimedia.org/T329514)
[20:42:10] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[20:42:13] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:936735|Revert "log additional events on Special:Diff|MobileDiff"]] (duration: 07m 27s)
[20:43:12] <TheresNoTime>	 !log close UTC late backport window
[20:43:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:45:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:46:09] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[20:53:10] <wikibugs>	 (CR) Clare Ming: "sorry i missed this -- just noticed you had to revert -- i think it's because you didn't define a sampling rate in your production stream" [mediawiki-config] - https://gerrit.wikimedia.org/r/936748 (https://phabricator.wikimedia.org/T326212) (owner: Jsn.sherman)
[20:54:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:59:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Weekly Security deployment window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230710T2100).
[21:00:30] <brett>	 oh that one is baaaad
[21:02:39] <wikibugs>	 (PS2) Jdlrobson: Logos: Fixes grantswiki and idwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/936097
[21:14:19] <wikibugs>	 (CR) BCornwall: [C: +1] hieradata: labweb: update lvs pool to reference the ssl service [puppet] - https://gerrit.wikimedia.org/r/831173 (https://phabricator.wikimedia.org/T317463) (owner: Majavah)
[21:14:29] <wikibugs>	 (CR) BCornwall: [C: +1] service: remove plaintext labweb service (I) [puppet] - https://gerrit.wikimedia.org/r/831174 (https://phabricator.wikimedia.org/T317463) (owner: Majavah)
[21:14:35] <wikibugs>	 (CR) BCornwall: [C: +1] service: remove plaintext labweb service (II) [puppet] - https://gerrit.wikimedia.org/r/831175 (https://phabricator.wikimedia.org/T317463) (owner: Majavah)
[21:15:42] <wikibugs>	 (CR) BCornwall: [C: +1] "LGTM, nit inline" [puppet] - https://gerrit.wikimedia.org/r/831176 (https://phabricator.wikimedia.org/T317463) (owner: Majavah)
[21:18:14] <wikibugs>	 SRE, Infrastructure Security: Research improvements to Pwstore process - https://phabricator.wikimedia.org/T298194 (Aklapper)
[21:22:39] <wikibugs>	 SRE, Traffic-Icebox, Patch-For-Review: Remove unused plain HTTP services from LVS - https://phabricator.wikimedia.org/T236065 (BCornwall)
[21:25:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:30:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:33:08] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:33:18] <wikibugs>	 (PS1) Btullis: Permit staging datahub to access karapace1002 [deployment-charts] - https://gerrit.wikimedia.org/r/936793 (https://phabricator.wikimedia.org/T329514)
[21:34:57] <wikibugs>	 (CR) Btullis: [C: +2] Permit staging datahub to access karapace1002 [deployment-charts] - https://gerrit.wikimedia.org/r/936793 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[21:35:33] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:35:44] <wikibugs>	 (Merged) jenkins-bot: Permit staging datahub to access karapace1002 [deployment-charts] - https://gerrit.wikimedia.org/r/936793 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[21:35:47] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:36:53] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124
[21:37:05] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.628 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:37:09] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50276 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:37:46] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 00m 52s)
[21:38:05] <wikibugs>	 SRE, ops-knams, DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (wiki_willy)
[21:38:08] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:38:19] <wikibugs>	 SRE, ops-knams, DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (wiki_willy)
[21:39:33] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[21:42:05] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[22:10:22] <wikibugs>	 (PS2) Fabfur: hiera: add silent-drop directives for http frontend [puppet] - https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983)
[22:12:06] <maryum>	 !log Deployed security patch for T340200
[22:12:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:14:00] <wikibugs>	 (CR) Fabfur: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42384/console" [puppet] - https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: Fabfur)
[22:19:19] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:21:44] <wikibugs>	 (PS3) Fabfur: hiera: add silent-drop directives for http frontend [puppet] - https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983)
[22:33:54] <wikibugs>	 (PS1) Ladsgroup: Override liftwing hostname [mediawiki-config] - https://gerrit.wikimedia.org/r/936796 (https://phabricator.wikimedia.org/T319170)
[22:34:50] <wikibugs>	 (CR) Majavah: "Shouldn't this be using a service proxy?" [mediawiki-config] - https://gerrit.wikimedia.org/r/936796 (https://phabricator.wikimedia.org/T319170) (owner: Ladsgroup)
[22:45:55] <wikibugs>	 (CR) Fabfur: hiera: add silent-drop directives for http frontend (1 comment) [puppet] - https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: Fabfur)
[22:58:14] <wikibugs>	 (PS2) Bartosz Dziewoński: Remove unneeded $wgDefaultUserOptions['visualeditor-enable'] settings [mediawiki-config] - https://gerrit.wikimedia.org/r/933998 (https://phabricator.wikimedia.org/T340696)
[23:07:16] <wikibugs>	 (PS1) BryanDavis: toolforge: Add more CORS headers to docker registry [puppet] - https://gerrit.wikimedia.org/r/936797 (https://phabricator.wikimedia.org/T232135)
[23:11:30] <Krinkle>	 !log krinkle@xhgui1001$ Define new `xhgui.watches` table via xhguiadmin@m2-master.eqiad.wmnet database, ref T341499
[23:11:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:11:34] <stashbot>	 T341499: Upgrade XHGui from 0.14.0 to latest (0.21.3) - https://phabricator.wikimedia.org/T341499
[23:13:36] <wikibugs>	 (CR) Krinkle: [V: +1] "I've tested this in Beta Cluster first, both on the version currently in production via performance/docroot.git (xhgui 0.14.0), and with t" [puppet] - https://gerrit.wikimedia.org/r/936767 (owner: Krinkle)
[23:13:59] <wikibugs>	 (CR) BryanDavis: "PCC output: https://puppet-compiler.wmflabs.org/output/936797/42386/" [puppet] - https://gerrit.wikimedia.org/r/936797 (https://phabricator.wikimedia.org/T232135) (owner: BryanDavis)
[23:48:30] <jinxer-wm>	 (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer