[00:00:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:02:36] RECOVERY - confd service on an-worker1145 is OK: OK - confd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:05:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:07:14] PROBLEM - confd service on an-worker1145 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:19:32] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:30:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:30:18] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:35:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:38:04] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/935879 [00:38:10] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/935879 (owner: TrainBranchBot) [00:43:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:48:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:54:51] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/935879 (owner: TrainBranchBot) [00:55:16] (MediaWikiLatencyExceeded) firing: 
Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:00:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:00:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:01:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:05:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:06:11] !log rzl@deploy1002 helmfile [staging] START 
helmfile.d/services/opentelemetry-collector: apply [01:06:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:06:24] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply [01:11:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:16:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:20:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:25:16] 
(MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:29:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:42:58] PROBLEM - puppet last run on an-worker1145 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [01:43:21] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:00:06] PROBLEM - SSH on bast6002 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:01:36] RECOVERY - SSH on bast6002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:08:21] (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:28:21] (JobUnavailable) firing: (5) Reduced 
availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:29:19] (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:37:15] SRE, Observability-Metrics: Tooling for end-of-quarter SLO reporting - https://phabricator.wikimedia.org/T290924 (lmata) [02:41:32] (PS1) RLazarus: opentelemetry-collector: Vendor 0.62.2 [deployment-charts] - https://gerrit.wikimedia.org/r/936388 (https://phabricator.wikimedia.org/T324117) [02:41:34] (PS1) RLazarus: opentelemetry-collector: Fix image and entry point [deployment-charts] - https://gerrit.wikimedia.org/r/936389 (https://phabricator.wikimedia.org/T320564) [02:49:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:53:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:16:12] (Restored) Anzx: Enable tabs for 
non logged in users on knwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/932284 (https://phabricator.wikimedia.org/T340276) (owner: Anzx) [03:18:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:22:27] (PS2) Anzx: Enable tabs for non logged in users on knwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/932284 (https://phabricator.wikimedia.org/T340276) [03:24:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:44:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:58:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:08:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:24:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:34:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:38:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:13:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:37:16] ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T341437 (phaultfinder) [05:37:18] ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T341438 (phaultfinder) [05:42:15] ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T341438 (phaultfinder) [05:42:17] ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T341438 (phaultfinder) [05:43:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:48:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:06:05] (PS1) KartikMistry: Update cxserver to 2023-07-06-065912-production [deployment-charts] - https://gerrit.wikimedia.org/r/936391 (https://phabricator.wikimedia.org/T340989) [06:10:17] (CR) Giuseppe Lavagetto: [C: +2] mw-debug-repl: improve UX (2 comments) [puppet] - https://gerrit.wikimedia.org/r/936280 (https://phabricator.wikimedia.org/T341197) (owner: Giuseppe Lavagetto) [06:16:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:21:12] (CR) Elukey: [C: -1] "need some work" [puppet] - https://gerrit.wikimedia.org/r/936304 (https://phabricator.wikimedia.org/T338357) (owner: Elukey) [06:21:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:22:17] (PS5) MdsShakil: Deploy action blocks on bnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/934614 (https://phabricator.wikimedia.org/T340904) [06:26:58] (PS1) Giuseppe Lavagetto: mediawiki::repl: allow execution from everyone [puppet] - https://gerrit.wikimedia.org/r/936394 (https://phabricator.wikimedia.org/T341197) [06:30:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:33:21] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:37:08] (CR) Giuseppe Lavagetto: [C: +2] mediawiki::repl: allow execution from everyone [puppet] - https://gerrit.wikimedia.org/r/936394 (https://phabricator.wikimedia.org/T341197) (owner: Giuseppe Lavagetto) [06:41:59] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/836775 (owner: Muehlenhoff) [06:43:20] !log add 100G to prometheus/k8s in codfw [06:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1026.eqiad.wmnet [06:50:24] SRE, Infrastructure-Foundations, fundraising-tech-ops: Create a new group dns-admins - https://phabricator.wikimedia.org/T341440 (MoritzMuehlenhoff) [06:50:46] SRE, Infrastructure-Foundations, fundraising-tech-ops: Create a new group dns-admins - https://phabricator.wikimedia.org/T341440 (MoritzMuehlenhoff) p:Triage→Medium a:MoritzMuehlenhoff [06:55:03] (PS2) KartikMistry: Update cxserver to 2023-07-10-065135-production [deployment-charts] - https://gerrit.wikimedia.org/r/936391 (https://phabricator.wikimedia.org/T337719) [06:55:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1026.eqiad.wmnet [06:56:30] (PS2) JMeybohm: Add AppArmor configuration for the 
deployed function-evaluator service. [deployment-charts] - https://gerrit.wikimedia.org/r/936316 (https://phabricator.wikimedia.org/T326785) (owner: Cory Massaro) [06:57:19] (CR) CI reject: [V: -1] Add AppArmor configuration for the deployed function-evaluator service. [deployment-charts] - https://gerrit.wikimedia.org/r/936316 (https://phabricator.wikimedia.org/T326785) (owner: Cory Massaro) [06:58:04] PROBLEM - Host aux-k8s-etcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [07:00:06] Amir1, Urbanecm, and taavi: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230710T0700). [07:00:06] Func and MdsShakil: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:10] Hi :) [07:00:13] o/ [07:00:26] RECOVERY - Host aux-k8s-etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms [07:01:06] (CR) JMeybohm: "As for the actual profile: That needs to be shipped via puppet IIRC - I don't think that has been implemented yet." 
[deployment-charts] - https://gerrit.wikimedia.org/r/936316 (https://phabricator.wikimedia.org/T326785) (owner: Cory Massaro) [07:01:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1026.eqiad.wmnet [07:01:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1026.eqiad.wmnet [07:02:30] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1027.eqiad.wmnet [07:02:57] (PS1) Vgutierrez: trafficserver: add gateway routing script, route device-analytics on cp2037 [puppet] - https://gerrit.wikimedia.org/r/936509 (https://phabricator.wikimedia.org/T320967) [07:04:50] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:05:00] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:05:00] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42357/console" [puppet] - https://gerrit.wikimedia.org/r/936509 (https://phabricator.wikimedia.org/T320967) (owner: Vgutierrez) [07:05:06] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:05:16] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:05:40] PROBLEM - BFD status on cr3-knams is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:05:46] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 2/4 UP : OSPFv3: 2/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:05:58] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 
UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:09:38] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:09:44] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:09:54] RECOVERY - BFD status on cr2-eqdfw is OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:10:20] RECOVERY - BFD status on cr3-knams is OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:10:26] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:10:38] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:11:10] Zzzzzzzz [07:15:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:15:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:18:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:19:13] SRE, Traffic, envoy, serviceops, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (JMeybohm) [07:20:16] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: sync [07:20:27] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: sync [07:21:08] !log deploy1002: removed empty untracked directory from MediaWiki staging area: `rmdir /srv/mediawiki-staging/wmf-config/scap/log/ && rmdir /srv/mediawiki-staging/wmf-config/scap/` | T341292 [07:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:11] T341292: scap backport should remove code for removed submodules - https://phabricator.wikimedia.org/T341292 [07:21:15] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync [07:21:24] not synced cause they are empty directories not holding any code [07:21:37] left over from a 2016 deploy of some sort [07:22:02] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync [07:22:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1027.eqiad.wmnet [07:22:38] SRE, Content-Transform-Team-WIP, Mobile-Content-Service, RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (Jgiannelos) I think for wikiwand we only allow requests based on referer should we add or replace the rule with the user agent? 
[07:23:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:26:35] hashar: Hi, could you help with the backport window? [07:27:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:27:30] (CR) Muehlenhoff: [C: +2] "Looks good, I'll merge" [puppet] - https://gerrit.wikimedia.org/r/930653 (https://phabricator.wikimedia.org/T339251) (owner: Hashar) [07:27:46] PROBLEM - SSH on bast6002 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:28:55] SRE, Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (MoritzMuehlenhoff) [07:29:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1027.eqiad.wmnet [07:29:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1027.eqiad.wmnet [07:29:20] RECOVERY - SSH on bast6002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:29:49] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync [07:30:14] !log installing 
libgstreamer-plugins-base1.0-0 security updates [07:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:36] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync [07:30:55] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1029.eqiad.wmnet [07:32:23] Func: yes! [07:32:34] jouncebot: now [07:32:34] For the next 0 hour(s) and 27 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230710T0700) [07:32:50] (CR) Jelto: [C: +1] "lgtm!" [deployment-charts] - https://gerrit.wikimedia.org/r/930886 (https://phabricator.wikimedia.org/T300171) (owner: Dzahn) [07:32:56] I guess nobody is running it, so I will [07:33:05] thanks [07:33:15] sorry Func and MdsShakil , I usually don't run the backport window and thus haven't thought about checking the patches this morning [07:33:40] * hashar grab coffee number N+1 [07:33:58] PROBLEM - SSH on bast6002 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:35:32] RECOVERY - SSH on bast6002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:35:40] (CR) TrainBranchBot: [C: +2] "Approved by hashar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/935876 (https://phabricator.wikimedia.org/T341407) (owner: Func) [07:35:46] Func: doing it :) [07:36:10] (PS1) Elukey: services: allow kafka batches in EventGate's main producer [deployment-charts] - https://gerrit.wikimedia.org/r/936515 (https://phabricator.wikimedia.org/T338357) [07:36:24] (Merged) jenkins-bot: thwiki: Update logos from commons [mediawiki-config] - https://gerrit.wikimedia.org/r/935876 (https://phabricator.wikimedia.org/T341407) (owner: Func) [07:36:26] (CR) Giuseppe Lavagetto: [V: +2 C: +2] Add bookworm to the local build 
configurations [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/935693 (https://phabricator.wikimedia.org/T341115) (owner: 10Giuseppe Lavagetto) [07:36:51] !log hashar@deploy1002 Started scap: Backport for [[gerrit:935876|thwiki: Update logos from commons (T341407)]] [07:36:54] T341407: Update th.wikipedia.org logo - https://phabricator.wikimedia.org/T341407 [07:37:21] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] images: convert use of seed_image into use of image_tag [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/935694 (https://phabricator.wikimedia.org/T341115) (owner: 10Giuseppe Lavagetto) [07:38:48] (03CR) 10Elukey: "The 5ms setting is the default for node-rdkafka, basically what's suggested by upstream. I should improve things on the kafka main eqiad s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/936515 (https://phabricator.wikimedia.org/T338357) (owner: 10Elukey) [07:39:40] (03CR) 10Elukey: [C: 03+1] istio: convert use of seed_image into use of image_tag [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/935695 (https://phabricator.wikimedia.org/T341115) (owner: 10Giuseppe Lavagetto) [07:39:54] (03PS1) 10Urbanecm: Growth: Increase mentorship percentage to 25% on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936639 (https://phabricator.wikimedia.org/T341399) [07:40:42] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] istio: convert use of seed_image into use of image_tag [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/935695 (https://phabricator.wikimedia.org/T341115) (owner: 10Giuseppe Lavagetto) [07:41:41] well it is pushing a 5 GBytes docker image at 5MB/s so that is taking a bit of time [07:41:43] (03CR) 10JMeybohm: [C: 03+1] cert-manager: convert use of seed_image to image_tag [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/935696 (https://phabricator.wikimedia.org/T341115) (owner: 10Giuseppe Lavagetto) [07:42:51] (03PS1) 
10Muehlenhoff: Add library hints for gst-plugins-base1.0 [puppet] - 10https://gerrit.wikimedia.org/r/936649 [07:45:57] !log hashar@deploy1002 func and hashar: Backport for [[gerrit:935876|thwiki: Update logos from commons (T341407)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [07:46:00] T341407: Update th.wikipedia.org logo - https://phabricator.wikimedia.org/T341407 [07:46:19] (03CR) 10Muehlenhoff: [C: 03+2] Add library hints for gst-plugins-base1.0 [puppet] - 10https://gerrit.wikimedia.org/r/936649 (owner: 10Muehlenhoff) [07:46:57] hashar: confirmed fixed [07:47:12] Func: thank you for the confirmation! :] [07:47:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1029.eqiad.wmnet [07:53:38] MdsShakil: I am deploying your change for "Deploy action blocks on bnwiki" [07:54:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by hashar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934614 (https://phabricator.wikimedia.org/T340904) (owner: 10MdsShakil) [07:54:02] I am around :) [07:54:11] (03CR) 10Jaime Nuche: [C: 03+1] contint: replace Apache 2.2 access control syntax for Jenkins proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932440 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn) [07:54:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1029.eqiad.wmnet [07:54:43] (03Merged) 10jenkins-bot: Deploy action blocks on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934614 (https://phabricator.wikimedia.org/T340904) (owner: 10MdsShakil) [07:54:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1029.eqiad.wmnet [07:54:46] ah great [07:55:05] well I have some issue with the deployment tool unfortunately [07:55:16] it thinks the previous change is still being deployed :] [07:56:20] 
Func: I forgot scap was waiting for the test on mwdebug, so I am now rolling the Thai logo update to everything [07:56:46] I have too many windows [07:58:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1030.eqiad.wmnet [08:00:42] !log installing flask security updates on bullseye [08:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:23] !log hashar@deploy1002 Finished scap: Backport for [[gerrit:935876|thwiki: Update logos from commons (T341407)]] (duration: 25m 32s) [08:02:27] T341407: Update th.wikipedia.org logo - https://phabricator.wikimedia.org/T341407 [08:02:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:02:44] MdsShakil: finally doing your change :) [08:02:59] !log hashar@deploy1002 Started scap: Backport for [[gerrit:934614|Deploy action blocks on bnwiki (T340904)]] [08:03:02] T340904: Deploy action blocks on bnwiki - https://phabricator.wikimedia.org/T340904 [08:03:35] MdsShakil: which I guess can be deployed entirely, or do you want to test it? [08:03:45] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Add a cookbook to safely deploy puppet changes - https://phabricator.wikimedia.org/T341442 (10Joe) [08:04:15] hashar: Your preference :) [08:04:21] !log hashar@deploy1002 hashar and mdsshakil: Backport for [[gerrit:934614|Deploy action blocks on bnwiki (T340904)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [08:04:25] !log installing c-ares security updates on buster [08:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:51] MdsShakil: it is on mwdebug servers if you wanna test :] [08:05:02] given I don't know anything about that feature [08:05:14] Looks good [08:05:23] let's go!
[08:06:55] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf LDAP group for Urbanecm - https://phabricator.wikimedia.org/T341443 (10Urbanecm_WMF) [08:07:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:07:45] (03PS1) 10Clément Goubert: Revert "trafficserver: Send testwiki traffic to mw-on-k8s" [puppet] - 10https://gerrit.wikimedia.org/r/936417 [08:07:59] (03CR) 10CI reject: [V: 04-1] Revert "trafficserver: Send testwiki traffic to mw-on-k8s" [puppet] - 10https://gerrit.wikimedia.org/r/936417 (owner: 10Clément Goubert) [08:08:04] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf LDAP group for Urbanecm - https://phabricator.wikimedia.org/T341443 (10Urbanecm_WMF) [08:09:15] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf LDAP group for Urbanecm - https://phabricator.wikimedia.org/T341443 (10RhinosF1) I'm pretty sure to be in 'wmf' a @wikimedia.org email needs to be linked. Looks like your ldap account is @wikimedia.cz [08:09:36] (03PS2) 10Clément Goubert: Revert "trafficserver: Send testwiki traffic to mw-on-k8s" [puppet] - 10https://gerrit.wikimedia.org/r/936417 [08:10:14] hashar: Let me know when you are done with the backport. I plan to deploy cxserver. [08:10:24] it is almost complete [08:10:37] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Add a cookbook to safely deploy puppet changes - https://phabricator.wikimedia.org/T341442 (10Joe) Things that I don't think we have to create such a cookbook: * programmatic way to merge changes in gerrit. I'm not sure if this could have some... [08:11:15] !log hashar@deploy1002 Finished scap: Backport for [[gerrit:934614|Deploy action blocks on bnwiki (T340904)]] (duration: 08m 15s) [08:11:19] T340904: Deploy action blocks on bnwiki - https://phabricator.wikimedia.org/T340904 [08:11:22] !log UTC morning backport window completed.
[08:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:26] kart_: all yours :-] [08:15:06] hashar: Thanks. [08:15:09] (03CR) 10Hashar: "recheck after https://gerrit.wikimedia.org/r/c/integration/config/+/936296/" [debs/pdns-recursor] - 10https://gerrit.wikimedia.org/r/936297 (owner: 10Ssingh) [08:15:56] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf LDAP group for Urbanecm - https://phabricator.wikimedia.org/T341443 (10Urbanecm_WMF) >>! In T341443#9000250, @RhinosF1 wrote: > I'm pretty sure to be in 'wmf' a @wikimedia.org email needs to be linked. Done. [08:16:10] There is an undeployed change "mesh.configuration: Update all charts to 1.3.2" in cxserver (and probably other services also). Is it OK to go ahead with this? _joe_ akosiaris? [08:16:32] jayme: ^ [08:16:40] <_joe_> kart_: jayme is who you want to ask :D [08:16:51] kart_: yes please! [08:17:10] Cool. Thanks! [08:17:44] "should be the last one for some time" 😇 [08:18:15] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "minor nit, otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/936417 (owner: 10Clément Goubert) [08:19:14] <_joe_> jayme: lol [08:19:20] (03PS3) 10Clément Goubert: Revert "trafficserver: Send testwiki traffic to mw-on-k8s" [puppet] - 10https://gerrit.wikimedia.org/r/936417 [08:19:32] "we should be fine and stable now" [08:19:35] x) [08:19:54] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Revert "trafficserver: Send testwiki traffic to mw-on-k8s" [puppet] - 10https://gerrit.wikimedia.org/r/936417 (owner: 10Clément Goubert) [08:20:02] for this week indeed :-p [08:20:16] Yes, please keep everything fine and stable this week [08:20:18] I am on call [08:20:20] :P [08:20:36] <_joe_> claime: oh then elukey has some surprises for you [08:20:54] Can he have these surprises tomorrow [08:20:58] I'm not on call tomorrow [08:21:00] :D [08:21:02] <_joe_> lol [08:21:13] <_joe_> elukey: please hurry with your changes [08:21:40] * claime groans at _joe_'s conception of a
birthday present [08:22:33] (03CR) 10Clément Goubert: [C: 03+2] Revert "trafficserver: Send testwiki traffic to mw-on-k8s" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936417 (owner: 10Clément Goubert) [08:22:57] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2023-07-10-065135-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/936391 (https://phabricator.wikimedia.org/T337719) (owner: 10KartikMistry) [08:23:54] (03Merged) 10jenkins-bot: Update cxserver to 2023-07-10-065135-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/936391 (https://phabricator.wikimedia.org/T337719) (owner: 10KartikMistry) [08:24:36] !log Running puppet on cp-text hosts - T337489 [08:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:39] T337489: Run QTE test suite on testwiki on kubernetes - https://phabricator.wikimedia.org/T337489 [08:25:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1030.eqiad.wmnet [08:26:12] PROBLEM - Check systemd state on kubestagemaster2002 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:39] claime: o/ [08:26:59] elukey: \o [08:27:03] (03PS1) 10Btullis: Use an internal schema registry for datahub on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/936651 (https://phabricator.wikimedia.org/T329514) [08:27:14] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [08:27:19] claime: to be fair you nerd-sniped me into the task, so if I produce code reviews during your on-call shift it is only karma :) [08:27:35] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [08:27:50] elukey: the changeprop task?
[08:28:02] claime: correct yes, you have a code review for eventgate :) [08:28:16] elukey: https://www.youtube.com/watch?v=hd1ciPnTGKg [08:29:21] lol [08:31:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1030.eqiad.wmnet [08:31:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1030.eqiad.wmnet [08:32:43] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1031.eqiad.wmnet [08:32:56] (03CR) 10Clément Goubert: [C: 03+2] Fix CirrusSearchJobQueueLagTooHigh to use histograms [alerts] - 10https://gerrit.wikimedia.org/r/936070 (owner: 10Alexandros Kosiaris) [08:33:42] PROBLEM - Check systemd state on kubestagemaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:40:25] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [08:41:07] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [08:44:09] !log jmm@cumin2002 START - Cookbook sre.aqs.roll-restart-reboot rolling restart_daemons on A:aqs-codfw [08:45:36] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [08:46:10] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [08:47:42] !log Updated cxserver to 2023-07-10-065135-production (T337719, T340989) [08:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:46] T340989: MinT not working for Bhojpuri in Content & Section Translation - https://phabricator.wikimedia.org/T340989 [08:47:47] T337719: CX: Replace calls to the deprecated mobile content REST API - https://phabricator.wikimedia.org/T337719 [08:48:11] !log installing libxpm security updates [08:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:13] I’ll deploy a 
security patch if that’s alright with everyone [08:51:21] going ahead [08:54:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1031.eqiad.wmnet [08:55:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.aqs.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:aqs-codfw [08:55:51] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [08:56:46] PROBLEM - Host aux-k8s-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100% [08:57:12] !log jmm@cumin2002 START - Cookbook sre.aqs.roll-restart-reboot rolling restart_daemons on A:aqs-eqiad [08:57:52] PROBLEM - Host kubetcd1005 is DOWN: PING CRITICAL - Packet loss = 100% [08:58:02] !log lucaswerkmeister-wmde: Deployed security patch for T340220 [08:59:09] (03PS2) 10Btullis: Use an internal schema registry for datahub on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/936651 (https://phabricator.wikimedia.org/T329514) [08:59:10] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudlb: eqiad: bootstrap hiera data [puppet] - 10https://gerrit.wikimedia.org/r/936022 (https://phabricator.wikimedia.org/T341200) (owner: 10Arturo Borrero Gonzalez) [08:59:13] * Lucas_WMDE done [08:59:34] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:00:30] RECOVERY - Host aux-k8s-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms [09:00:34] RECOVERY - Host kubetcd1005 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [09:00:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1031.eqiad.wmnet [09:00:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node 
ganeti1031.eqiad.wmnet [09:01:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1032.eqiad.wmnet [09:04:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:04:39] !log rebalance ganeti clusters in esams/ulsfo/eqsin following reboots [09:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:04] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudlb1001.eqiad.wmnet with OS bullseye [09:06:17] 10SRE, 10ops-eqiad, 10Patch-For-Review, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): rename cloudswift1001 as cloudlb1001 - https://phabricator.wikimedia.org/T341200 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudlb1001.eqiad.wmnet... 
[09:07:44] !log installing cups security updates (libs only) [09:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:46] (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/936372 (owner: 10Majavah) [09:08:05] (03CR) 10Jbond: [C: 03+2] mailmap: expand mailmap [puppet] - 10https://gerrit.wikimedia.org/r/936372 (owner: 10Majavah) [09:08:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.aqs.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:aqs-eqiad [09:10:23] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Management LAN in eqsin offline due to failure of mr1-eqsin - https://phabricator.wikimedia.org/T341447 (10cmooney) p:05Triage→03High [09:11:56] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/836775 (owner: 10Muehlenhoff) [09:12:01] !log restarting mw canaries to pick up libxpm security update [09:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:03] (03CR) 10Btullis: [C: 03+2] Use an internal schema registry for datahub on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/936651 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [09:14:11] (03Merged) 10jenkins-bot: Use an internal schema registry for datahub on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/936651 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [09:14:12] !log depool cp2037 (debugging ATS cacheability issues) - T320967 [09:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:16] T320967: [AQS 2.0] New Service Request device_analytics - https://phabricator.wikimedia.org/T320967 [09:14:41] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] trafficserver: add gateway routing script, route device-analytics on cp2037 [puppet] - 10https://gerrit.wikimedia.org/r/936509 (https://phabricator.wikimedia.org/T320967) (owner: 10Vgutierrez) [09:15:52] (03CR) 10Jbond: [C: 03+1] 
"lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/934371 (owner: 10Muehlenhoff) [09:16:31] (03PS2) 10Elukey: profile::kafka: update prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/936304 (https://phabricator.wikimedia.org/T338357) [09:17:43] (03CR) 10Elukey: "Tested on kafka-test1006, the metrics are displayed correctly:" [puppet] - 10https://gerrit.wikimedia.org/r/936304 (https://phabricator.wikimedia.org/T338357) (owner: 10Elukey) [09:20:36] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10Traffic, 10netops: Management LAN in eqsin offline due to failure of mr1-eqsin - https://phabricator.wikimedia.org/T341447 (10Volans) Adding #traffic for awareness. [09:20:41] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10Traffic, 10netops: Management LAN in eqsin offline due to failure of mr1-eqsin - https://phabricator.wikimedia.org/T341447 (10Volans) [09:22:13] (03CR) 10Jbond: "@Kieth, feel free to merge this yourself if you are happy, or we can do it together when you are online" [puppet] - 10https://gerrit.wikimedia.org/r/930187 (https://phabricator.wikimedia.org/T326657) (owner: 10Jbond) [09:23:03] !log rebalance ganeti group codfw/A after reboots [09:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1032.eqiad.wmnet [09:25:09] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [09:25:11] (03CR) 10Clément Goubert: [C: 03+1] services: allow kafka batches in EventGate's main producer [deployment-charts] - 10https://gerrit.wikimedia.org/r/936515 (https://phabricator.wikimedia.org/T338357) (owner: 10Elukey) [09:25:50] PROBLEM - Host dse-k8s-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100% [09:25:51] (03CR) 10Muehlenhoff: [C: 03+2] mirrors: Add monitoring for mirrors [puppet] - 10https://gerrit.wikimedia.org/r/836775 (owner: 10Muehlenhoff) [09:26:40] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/934371 (owner: 10Muehlenhoff) [09:26:42] PROBLEM - Host aux-k8s-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [09:27:00] PROBLEM - Host kubestagetcd1004 is DOWN: PING CRITICAL - Packet loss = 100% [09:28:03] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudlb1001 - aborrero@cumin1001" [09:29:34] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudlb1001 - aborrero@cumin1001" [09:29:34] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:30:36] RECOVERY - Host kubestagetcd1004 is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms [09:30:56] RECOVERY - Host aux-k8s-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [09:31:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1032.eqiad.wmnet [09:31:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1032.eqiad.wmnet [09:31:26] RECOVERY - Host dse-k8s-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [09:31:42] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [09:33:43] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudlb1002 - aborrero@cumin1001" [09:33:46] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Very good job overall! 
Your tests don't pass because you need to provide a list of kafka brokers to your tests for deployments, that's don" [deployment-charts] - 10https://gerrit.wikimedia.org/r/935771 (owner: 10Kamila Součková) [09:33:56] PROBLEM - purged service on cp2037 is CRITICAL: CRITICAL - Expecting active but unit purged is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:34:11] (03PS1) 10Btullis: Enable the kafka-setup job for datahub in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/936656 (https://phabricator.wikimedia.org/T329514) [09:34:28] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/936304 (https://phabricator.wikimedia.org/T338357) (owner: 10Elukey) [09:34:51] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf LDAP group for Urbanecm - https://phabricator.wikimedia.org/T341443 (10Urbanecm_WMF) [09:35:04] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudlb1002 - aborrero@cumin1001" [09:35:04] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:35:08] (03CR) 10Elukey: [C: 03+2] profile::kafka: update prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/936304 (https://phabricator.wikimedia.org/T338357) (owner: 10Elukey) [09:35:28] RECOVERY - purged service on cp2037 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:35:35] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudlb1001.eqiad.wmnet with reason: host reimage [09:37:17] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Puppet Profiler - https://phabricator.wikimedia.org/T341448 (10jbond) [09:37:26] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Puppet Profiler - https://phabricator.wikimedia.org/T341448 (10jbond) p:05Triage→03Medium [09:38:25] !log 
jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1033.eqiad.wmnet [09:38:42] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudlb1001.eqiad.wmnet with reason: host reimage [09:39:16] ACKNOWLEDGEMENT - Host mr1-eqsin.oob IPv6 is DOWN: CRITICAL - Destination Unreachable (2403:b100:3001:9::2) Cathal Mooney mr1-eqsin down - The acknowledgement expires at: 2023-07-12 09:38:54. [09:39:26] 10SRE, 10Observability-Alerting, 10Traffic, 10collaboration-services, 10serviceops-radar: Timeouts when talking to phabricator API - https://phabricator.wikimedia.org/T341039 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi >>! In T341039#8995349, @Aklapper wrote: > Hmm. The problem //could// be rel... [09:39:28] ACKNOWLEDGEMENT - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100% Cathal Mooney mr1-eqsin down - The acknowledgement expires at: 2023-07-12 09:39:17. [09:39:32] !log rebalance ganeti group codfw/B after reboots [09:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:46] ACKNOWLEDGEMENT - Juniper alarms on asw1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.132.128.4 Cathal Mooney mr1-eqsin down - The acknowledgement expires at: 2023-07-12 09:39:33. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [09:39:46] ACKNOWLEDGEMENT - Host asw1-eqsin is DOWN: PING CRITICAL - Packet loss = 100% Cathal Mooney mr1-eqsin down - The acknowledgement expires at: 2023-07-12 09:39:33. [09:40:02] ACKNOWLEDGEMENT - BFD status on cr2-eqsin is CRITICAL: Down: 1 Cathal Mooney mr1-eqsin down - The acknowledgement expires at: 2023-07-12 09:39:52. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:40:12] ACKNOWLEDGEMENT - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP Cathal Mooney mr1-eqsin down - The acknowledgement expires at: 2023-07-12 09:40:03. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:40:24] ACKNOWLEDGEMENT - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP Cathal Mooney mr1-eqsin down - The acknowledgement expires at: 2023-07-12 09:40:13. https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:41:07] (03PS1) 10Arturo Borrero Gonzalez: codfw1dev: open radosgw API to the internet [puppet] - 10https://gerrit.wikimedia.org/r/936657 (https://phabricator.wikimedia.org/T341380) [09:41:14] ACKNOWLEDGEMENT - ps1-604-eqsin-infeed-load-tower-B-single-phase on ps1-604-eqsin is CRITICAL: CRITICAL - Plugin timed out while executing system call Cathal Mooney mr1-eqsin down - The acknowledgement expires at: 2023-07-12 09:41:04. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:41:14] ACKNOWLEDGEMENT - ps1-604-eqsin-infeed-load-tower-A-single-phase on ps1-604-eqsin is CRITICAL: CRITICAL - Plugin timed out while executing system call Cathal Mooney mr1-eqsin down - The acknowledgement expires at: 2023-07-12 09:41:04. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:41:27] ACKNOWLEDGEMENT - ps1-603-eqsin-infeed-load-tower-B-single-phase on ps1-603-eqsin is CRITICAL: CRITICAL - Plugin timed out while executing system call Cathal Mooney mr1-eqsin down - The acknowledgement expires at: 2023-07-12 09:41:16. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:41:27] ACKNOWLEDGEMENT - ps1-603-eqsin-infeed-load-tower-A-single-phase on ps1-603-eqsin is CRITICAL: CRITICAL - Plugin timed out while executing system call Cathal Mooney mr1-eqsin down - The acknowledgement expires at: 2023-07-12 09:41:16. 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:41:48] (03CR) 10Muehlenhoff: [C: 03+2] profile::java: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/934371 (owner: 10Muehlenhoff) [09:41:52] ACKNOWLEDGEMENT - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 Cathal Mooney mr1-eqsin down - The acknowledgement expires at: 2023-07-12 09:41:42. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:41:52] ACKNOWLEDGEMENT - Juniper alarms on mr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 103.102.166.128 Cathal Mooney mr1-eqsin down - The acknowledgement expires at: 2023-07-12 09:41:42. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [09:44:50] !log aborrero@cumin1001 START - Cookbook sre.dns.wipe-cache cloudlb1002.private.eqiad.wikimedia.cloud on all recursors [09:44:53] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudlb1002.private.eqiad.wikimedia.cloud on all recursors [09:45:50] (03CR) 10Btullis: [C: 03+2] Enable the kafka-setup job for datahub in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/936656 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [09:46:47] (03Merged) 10jenkins-bot: Enable the kafka-setup job for datahub in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/936656 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [09:49:24] (03PS4) 10Arturo Borrero Gonzalez: templates/56.15.185.in-addr.arpa: delegate 185.15.56.0/25 to designate @ eqiad1 [dns] - 10https://gerrit.wikimedia.org/r/936257 (https://phabricator.wikimedia.org/T341338) [09:49:41] (03CR) 10David Caro: [C: 03+2] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/936373 (https://phabricator.wikimedia.org/T325466) (owner: 10Majavah) [09:50:31] !log vgutierrez@cumin1001 START - 
Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp2037.codfw.wmnet with reason: vgutierrez debugging [09:50:32] (03CR) 10David Caro: [C: 03+2] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/936376 (https://phabricator.wikimedia.org/T325466) (owner: 10Majavah) [09:50:44] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp2037.codfw.wmnet with reason: vgutierrez debugging [09:52:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:52:51] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1033.eqiad.wmnet [09:52:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1033.eqiad.wmnet [09:53:00] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [09:53:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:53:21] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1033.eqiad.wmnet [09:56:09] Checking parsoid latency [09:56:31] Because it's getting like 300 rps so it shouldn't really be overloaded... 
[09:57:14] (03PS1) 10Btullis: Bump the datahub top-level chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/936658 (https://phabricator.wikimedia.org/T329514) [09:58:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:59:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:59:30] It's hovering right around the threshold and flapping [10:00:06] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230710T1000) [10:01:59] It actually started ramping up during the night and hasn't really come down [10:02:28] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - aborrero@cumin1001" [10:03:11] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - aborrero@cumin1001" [10:03:12] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudlb1001.eqiad.wmnet with OS bullseye [10:03:19] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): rename 
cloudswift1001 as cloudlb1001 - https://phabricator.wikimedia.org/T341200 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudlb1001.eqiad.wmnet with OS bullseye completed... [10:05:32] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudlb1002.eqiad.wmnet with OS bullseye [10:11:36] !log aborrero@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudlb1002.eqiad.wmnet with OS bullseye [10:11:39] (03PS1) 10Majavah: P:openstack: move magnum fw rules to haproxy profile [puppet] - 10https://gerrit.wikimedia.org/r/936663 [10:11:41] (03PS1) 10Majavah: P:openstack: open eqiad1 magnum api to the public [puppet] - 10https://gerrit.wikimedia.org/r/936664 [10:11:47] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1033.eqiad.wmnet [10:12:27] !log repooling parse1012.eqiad.wmnet [10:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:36] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=parsoid,name=parse1012.* [10:13:07] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudlb1002.eqiad.wmnet with OS bullseye [10:13:32] RECOVERY - mediawiki-installation DSH group on parse1012 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [10:14:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1033.eqiad.wmnet [10:19:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:21:11] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host ganeti1033.eqiad.wmnet [10:21:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1033.eqiad.wmnet [10:21:46] (03PS5) 10D3r1ck01: wmf-config: Remove wgContentTranslationDefaultParsoidClient cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930798 [10:21:48] (03PS1) 10JMeybohm: k8s::apiserver: Implement kube-apiserver reload [puppet] - 10https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826) [10:22:12] (03CR) 10CI reject: [V: 04-1] k8s::apiserver: Implement kube-apiserver reload [puppet] - 10https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [10:23:09] (03CR) 10Btullis: [C: 03+2] Bump the datahub top-level chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/936658 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [10:23:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:23:53] (03Merged) 10jenkins-bot: Bump the datahub top-level chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/936658 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [10:25:28] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudlb1002.eqiad.wmnet with reason: host reimage [10:26:02] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1034.eqiad.wmnet [10:26:21] (03PS2) 10JMeybohm: k8s::apiserver: Implement kube-apiserver reload [puppet] - 10https://gerrit.wikimedia.org/r/936666 
(https://phabricator.wikimedia.org/T329826) [10:27:10] PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:28:23] RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:28:39] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudlb1002.eqiad.wmnet with reason: host reimage [10:31:07] (03PS1) 10Vgutierrez: Revert "trafficserver: add gateway routing script, route device-analytics on cp2037" [puppet] - 10https://gerrit.wikimedia.org/r/936422 [10:33:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:33:21] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:34:37] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [10:35:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:38:54] (03CR) 10Vgutierrez: [C: 03+2] Revert "trafficserver: add gateway routing script, route device-analytics on cp2037" [puppet] - 10https://gerrit.wikimedia.org/r/936422 (owner: 10Vgutierrez) [10:40:26] (03CR) 10Elukey: [C: 03+2] services: allow kafka batches in EventGate's main producer [deployment-charts] - 10https://gerrit.wikimedia.org/r/936515 (https://phabricator.wikimedia.org/T338357) (owner: 10Elukey) [10:42:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:43:25] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.remove-downtime for cp2037.codfw.wmnet [10:43:25] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp2037.codfw.wmnet [10:43:48] (03PS1) 10Hashar: Review access change [dns] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/936423 [10:44:35] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: sync [10:44:44] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: sync [10:45:12] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [10:45:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:45:44] (03PS2) 10Hashar: Grant permission to ldap/dns-admins [dns] 
(refs/meta/config) - 10https://gerrit.wikimedia.org/r/936423 (https://phabricator.wikimedia.org/T341440) [10:46:22] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, thanks!" [dns] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/936423 (https://phabricator.wikimedia.org/T341440) (owner: 10Hashar) [10:46:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1034.eqiad.wmnet [10:46:47] (03CR) 10Hashar: [V: 03+2 C: 03+2] Grant permission to ldap/dns-admins [dns] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/936423 (https://phabricator.wikimedia.org/T341440) (owner: 10Hashar) [10:47:02] <_joe_> claime: looks like parsoid's latency went down suddenly [10:47:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:49:06] 10SRE, 10LDAP-Access-Requests: Grant Access to wmde for Ifrahkhanyaree (Ifrah_WMDE) - https://phabricator.wikimedia.org/T341455 (10Ifrahkhanyaree) [10:49:31] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudlb1002.eqiad.wmnet with OS bullseye [10:50:14] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: sync [10:50:33] 10SRE, 10LDAP-Access-Requests: Grant Access to wmde for Ifrahkhanyaree (Ifrah_WMDE) - https://phabricator.wikimedia.org/T341455 (10WMDE-leszek) I confirm Ifrah uses the account mentioned and she's a Product Manager employed at WMDE. Thank you for processing the request. 
[10:50:37] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync [10:51:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:51:37] PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:53:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1034.eqiad.wmnet [10:54:15] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10Patch-For-Review: Create a new group dns-admins - https://phabricator.wikimedia.org/T341440 (10hashar) The Gerrit configuration change grants members of dns-admins {nav Code-Review +2} and {nav Submit} which should be all that is needed. Note t... 
[10:55:05] !log failover ganeti master in eqiad to ganeti1029 [10:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:57:29] RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:57:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:58:19] PROBLEM - ganeti-wconfd running on ganeti1028 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [10:59:23] PROBLEM - HTTPS Ganeti RAPI eqiad on ganeti1028 is CRITICAL: connect to address ganeti01.svc.eqiad.wmnet and port 5080: No route to host https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon [11:00:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:02:38] (KubernetesAPILatency) resolved: High Kubernetes API latency 
(LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:03:21] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:05:36] (03PS1) 10Btullis: Disable the kafka-setup job in datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/936670 (https://phabricator.wikimedia.org/T329514) [11:05:37] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6003.drmrs.wmnet [11:09:35] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_eqiad_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:09:44] (03CR) 10Btullis: [C: 03+2] Disable the kafka-setup job in datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/936670 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [11:10:30] (03Merged) 10jenkins-bot: Disable the kafka-setup job in datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/936670 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [11:10:49] (03PS3) 10JMeybohm: k8s::apiserver: Implement kube-apiserver reload [puppet] - 10https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826) [11:11:05] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti6003.drmrs.wmnet [11:11:43] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [11:12:21] PROBLEM - Check unit status of netbox_ganeti_eqiad_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit 
netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:13:16] (03PS4) 10JMeybohm: k8s::apiserver: Implement kube-apiserver reload [puppet] - 10https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826) [11:14:27] PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:14:32] !log remove unused VM netflow6002 T330884 [11:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:35] T330884: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 [11:14:53] 10SRE, 10Continuous-Integration-Infrastructure: Puppet package_builder module should have the apt cache auto cleaned - https://phabricator.wikimedia.org/T339251 (10hashar) Should be good now. I have previously removed all caches from the CI instances so it is unlikely we can check the result of this change the... 
[11:15:26] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 28 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: WIP [11:15:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 28 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: WIP [11:16:01] RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:16:13] (03PS5) 10JMeybohm: k8s::apiserver: Implement kube-apiserver reload [puppet] - 10https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826) [11:22:33] (03PS1) 10Muehlenhoff: Add dns-admins to list of sensitive groups [puppet] - 10https://gerrit.wikimedia.org/r/936674 (https://phabricator.wikimedia.org/T341440) [11:22:49] (03PS2) 10Vivian Rook: P:openstack: open eqiad1 magnum api to the public [puppet] - 10https://gerrit.wikimedia.org/r/936664 (https://phabricator.wikimedia.org/T341459) (owner: 10Majavah) [11:23:20] (03PS1) 10Btullis: Use plaintext port 8080 for local schema registry in datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/936675 (https://phabricator.wikimedia.org/T329514) [11:23:24] (03PS2) 10Vivian Rook: P:openstack: move magnum fw rules to haproxy profile [puppet] - 10https://gerrit.wikimedia.org/r/936663 (https://phabricator.wikimedia.org/T341459) (owner: 10Majavah) [11:23:35] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [11:23:38] (03CR) 10Vivian Rook: [C: 03+1] P:openstack: move magnum fw rules to haproxy profile [puppet] - 10https://gerrit.wikimedia.org/r/936663 (https://phabricator.wikimedia.org/T341459) (owner: 10Majavah) [11:23:44] (03CR) 10Vivian Rook: [C: 03+1] P:openstack: open eqiad1 magnum api to the public [puppet] - 10https://gerrit.wikimedia.org/r/936664 (https://phabricator.wikimedia.org/T341459) (owner: 10Majavah) [11:24:38] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 
10Patch-For-Review: Create a new group dns-admins - https://phabricator.wikimedia.org/T341440 (10MoritzMuehlenhoff) The new group has been documented under https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups#Primary_groups [11:24:58] (03CR) 10Btullis: [C: 03+2] Use plaintext port 8080 for local schema registry in datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/936675 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [11:25:45] (03CR) 10CI reject: [V: 04-1] P:openstack: open eqiad1 magnum api to the public [puppet] - 10https://gerrit.wikimedia.org/r/936664 (https://phabricator.wikimedia.org/T341459) (owner: 10Majavah) [11:26:02] (03CR) 10CI reject: [V: 04-1] P:openstack: move magnum fw rules to haproxy profile [puppet] - 10https://gerrit.wikimedia.org/r/936663 (https://phabricator.wikimedia.org/T341459) (owner: 10Majavah) [11:26:04] (03Merged) 10jenkins-bot: Use plaintext port 8080 for local schema registry in datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/936675 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [11:26:57] PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:27:33] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10Patch-For-Review: Create a new group dns-admins - https://phabricator.wikimedia.org/T341440 (10MoritzMuehlenhoff) @Jgreen and @Dwisehaupt I have removed you from the cn=ops LDAP group and added you to cn=dns-admins (which has the permissions to... 
[11:28:41] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [11:28:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6003.drmrs.wmnet [11:29:07] (03PS3) 10Majavah: P:openstack: move magnum fw rules to haproxy profile [puppet] - 10https://gerrit.wikimedia.org/r/936663 (https://phabricator.wikimedia.org/T341459) [11:29:09] (03PS3) 10Majavah: P:openstack: open eqiad1 magnum api to the public [puppet] - 10https://gerrit.wikimedia.org/r/936664 (https://phabricator.wikimedia.org/T341459) [11:29:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6003.drmrs.wmnet [11:30:30] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [11:31:44] RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:34:02] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [11:35:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6003.drmrs.wmnet [11:36:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6003.drmrs.wmnet [11:36:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:38:53] (03CR) 10Vivian Rook: [C: 03+1] P:openstack: open eqiad1 magnum api to the public [puppet] - 10https://gerrit.wikimedia.org/r/936664 (https://phabricator.wikimedia.org/T341459) (owner: 10Majavah) [11:38:58] (03CR) 10Vivian Rook: [C: 03+1] P:openstack: move magnum fw rules to haproxy profile [puppet] - 
10https://gerrit.wikimedia.org/r/936663 (https://phabricator.wikimedia.org/T341459) (owner: 10Majavah) [11:39:26] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/935882 [11:41:00] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:42:42] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6004.drmrs.wmnet [11:49:21] (03PS1) 10Jgreen: Remove payments-listener-old.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/936686 (https://phabricator.wikimedia.org/T340128) [11:49:59] _joe_: It went down right after I re-added parse1012, then went back up [11:50:04] It's being really spiky [11:51:31] (03PS1) 10Jbond: config-master: drop ssh-fingerprints.txt file [puppet] - 10https://gerrit.wikimedia.org/r/936691 (https://phabricator.wikimedia.org/T340947) [11:51:33] (03PS1) 10Jbond: ssh :switch to using exported resources [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) [11:52:10] (03CR) 10CI reject: [V: 04-1] ssh :switch to using exported resources [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) (owner: 10Jbond) [11:52:12] !log repool cp2037 (debugging finished) - T320967 [11:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:15] T320967: [AQS 2.0] New Service Request device_analytics - https://phabricator.wikimedia.org/T320967 [11:53:56] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:54:04] (03CR) 10Jgreen: [C: 03+2] Remove payments-listener-old.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/936686 (https://phabricator.wikimedia.org/T340128) (owner: 
10Jgreen) [11:54:14] (03CR) 10CI reject: [V: 04-1] config-master: drop ssh-fingerprints.txt file [puppet] - 10https://gerrit.wikimedia.org/r/936691 (https://phabricator.wikimedia.org/T340947) (owner: 10Jbond) [11:55:31] !log installing avahi security updates [11:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:44] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:55:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6004.drmrs.wmnet [11:56:30] (03PS6) 10JMeybohm: k8s::apiserver: Implement kube-apiserver reload [puppet] - 10https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826) [11:57:10] (03PS2) 10Jbond: ssh :switch to using exported resources [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) [11:57:40] (03CR) 10CI reject: [V: 04-1] ssh :switch to using exported resources [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) (owner: 10Jbond) [11:57:59] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10Patch-For-Review: Create a new group dns-admins - https://phabricator.wikimedia.org/T341440 (10Jgreen) >>! In T341440#9000902, @MoritzMuehlenhoff wrote: > @Jgreen and @Dwisehaupt I have removed you from the cn=ops LDAP group and added you to cn... 
[11:58:36] 10SRE, 10Patch-For-Review, 10Tracking-Neverending: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (10Vgutierrez) [11:58:53] 10SRE: Cronspam from acmechief-test1001 - https://phabricator.wikimedia.org/T295770 (10Vgutierrez) 05Open→03Resolved Instance hasn't produced cronspam since Nov 2021 [12:01:36] PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:02:12] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10Patch-For-Review: Create a new group dns-admins - https://phabricator.wikimedia.org/T341440 (10MoritzMuehlenhoff) Hmmh, won't you need additional sudo privileges to run dnsauth-update? Or did you trigger this indirectly via the sre.dns.netbox c... [12:02:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6004.drmrs.wmnet [12:02:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6004.drmrs.wmnet [12:04:55] !log failover ganeti masters in drmrs [12:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:06] (03PS7) 10JMeybohm: k8s::apiserver: Implement kube-apiserver reload [puppet] - 10https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826) [12:08:30] (Not accepting/receiving prefixes from anycast BGP peer) firing: (2) Alert for device cr1-eqiad.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [12:09:10] PROBLEM - ganeti-wconfd running on ganeti6001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [12:09:32] PROBLEM - ganeti-wconfd running on ganeti6002 is CRITICAL: PROCS CRITICAL: 0 processes with UID 
= 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [12:09:52] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10Patch-For-Review: Create a new group dns-admins - https://phabricator.wikimedia.org/T341440 (10Jgreen) >>! In T341440#9000960, @MoritzMuehlenhoff wrote: > Hmmh, won't you need additional sudo privileges to run dnsauth-update? Or did you trigger... [12:10:07] topranks: I assume your email for CRT-009240 is related to the cr1 alerts above? [12:10:57] yep the cr1-eqiad and cr1-drmrs alerts [12:11:14] I've noticed the cloudsw one now also, that's probably new host cloudlb but I'll have a look [12:12:09] ack [12:12:26] RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:16:16] (03PS5) 10Arturo Borrero Gonzalez: templates/56.15.185.in-addr.arpa: delegate 185.15.56.0/25 to designate @ eqiad1 [dns] - 10https://gerrit.wikimedia.org/r/936257 (https://phabricator.wikimedia.org/T341338) [12:18:00] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Direct 0.5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341078 (10Clement_Goubert) 05In progress→03Resolved [12:18:12] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [12:18:24] 10SRE-Sprint-Week-Sustainability-March2023, 10Phabricator, 10collaboration-services, 10serviceops-radar, and 2 others: Phabricator: Unable to view tasks in DB read-only mode - https://phabricator.wikimedia.org/T313879 (10Aklapper) 05Stalled→03Open >>! In T313879#8531556, @LSobanski wrote: > To be inves... [12:18:34] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[dns] - 10https://gerrit.wikimedia.org/r/936257 (https://phabricator.wikimedia.org/T341338) (owner: 10Arturo Borrero Gonzalez) [12:18:38] (03CR) 10Majavah: [C: 04-1] templates/56.15.185.in-addr.arpa: delegate 185.15.56.0/25 to designate @ eqiad1 (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/936257 (https://phabricator.wikimedia.org/T341338) (owner: 10Arturo Borrero Gonzalez) [12:19:15] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Direct 1% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341463 (10Clement_Goubert) [12:19:23] 10SRE, 10vm-requests: eqiad: 1 VM requested for karapace in support of datahub in staging - https://phabricator.wikimedia.org/T341464 (10BTullis) p:05Triage→03High [12:19:32] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [12:19:44] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Direct 1% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341463 (10Clement_Goubert) 05Open→03In progress p:05Triage→03High [12:20:23] (03PS6) 10Arturo Borrero Gonzalez: templates/56.15.185.in-addr.arpa: delegate 185.15.56.0/25 to designate @ eqiad1 [dns] - 10https://gerrit.wikimedia.org/r/936257 (https://phabricator.wikimedia.org/T341338) [12:20:57] (03PS7) 10Arturo Borrero Gonzalez: templates/56.15.185.in-addr.arpa: delegate 185.15.56.0/25 to designate @ eqiad1 [dns] - 10https://gerrit.wikimedia.org/r/936257 (https://phabricator.wikimedia.org/T341338) [12:21:09] (03PS1) 10Clément Goubert: mw-on-k8s: Redirect 1% of all traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/936697 (https://phabricator.wikimedia.org/T341078) [12:21:34] (03CR) 10Arturo Borrero Gonzalez: templates/56.15.185.in-addr.arpa: delegate 185.15.56.0/25 to designate @ eqiad1 (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/936257 (https://phabricator.wikimedia.org/T341338) (owner: 10Arturo Borrero 
Gonzalez) [12:22:07] (03CR) 10Majavah: [C: 03+1] templates/56.15.185.in-addr.arpa: delegate 185.15.56.0/25 to designate @ eqiad1 [dns] - 10https://gerrit.wikimedia.org/r/936257 (https://phabricator.wikimedia.org/T341338) (owner: 10Arturo Borrero Gonzalez) [12:22:21] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] templates/56.15.185.in-addr.arpa: delegate 185.15.56.0/25 to designate @ eqiad1 [dns] - 10https://gerrit.wikimedia.org/r/936257 (https://phabricator.wikimedia.org/T341338) (owner: 10Arturo Borrero Gonzalez) [12:22:30] (03PS2) 10Clément Goubert: mw-on-k8s: Redirect 1% of all traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/936697 (https://phabricator.wikimedia.org/T341078) [12:24:27] btullis: FYI datahub-mae-consumer-main container is spamming a ton of exceptions in logs on kubestage [12:25:03] godog: Sorry, will destroy the deployment now. [12:25:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:25:24] btullis: ack, thank you [12:26:15] godog: done. 
[12:26:54] (03PS3) 10Jbond: ssh :switch to using exported resources [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) [12:27:03] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mw-on-k8s: Redirect 1% of all traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/936697 (https://phabricator.wikimedia.org/T341078) (owner: 10Clément Goubert) [12:27:19] (03CR) 10CI reject: [V: 04-1] ssh :switch to using exported resources [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) (owner: 10Jbond) [12:28:00] cheers [12:29:43] (03PS3) 10Clément Goubert: mw-on-k8s: Redirect 1% of all traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/936697 (https://phabricator.wikimedia.org/T341463) [12:30:08] PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:30:52] (03PS4) 10Jbond: ssh :switch to using exported resources [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) [12:31:19] (03CR) 10CI reject: [V: 04-1] ssh :switch to using exported resources [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) (owner: 10Jbond) [12:32:32] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/936674 (https://phabricator.wikimedia.org/T341440) (owner: 10Muehlenhoff) [12:33:28] !log Sending 1% of global traffic to mw-on-k8s - T341463 [12:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:32] T341463: Direct 1% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341463 [12:33:32] (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: Redirect 1% of all traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/936697 (https://phabricator.wikimedia.org/T341463) (owner: 10Clément Goubert) [12:33:47] (03PS8) 10JMeybohm: 
k8s::apiserver: Implement kube-apiserver reload [puppet] - 10https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826) [12:34:22] !log Running puppet on cp-text hosts - T341463 [12:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:08] RECOVERY - Check unit status of netbox_ganeti_eqiad_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:37:20] RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:37:34] (03PS9) 10JMeybohm: k8s::apiserver: Implement kube-apiserver reload [puppet] - 10https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826) [12:39:18] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42370/console" [puppet] - 10https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [12:41:41] 10SRE, 10Infrastructure-Foundations, 10netops: BFD flapping from cloudsw1-c8-eqiad (QFX5100) - https://phabricator.wikimedia.org/T341466 (10cmooney) p:05Triage→03Medium [12:43:26] (03PS5) 10Jbond: ssh :switch to using exported resources [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) [12:43:55] (03CR) 10CI reject: [V: 04-1] ssh :switch to using exported resources [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) (owner: 10Jbond) [12:46:38] PROBLEM - Check unit status of netbox_ganeti_eqiad_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:50:24] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:50:25] (03PS6) 10Jbond: ssh :switch to using exported resources [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) [12:50:51] (03CR) 10CI reject: [V: 04-1] ssh :switch to using exported resources [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) (owner: 10Jbond) [12:54:34] 10SRE, 10ops-eqsin, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Management LAN in eqsin offline due to failure of mr1-eqsin - https://phabricator.wikimedia.org/T341447 (10RobH) Order Number - 1-228138359365 entered for remote hands to power cycle the device and reply back to the ticket to let us... [12:54:38] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_eqiad_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:55:22] (03PS7) 10Jbond: ssh :switch to using exported resources [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) [12:55:28] 10SRE, 10vm-requests: eqiad: 1 VM requested for karapace in support of datahub in staging - https://phabricator.wikimedia.org/T341464 (10MoritzMuehlenhoff) Looks good [12:58:07] (03CR) 10CI reject: [V: 04-1] ssh :switch to using exported resources [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) (owner: 10Jbond) [12:59:30] (03PS8) 10Jbond: ssh :switch to using exported resources [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) [12:59:38] (03CR) 10Muehlenhoff: [C: 03+2] Add dns-admins to list of sensitive groups [puppet] - 10https://gerrit.wikimedia.org/r/936674 (https://phabricator.wikimedia.org/T341440) (owner: 10Muehlenhoff) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: I, the Bot under the Fountain, call upon thee, The 
Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230710T1300). [13:00:05] arlolra: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:35] I can deploy [13:00:47] (03PS1) 10Fabfur: hiera: add silent-drop directives for http frontend [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) [13:02:17] (03CR) 10CI reject: [V: 04-1] ssh :switch to using exported resources [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) (owner: 10Jbond) [13:02:42] (03CR) 10Lucas Werkmeister (WMDE): Disable wgParserEnableLegacyMediaDOM on group2 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936322 (https://phabricator.wikimedia.org/T314318) (owner: 10Arlolra) [13:03:32] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-aborrero: gerrit.w.o is not included in https://config-master.wikimedia.org/known_hosts - https://phabricator.wikimedia.org/T340947 (10jbond) i have a patch out however id like to sort the results before merging this which will be much easier... 
[13:03:55] (03CR) 10Lucas Werkmeister (WMDE): Disable wgParserEnableLegacyMediaDOM on group2 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936322 (https://phabricator.wikimedia.org/T314318) (owner: 10Arlolra) [13:04:21] (03CR) 10Arlolra: Disable wgParserEnableLegacyMediaDOM on group2 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936322 (https://phabricator.wikimedia.org/T314318) (owner: 10Arlolra) [13:04:49] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Disable wgParserEnableLegacyMediaDOM on group2 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936322 (https://phabricator.wikimedia.org/T314318) (owner: 10Arlolra) [13:04:53] (03PS2) 10Lucas Werkmeister (WMDE): Disable wgParserEnableLegacyMediaDOM on group2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936322 (https://phabricator.wikimedia.org/T314318) (owner: 10Arlolra) [13:05:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936322 (https://phabricator.wikimedia.org/T314318) (owner: 10Arlolra) [13:05:43] (03Merged) 10jenkins-bot: Disable wgParserEnableLegacyMediaDOM on group2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936322 (https://phabricator.wikimedia.org/T314318) (owner: 10Arlolra) [13:05:59] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:936322|Disable wgParserEnableLegacyMediaDOM on group2 wikis (T314318)]] [13:06:03] T314318: Disable wgParserEnableLegacyMediaDOM on all wikis - https://phabricator.wikimedia.org/T314318 [13:07:29] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10Patch-For-Review: Create a new group dns-admins - https://phabricator.wikimedia.org/T341440 (10MoritzMuehlenhoff) >>! In T341440#9001013, @Jgreen wrote: >>>! In T341440#9000960, @MoritzMuehlenhoff wrote: >> Hmmh, won't you need additional sudo... 
[13:07:36] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and arlolra: Backport for [[gerrit:936322|Disable wgParserEnableLegacyMediaDOM on group2 wikis (T314318)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [13:07:54] arlolra: can you test on mwdebug? [13:08:12] Yup [13:09:03] (03CR) 10Ssingh: "Ready for review." [debs/pdns-recursor] - 10https://gerrit.wikimedia.org/r/936297 (owner: 10Ssingh) [13:10:09] Lucas_WMDE: looks good [13:10:35] alright, syncing then [13:10:48] Thank you [13:11:10] 10SRE, 10Data-Platform-SRE, 10vm-requests: eqiad: 1 VM requested for karapace in support of datahub in staging - https://phabricator.wikimedia.org/T341464 (10BTullis) [13:11:57] * claime watches mw-on-k8s 503s on deployment [13:12:01] Did we solve the problem? :D [13:12:17] hm? [13:12:58] Lucas_WMDE: We used to have 503s when redeploying mw-on-k8s because of an improper shutdown order of containers, and kubernetes being werid [13:13:01] weird* [13:13:06] ah [13:13:39] I guess you have a chance to find out? ^^ [13:13:46] 'xactly :D [13:13:48] my scap already finished the running helmfile parts fwiw [13:13:51] ack [13:13:57] just reached the php-fpm-restart [13:14:19] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:14:21] I think we got maybe 1 or 2 on api-ext, none on web [13:14:33] So far so good [13:14:43] cool [13:15:52] random question – do we know where all these jsonTruncated messages in logstash come from? 
[13:16:14] PROBLEM - ganeti-wconfd running on ganeti1029 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [13:16:15] the mediawiki-errors raw events list is currently just jsonTruncated, nothing else (among the 1–50 entries) – not extremely useful [13:16:16] !log btullis@cumin1001 START - Cookbook sre.ganeti.makevm for new host karapace1002.eqiad.wmnet [13:16:17] !log btullis@cumin1001 START - Cookbook sre.dns.netbox [13:16:26] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:936322|Disable wgParserEnableLegacyMediaDOM on group2 wikis (T314318)]] (duration: 10m 26s) [13:16:29] T314318: Disable wgParserEnableLegacyMediaDOM on all wikis - https://phabricator.wikimedia.org/T314318 [13:16:42] (03PS2) 10Ilias Sarantopoulos: ores extension: deploy LiftWing usage on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935743 (https://phabricator.wikimedia.org/T319170) [13:16:47] arlolra: should be done [13:16:48] Lucas_WMDE: I think it's because it's sending too big messages, godog may know more [13:16:59] Lucas_WMDE: great, thank you [13:17:05] Lucas_WMDE: once done, please ping me, I have a bunch of stuff to deploy [13:17:11] Amir1: I’m done, go ahead [13:17:15] oh thanks [13:17:18] ^^ [13:17:26] (03CR) 10Ladsgroup: [C: 03+2] ExternalLinks: Make order by and continue only rely on el_id in READ NEW [core] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/935856 (https://phabricator.wikimedia.org/T341000) (owner: 10Ladsgroup) [13:17:30] (Primary outbound port utilisation over 80% #page) firing: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [13:17:30] (Primary outbound port utilisation over 80% #page) firing: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation 
over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [13:17:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:17:44] Oh great [13:17:46] (03CR) 10Ladsgroup: [C: 03+2] ores extension: deploy LiftWing usage on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935743 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [13:17:54] claime Lucas_WMDE ack, I'll check [13:18:04] 10SRE, 10Traffic: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (10ssingh) [13:18:31] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935743 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [13:18:42] (03Merged) 10jenkins-bot: ores extension: deploy LiftWing usage on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935743 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [13:18:44] godog: I don’t think it’s particularly new, I just figured I’d ask [13:18:52] whether we know what’s sending the too-long messages, that is [13:18:59] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:935743|ores extension: deploy LiftWing usage on testwiki (T319170)]] [13:19:02] T319170: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170 [13:19:07] Lucas_WMDE: oh ok, yeah I'm not sure right off the bat [13:19:09] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [debs/pdns-recursor] - 10https://gerrit.wikimedia.org/r/936297 (owner: 10Ssingh) [13:20:00] o/ [13:20:22] !log ladsgroup@deploy1002 isaranto and ladsgroup: Backport for [[gerrit:935743|ores 
extension: deploy LiftWing usage on testwiki (T319170)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [13:20:24] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:20:26] !log btullis@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM karapace1002.eqiad.wmnet - btullis@cumin1001" [13:21:10] !log btullis@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM karapace1002.eqiad.wmnet - btullis@cumin1001" [13:21:10] !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:21:10] !log btullis@cumin1001 START - Cookbook sre.dns.wipe-cache karapace1002.eqiad.wmnet on all recursors [13:21:13] !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) karapace1002.eqiad.wmnet on all recursors [13:21:19] Lucas_WMDE: jsonTruncated messages is logstash receiving messages from MediaWiki that are too long. 
An example is logging a large SQL query (like a deadlock when batch inserting a lot of fields) [13:21:38] !log btullis@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM karapace1002.eqiad.wmnet - btullis@cumin1001" [13:22:12] ah ok [13:22:22] !log btullis@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM karapace1002.eqiad.wmnet - btullis@cumin1001" [13:22:30] (Primary outbound port utilisation over 80% #page) resolved: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [13:22:30] (Primary outbound port utilisation over 80% #page) resolved: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [13:22:32] I think there is a dashboard dedicated to them but one has to look at the truncated raw json to find out the source [13:22:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:23:23] and I think there is some Grafana board tracking them as well as other logstash ingestion errors. 
[13:23:48] PROBLEM - puppet last run on logstash1025 is CRITICAL: CRITICAL: Puppet has been disabled for 604864 seconds, message: jmm, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:23:59] (PuppetDisabled) firing: Puppet disabled on logstash1025:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=logstash&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [13:24:13] (03PS1) 10Ssingh: ntp/eqiad: point to dns1004 [dns] - 10https://gerrit.wikimedia.org/r/936703 (https://phabricator.wikimedia.org/T326685) [13:26:13] (03CR) 10Fabfur: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/936703 (https://phabricator.wikimedia.org/T326685) (owner: 10Ssingh) [13:26:32] (03CR) 10Ssingh: [C: 03+2] ntp/eqiad: point to dns1004 [dns] - 10https://gerrit.wikimedia.org/r/936703 (https://phabricator.wikimedia.org/T326685) (owner: 10Ssingh) [13:26:34] RECOVERY - Host asw1-eqsin is UP: PING OK - Packet loss = 0%, RTA = 243.05 ms [13:27:00] RECOVERY - Host mr1-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 239.32 ms [13:27:02] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host karapace1002.eqiad.wmnet with OS bullseye [13:27:08] 10SRE, 10Data-Platform-SRE, 10vm-requests: eqiad: 1 VM requested for karapace in support of datahub in staging - https://phabricator.wikimedia.org/T341464 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host karapace1002.eqiad.wmnet with OS bullseye [13:27:08] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync [13:27:34] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:27:38] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 218.45 ms [13:27:40] !log 
elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync [13:28:02] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:935743|ores extension: deploy LiftWing usage on testwiki (T319170)]] (duration: 09m 02s) [13:28:05] T319170: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170 [13:28:09] !log bking@cumin1001 conftool action : set/pooled=yes; selector: name=wdqs2020.codfw.wmnet [13:28:22] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:28:40] RECOVERY - Check unit status of netbox_ganeti_eqiad_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:32:43] (03PS1) 10Btullis: Add a second karapace VM [puppet] - 10https://gerrit.wikimedia.org/r/936706 (https://phabricator.wikimedia.org/T329514) [13:33:21] (03PS1) 10Ssingh: dns1005: provision new DNS host in eqiad (hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/936709 (https://phabricator.wikimedia.org/T326685) [13:33:23] (03PS1) 10Ssingh: dns1006: provision new DNS host in eqiad (hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/936710 (https://phabricator.wikimedia.org/T326685) [13:33:47] (03CR) 10Btullis: [C: 03+2] Add a second karapace VM [puppet] - 10https://gerrit.wikimedia.org/r/936706 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [13:34:24] (03Merged) 10jenkins-bot: ExternalLinks: Make order by and continue only rely on el_id in READ NEW [core] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/935856 (https://phabricator.wikimedia.org/T341000) (owner: 10Ladsgroup) [13:35:09] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:935856|ExternalLinks: Make order by and continue only rely on el_id in READ NEW (T341000 T47237)]] [13:35:15] T47237: LinkSearch uses numeric offset paging 
instead of paging by last entry returned - https://phabricator.wikimedia.org/T47237 [13:36:27] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on karapace1002.eqiad.wmnet with reason: host reimage [13:36:39] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:935856|ExternalLinks: Make order by and continue only rely on el_id in READ NEW (T341000 T47237)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [13:39:19] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:39:38] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on karapace1002.eqiad.wmnet with reason: host reimage [13:40:11] (03CR) 10Fabfur: [C: 03+1] "IP seems correct (checked on NetBox)" [puppet] - 10https://gerrit.wikimedia.org/r/936709 (https://phabricator.wikimedia.org/T326685) (owner: 10Ssingh) [13:40:18] (03CR) 10Fabfur: "IP seems correct (checked on NetBox)" [puppet] - 10https://gerrit.wikimedia.org/r/936710 (https://phabricator.wikimedia.org/T326685) (owner: 10Ssingh) [13:42:18] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1028.eqiad.wmnet [13:44:22] (03CR) 10Ssingh: [C: 03+2] dns1005: provision new DNS host in eqiad (hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/936709 (https://phabricator.wikimedia.org/T326685) (owner: 10Ssingh) [13:46:12] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:935856|ExternalLinks: Make order by and continue only rely on el_id in READ NEW (T341000 T47237)]] (duration: 11m 03s) [13:46:17] T47237: LinkSearch uses numeric offset paging instead of paging by last entry returned - https://phabricator.wikimedia.org/T47237 
[13:46:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:47:07] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:47:30] !log sukhe@cumin1001 START - Cookbook sre.hosts.reimage for host dns1005.wikimedia.org with OS bullseye [13:47:40] 10SRE, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host dns1005.wikimedia.org with OS bullseye [13:48:30] (Not accepting/receiving prefixes from anycast BGP peer) firing: (3) Alert for device cloudsw1-c8-eqiad.mgmt.eqiad.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [13:48:55] (03PS1) 10Ladsgroup: Set commons to READ_NEW for externallinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936716 (https://phabricator.wikimedia.org/T335343) [13:50:37] (03CR) 10Ladsgroup: [C: 03+2] Set commons to READ_NEW for externallinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936716 (https://phabricator.wikimedia.org/T335343) (owner: 10Ladsgroup) [13:51:14] (03PS1) 10Samtar: IS: Enable Phonos on medium projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936717 (https://phabricator.wikimedia.org/T336763) [13:51:18] (03CR) 10TrainBranchBot: [C: 03+2] "Approved 
by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936716 (https://phabricator.wikimedia.org/T335343) (owner: 10Ladsgroup) [13:51:28] (03Merged) 10jenkins-bot: Set commons to READ_NEW for externallinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936716 (https://phabricator.wikimedia.org/T335343) (owner: 10Ladsgroup) [13:51:41] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:936716|Set commons to READ_NEW for externallinks migration (T335343)]] [13:51:44] T335343: Set externallinks migration stage to read new on beta and production - https://phabricator.wikimedia.org/T335343 [13:52:07] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:52:34] (03PS2) 10Samtar: IS: Enable Phonos on medium projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936717 (https://phabricator.wikimedia.org/T336763) [13:52:53] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host karapace1002.eqiad.wmnet with OS bullseye [13:52:53] !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host karapace1002.eqiad.wmnet [13:53:03] 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Patch-For-Review: eqiad: 1 VM requested for karapace in support of datahub in staging - https://phabricator.wikimedia.org/T341464 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host karapace1002.eqiad.wmnet with OS b... 
[13:53:05] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:936716|Set commons to READ_NEW for externallinks migration (T335343)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [13:54:43] 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Patch-For-Review: eqiad: 1 VM requested for karapace in support of datahub in staging - https://phabricator.wikimedia.org/T341464 (10BTullis) 05Open→03Resolved [13:54:51] (03PS1) 10Gmodena: mw-page-content-change-enrichment partition by (wiki_id, page_id) [deployment-charts] - 10https://gerrit.wikimedia.org/r/936718 (https://phabricator.wikimedia.org/T338169) [13:55:27] (03PS1) 10Ssingh: sites.yaml: add new dns host dns1005 (eqiad hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/936719 (https://phabricator.wikimedia.org/T326685) [13:55:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1028.eqiad.wmnet [13:55:29] (03PS1) 10Ssingh: sites.yaml: add new dns host dns1006 (eqiad hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/936720 (https://phabricator.wikimedia.org/T326685) [13:58:02] (03CR) 10Fabfur: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/936720 (https://phabricator.wikimedia.org/T326685) (owner: 10Ssingh) [13:58:06] (03CR) 10Fabfur: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/936719 (https://phabricator.wikimedia.org/T326685) (owner: 10Ssingh) [13:58:14] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host contint2002.wikimedia.org [13:58:26] (03CR) 10Ssingh: [C: 03+2] Release pdns-recursor 4.8.4-1+wmf11u1. 
[debs/pdns-recursor] - 10https://gerrit.wikimedia.org/r/936297 (owner: 10Ssingh) [13:59:32] !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dns1005.wikimedia.org with reason: host reimage [14:01:03] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:936716|Set commons to READ_NEW for externallinks migration (T335343)]] (duration: 09m 22s) [14:01:07] T335343: Set externallinks migration stage to read new on beta and production - https://phabricator.wikimedia.org/T335343 [14:02:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1028.eqiad.wmnet [14:02:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1028.eqiad.wmnet [14:02:54] !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns1005.wikimedia.org with reason: host reimage [14:03:58] PROBLEM - Host an-worker1145 is DOWN: PING CRITICAL - Packet loss = 100% [14:04:31] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host contint2002.wikimedia.org [14:05:14] ACKNOWLEDGEMENT - confd service on an-worker1145 is CRITICAL: CRITICAL - Expecting active but unit confd is activating Btullis Cold booted for T341481 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:05:14] ACKNOWLEDGEMENT - SSH on an-worker1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Btullis Cold booted for T341481 https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:05:14] ACKNOWLEDGEMENT - Hadoop DataNode on an-worker1145 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode Btullis Cold booted for T341481 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [14:05:14] ACKNOWLEDGEMENT - Host an-worker1145 is DOWN: PING CRITICAL - Packet loss = 100% Btullis Cold booted for T341481 [14:05:36] 
(GitLabCIPipelineErrors) firing: GitLab - High pipeline error rate - https://wikitech.wikimedia.org/wiki/GitLab/Runbook - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabCIPipelineErrors [14:07:26] RECOVERY - Host an-worker1145 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [14:07:36] RECOVERY - Check systemd state on an-worker1145 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:38] RECOVERY - Hadoop NodeManager on an-worker1145 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:08:21] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:09:34] RECOVERY - Hadoop DataNode on an-worker1145 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [14:10:03] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [14:10:06] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:10:36] (GitLabCIPipelineErrors) resolved: GitLab - High pipeline error rate - https://wikitech.wikimedia.org/wiki/GitLab/Runbook - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabCIPipelineErrors [14:13:08] RECOVERY - confd service on an-worker1145 is OK: OK - confd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state 
[14:13:26] (03PS1) 10Jelto: gitlab: increase thresholds for GitLab CI alerts [alerts] - 10https://gerrit.wikimedia.org/r/936722 (https://phabricator.wikimedia.org/T341384) [14:13:28] 10SRE, 10ops-eqsin, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Management LAN in eqsin offline due to failure of mr1-eqsin - https://phabricator.wikimedia.org/T341447 (10cmooney) Equinix came back and said they rebooted. Device is reachable again: ` cmooney@mr1-eqsin> show system uptime Curren... [14:13:38] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations: Ripe atlas eqiad reported down in Icinga since 2023-06-27 - https://phabricator.wikimedia.org/T341108 (10Jclark-ctr) 05Open→03Resolved [14:14:19] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:14:48] RECOVERY - puppet last run on an-worker1145 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:15:23] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [14:15:27] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:19:19] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [14:19:22] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:22:37] !log sukhe@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1001" [14:22:38] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T341035 (10Jclark-ctr) 05Open→03Resolved Replaced cables , reset idrac [14:22:40] !log 
gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [14:22:43] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:23:20] !log sukhe@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1001" [14:23:21] !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns1005.wikimedia.org with OS bullseye [14:23:31] 10SRE, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host dns1005.wikimedia.org with OS bullseye completed: - dns1005 (**WARN**) - Removed from Puppet an... [14:23:38] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:26:15] !log sukhe@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "running manually for dns1005 - sukhe@cumin1001" [14:27:09] !log sukhe@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "running manually for dns1005 - sukhe@cumin1001" [14:27:28] (03CR) 10TChin: [C: 03+1] mw-page-content-change-enrichment partition by (wiki_id, page_id) [deployment-charts] - 10https://gerrit.wikimedia.org/r/936718 (https://phabricator.wikimedia.org/T338169) (owner: 10Gmodena) [14:28:01] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10Patch-For-Review, 10SRE Observability (FY2023/2024-Q1): Logstash SLO excursion on 2023-02-11 - https://phabricator.wikimedia.org/T331461 (10lmata) [14:28:38] (KubernetesAPILatency) resolved: High 
Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:28:41] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [14:28:46] 10SRE-OnFire, 10Observability-Alerting, 10SRE Observability (FY2023/2024-Q1), 10Sustainability (Incident Followup): Alert when no data is received from Prometheus in a certain amount of time - https://phabricator.wikimedia.org/T336448 (10lmata) [14:28:47] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:29:21] (03CR) 10Ssingh: [C: 03+2] sites.yaml: add new dns host dns1005 (eqiad hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/936719 (https://phabricator.wikimedia.org/T326685) (owner: 10Ssingh) [14:33:18] !log add new dns host dns1005 [14:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:37] (03CR) 10Gmodena: [C: 03+2] mw-page-content-change-enrichment partition by (wiki_id, page_id) [deployment-charts] - 10https://gerrit.wikimedia.org/r/936718 (https://phabricator.wikimedia.org/T338169) (owner: 10Gmodena) [14:38:16] (03CR) 10Dzahn: [C: 03+1] gitlab: increase thresholds for GitLab CI alerts [alerts] - 10https://gerrit.wikimedia.org/r/936722 (https://phabricator.wikimedia.org/T341384) (owner: 10Jelto) [14:38:23] (03Merged) 10jenkins-bot: mw-page-content-change-enrichment partition by (wiki_id, page_id) [deployment-charts] - 10https://gerrit.wikimedia.org/r/936718 (https://phabricator.wikimedia.org/T338169) (owner: 10Gmodena) [14:40:34] (03CR) 10Dzahn: [C: 03+2] gitlab: increase thresholds for GitLab CI alerts [alerts] - 10https://gerrit.wikimedia.org/r/936722 (https://phabricator.wikimedia.org/T341384) (owner: 10Jelto) [14:45:38] (03CR) 10Ssingh: [C: 03+2] sites.yaml: add new 
dns host dns1006 (eqiad hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/936720 (https://phabricator.wikimedia.org/T326685) (owner: 10Ssingh) [14:46:20] !log tchin@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [14:46:24] !log tchin@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:46:26] (03CR) 10Ssingh: [C: 03+2] dns1006: provision new DNS host in eqiad (hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/936710 (https://phabricator.wikimedia.org/T326685) (owner: 10Ssingh) [14:47:13] (03PS2) 10Ssingh: dns1006: provision new DNS host in eqiad (hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/936710 (https://phabricator.wikimedia.org/T326685) [14:48:14] (03PS1) 10RobH: addming new skus [software] - 10https://gerrit.wikimedia.org/r/936746 [14:48:21] (03CR) 10CI reject: [V: 04-1] addming new skus [software] - 10https://gerrit.wikimedia.org/r/936746 (owner: 10RobH) [14:48:38] RECOVERY - puppet last run on logstash1025 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:48:48] (03PS2) 10RobH: addming new skus [software] - 10https://gerrit.wikimedia.org/r/936746 [14:48:59] (PuppetDisabled) resolved: Puppet disabled on logstash1025:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=logstash&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [14:49:16] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns1006.wikimedia.org with OS bullseye [14:49:27] 10SRE, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns1006.wikimedia.org with OS bullseye [14:49:31] (03CR) 
10RobH: [C: 03+2] updating R450 skus [software] - 10https://gerrit.wikimedia.org/r/936313 (owner: 10RobH) [14:49:42] (03PS3) 10RobH: addming new skus [software] - 10https://gerrit.wikimedia.org/r/936746 [14:51:35] !log tchin@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [14:51:38] (03CR) 10RobH: [C: 03+2] addming new skus [software] - 10https://gerrit.wikimedia.org/r/936746 (owner: 10RobH) [14:51:42] !log tchin@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:52:07] (03PS1) 10Andrew Bogott: radosgw: set per-user (aka per-project in swift) quotas. [puppet] - 10https://gerrit.wikimedia.org/r/936747 (https://phabricator.wikimedia.org/T338937) [14:53:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:54:36] (03CR) 10David Caro: "LGTM, we can re-adjust quotas later if needed" [puppet] - 10https://gerrit.wikimedia.org/r/936747 (https://phabricator.wikimedia.org/T338937) (owner: 10Andrew Bogott) [14:55:33] (03PS2) 10Andrew Bogott: radosgw: set per-user (aka per-project in swift) quotas. 
[puppet] - 10https://gerrit.wikimedia.org/r/936747 (https://phabricator.wikimedia.org/T338937) [14:55:46] (03CR) 10Alexandros Kosiaris: [C: 03+1] envoy: Limit the total number of active connections [puppet] - 10https://gerrit.wikimedia.org/r/935711 (https://phabricator.wikimedia.org/T340955) (owner: 10JMeybohm) [14:56:00] (03CR) 10Alexandros Kosiaris: [C: 03+1] envoy: Remove tls_minimum_protocol_version [puppet] - 10https://gerrit.wikimedia.org/r/935683 (https://phabricator.wikimedia.org/T337453) (owner: 10JMeybohm) [14:56:19] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6002.drmrs.wmnet [14:57:32] !log tchin@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [14:57:36] !log tchin@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:58:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:59:03] 10SRE, 10ops-eqsin, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Management LAN in eqsin offline due to failure of mr1-eqsin - https://phabricator.wikimedia.org/T341447 (10cmooney) p:05High→03Medium Device remains healthy after over an hour. In terms of what caused the initial problem the log... 
[14:59:25] 10SRE, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): expose_puppet_certs: Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (10jbond) [15:00:00] !log rebalance ganeti group eqiad/A after reboots [15:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:35] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns1006.wikimedia.org with reason: host reimage [15:02:47] (03PS1) 10Jsn.sherman: log additional events on Special:Diff|MobileDiff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936748 (https://phabricator.wikimedia.org/T326212) [15:04:50] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns1006.wikimedia.org with reason: host reimage [15:05:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6002.drmrs.wmnet [15:07:16] PROBLEM - SSH on bast6002 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:07:28] ^ vgutierrez what you were talking about [15:08:18] RECOVERY - SSH on bast6002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:11:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6002.drmrs.wmnet [15:11:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6002.drmrs.wmnet [15:15:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6001.drmrs.wmnet [15:16:00] (03CR) 10Andrew Bogott: radosgw: set per-user (aka per-project in swift) quotas. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936747 (https://phabricator.wikimedia.org/T338937) (owner: 10Andrew Bogott) [15:16:19] (03CR) 10Andrew Bogott: [C: 03+2] radosgw: set per-user (aka per-project in swift) quotas. 
[puppet] - 10https://gerrit.wikimedia.org/r/936747 (https://phabricator.wikimedia.org/T338937) (owner: 10Andrew Bogott) [15:16:48] jouncebot: nowandnext [15:16:48] No deployments scheduled for the next 0 hour(s) and 13 minute(s) [15:16:48] In 0 hour(s) and 13 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230710T1530) [15:17:25] 10SRE-swift-storage, 10Observability-Metrics: Split Thanos components from thanos-fe hosts - https://phabricator.wikimedia.org/T341488 (10fgiunchedi) [15:18:51] (03PS1) 10Majavah: wikitech: Update codfw1dev LDAP server hostname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936751 [15:18:53] (03PS1) 10Majavah: Disable UrlShortener on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936752 (https://phabricator.wikimedia.org/T341470) [15:19:31] (03PS1) 10Btullis: Configure karapace1001 to use the kafka-jumbo cluster [puppet] - 10https://gerrit.wikimedia.org/r/936753 (https://phabricator.wikimedia.org/T329514) [15:19:50] (03CR) 10Vgutierrez: hiera: add silent-drop directives for http frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [15:19:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6001.drmrs.wmnet [15:20:47] 10SRE, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): expose_puppet_certs: Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (10jbond) [15:21:16] 10SRE, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): expose_puppet_certs: Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (10jbond) I have marked of debmonitor as pki is used in production. 
[15:21:27] (CR) Btullis: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42376/console" [puppet] - https://gerrit.wikimedia.org/r/936753 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[15:23:41] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[15:23:47] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/936754 (https://phabricator.wikimedia.org/T128546)
[15:23:49] (CR) Btullis: [V: +1 C: +2] Configure karapace1001 to use the kafka-jumbo cluster [puppet] - https://gerrit.wikimedia.org/r/936753 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[15:25:13] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[15:25:14] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns1006.wikimedia.org with OS bullseye
[15:25:19] SRE, Traffic: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns1006.wikimedia.org with OS bullseye completed: - dns1006 (**PASS**) - Removed from Puppet and PuppetDB if present...
[15:26:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6001.drmrs.wmnet
[15:26:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6001.drmrs.wmnet
[15:27:34] (PS1) Btullis: Fix error in the motd definition for the karapace hosts [puppet] - https://gerrit.wikimedia.org/r/936755
[15:28:45] (CR) Btullis: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42377/console" [puppet] - https://gerrit.wikimedia.org/r/936755 (owner: Btullis)
[15:29:40] PROBLEM - Check systemd state on kafkamon1003 is CRITICAL: CRITICAL - degraded: The following units failed: burrow-jumbo-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:30:05] jan_drewniak: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230710T1530).
[15:30:29] !log homer "cr*-eqiad*" commit "Gerrit: 936720 add new DNS host dns1006"
[15:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:03] (CR) Btullis: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42378/console" [puppet] - https://gerrit.wikimedia.org/r/936755 (owner: Btullis)
[15:31:22] (Abandoned) Btullis: karapace: switch karapace to use kafka-jumbo1001 [puppet] - https://gerrit.wikimedia.org/r/787112 (https://phabricator.wikimedia.org/T301562) (owner: Razzi)
[15:32:22] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/936754 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[15:32:46] (CR) Btullis: [V: +1 C: +2] Fix error in the motd definition for the karapace hosts [puppet] - https://gerrit.wikimedia.org/r/936755 (owner: Btullis)
[15:32:56] SRE-swift-storage, Observability-Metrics: Split Thanos components from thanos-fe hosts - https://phabricator.wikimedia.org/T341488 (fgiunchedi) @MatthewVernon @Eevans please let me know what you think of the above proposal. I was imagining the final state to be `thanos-fe` / `thanos-be` running only Swif...
[15:33:35] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936754 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:40:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:41:48] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): rename cloudswift1001 as cloudlb1001 - https://phabricator.wikimedia.org/T341200 (10aborrero) [15:41:57] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): rename cloudswift1001 as cloudlb1001 - https://phabricator.wikimedia.org/T341200 (10aborrero) 05In progress→03Resolved [15:42:29] 10SRE-swift-storage, 10Observability-Metrics, 10User-fgiunchedi: Split Thanos components from thanos-fe hosts - https://phabricator.wikimedia.org/T341488 (10fgiunchedi) [15:45:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:46:49] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:936654| Bumping portals to master (T128546)]] (duration: 06m 31s) [15:46:54] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [15:50:22] (03PS1) 10Fabfur: hiera: removed dns1002 and dns1003 [puppet] - 10https://gerrit.wikimedia.org/r/936756 (https://phabricator.wikimedia.org/T326685) [15:51:08] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:53:20] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:936654| Bumping portals to master (T128546)]] (duration: 06m 30s) [15:53:23] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [15:54:34] (03CR) 10Clare Ming: [C: 03+1] "lgtm - thanks for all your work on this \o/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936748 (https://phabricator.wikimedia.org/T326212) (owner: 10Jsn.sherman) [15:54:50] (03CR) 10Ssingh: [C: 03+1] "Let's hold on merging this till we have moved ns0." [puppet] - 10https://gerrit.wikimedia.org/r/936756 (https://phabricator.wikimedia.org/T326685) (owner: 10Fabfur) [15:55:10] (03CR) 10Ssingh: [C: 03+1] "(LGTM otherwise!)" [puppet] - 10https://gerrit.wikimedia.org/r/936756 (https://phabricator.wikimedia.org/T326685) (owner: 10Fabfur) [15:56:08] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:57:54] (03PS2) 10Majavah: wikitech: Update codfw1dev LDAP server hostname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936751 [15:57:59] (03PS2) 10Majavah: Disable UrlShortener on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936752 (https://phabricator.wikimedia.org/T341470) [15:58:03] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936751 (owner: 10Majavah) [15:58:05] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936752 (https://phabricator.wikimedia.org/T341470) (owner: 10Majavah) [15:58:55] (03Merged) 10jenkins-bot: wikitech: Update 
codfw1dev LDAP server hostname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936751 (owner: 10Majavah) [15:59:02] (03Merged) 10jenkins-bot: Disable UrlShortener on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936752 (https://phabricator.wikimedia.org/T341470) (owner: 10Majavah) [15:59:17] !log taavi@deploy1002 Started scap: Backport for [[gerrit:936751|wikitech: Update codfw1dev LDAP server hostname]], [[gerrit:936752|Disable UrlShortener on wikitech (T341470)]] [15:59:21] T341470: UrlShortener throws DBConnectionError exception on wikitech - https://phabricator.wikimedia.org/T341470 [16:00:28] (03CR) 10JMeybohm: [C: 04-1] k8s::proxy: Start kube-proxy after ferm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/915461 (owner: 10Clément Goubert) [16:00:46] !log taavi@deploy1002 taavi: Backport for [[gerrit:936751|wikitech: Update codfw1dev LDAP server hostname]], [[gerrit:936752|Disable UrlShortener on wikitech (T341470)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [16:01:23] (03CR) 10Clare Ming: [C: 03+1] log additional events on Special:Diff|MobileDiff (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936748 (https://phabricator.wikimedia.org/T326212) (owner: 10Jsn.sherman) [16:02:26] 10SRE, 10Infrastructure-Foundations, 10netops: BFD flapping from cloudsw1-c8-eqiad (QFX5100) - https://phabricator.wikimedia.org/T341466 (10cmooney) 05Open→03Resolved Session to cloudlb1001 is stable after over an hour so think this is good to close now with the fix of using longer timers ` cmooney@cloud... 
[16:03:52] (03PS1) 10Fabfur: dns: remove dns1002 and 1003 [homer/public] - 10https://gerrit.wikimedia.org/r/936757 (https://phabricator.wikimedia.org/T326685) [16:05:20] (03CR) 10Jbond: [V: 03+1 C: 04-2] "this is currently not working" [puppet] - 10https://gerrit.wikimedia.org/r/936281 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [16:05:41] (03PS10) 10JMeybohm: k8s::apiserver: Implement kube-apiserver reload [puppet] - 10https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826) [16:07:01] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10bd808) Is there any particular reason that the "[ ] Wikitech is ideal to dogfood mw-on-k8s, there are challenges though that we need to over come T292707" step w... [16:07:05] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:936751|wikitech: Update codfw1dev LDAP server hostname]], [[gerrit:936752|Disable UrlShortener on wikitech (T341470)]] (duration: 07m 47s) [16:07:09] T341470: UrlShortener throws DBConnectionError exception on wikitech - https://phabricator.wikimedia.org/T341470 [16:07:36] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42379/console" [puppet] - 10https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [16:09:31] (03PS11) 10JMeybohm: k8s::apiserver: Implement kube-apiserver reload [puppet] - 10https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826) [16:11:12] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42380/console" [puppet] - 10https://gerrit.wikimedia.org/r/936666 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [16:13:40] PROBLEM - SSH on bast6002 is CRITICAL: Server answer: Exceeded MaxStartups 
https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:15:12] RECOVERY - SSH on bast6002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:15:40] (03CR) 10Nskaggs: "Thank you for setting larger quotas. +1 to encouraging people to migrate with a better offering, and part of that is a bigger quota." [puppet] - 10https://gerrit.wikimedia.org/r/936747 (https://phabricator.wikimedia.org/T338937) (owner: 10Andrew Bogott) [16:15:42] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:17] (03PS1) 10Jbond: rsyslog::receiver: update docs and add types [puppet] - 10https://gerrit.wikimedia.org/r/936762 (https://phabricator.wikimedia.org/T340741) [16:19:20] (03PS1) 10Jbond: rsyslog: update to use pki certificates [puppet] - 10https://gerrit.wikimedia.org/r/936763 (https://phabricator.wikimedia.org/T340741) [16:21:50] (03CR) 10CI reject: [V: 04-1] rsyslog: update to use pki certificates [puppet] - 10https://gerrit.wikimedia.org/r/936763 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [16:23:57] (03PS1) 10Hnowlan: api-gateway: emit no-cache unless otherwise asked [deployment-charts] - 10https://gerrit.wikimedia.org/r/936765 (https://phabricator.wikimedia.org/T338916) [16:25:39] !log ebernhardson@deploy1002 Started deploy [airflow-dags/search@8fa416b]: T328276: Change articletopic source to the outlink model [16:25:42] T328276: Add outlink topic model predictions to CirrusSearch indices - https://phabricator.wikimedia.org/T328276 [16:25:59] !log ebernhardson@deploy1002 Finished deploy [airflow-dags/search@8fa416b]: T328276: Change articletopic source to the outlink model (duration: 00m 20s) [16:26:40] !log ns0: set routing-options static route 208.80.154.238/32 next-hop [ 208.80.154.6 208.80.154.153 208.80.154.77 ] [16:26:41] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:16] (03PS2) 10Hnowlan: api-gateway: emit no-cache unless otherwise asked [deployment-charts] - 10https://gerrit.wikimedia.org/r/936765 (https://phabricator.wikimedia.org/T338916) [16:31:15] 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero) [16:31:26] 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero) p:05Triage→03Medium [16:35:05] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (FY2022/2023-Q4): Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756 (10Andrew) I'm fine with making things more verbose for now, then we can trim out things that... [16:39:55] 10SRE, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10Jgiannelos) 05Resolved→03Open [16:41:57] (03CR) 10Ssingh: [C: 03+2] dns: remove dns1002 and 1003 [homer/public] - 10https://gerrit.wikimedia.org/r/936757 (https://phabricator.wikimedia.org/T326685) (owner: 10Fabfur) [16:42:27] 10SRE, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10Jgiannelos) From comms with wikiwand: It seems User-Agent and Api-User-Agent (for client-side requests) are ignored, can you p... 
[16:42:40] (03PS1) 10Krinkle: webperf: Set XHGUI_PDO_INITSCHEMA=0 to avoid 'CREATE TABLE' fatal [puppet] - 10https://gerrit.wikimedia.org/r/936767 [16:44:10] !log homer "cr*-eqiad*" commit "Gerrit: 936757 remove DNS hosts dns1002 and dns1003" [16:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:56] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): spicerack: update spicrack to work with the newer puppet infrastructre - https://phabricator.wikimedia.org/T341496 (10jbond) [16:47:41] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): spicerack: update spicerack to work with the newer puppet infrastructure - https://phabricator.wikimedia.org/T341496 (10Volans) p:05Triage→03Medium [16:47:47] 10SRE, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): Cumin: update config to use new puppet7 infrastructre - https://phabricator.wikimedia.org/T341497 (10jbond) [16:48:05] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): Cumin: update config to use new puppet7 infrastructre - https://phabricator.wikimedia.org/T341497 (10jbond) p:05Triage→03Medium [16:49:54] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero) [16:50:11] (03CR) 10Ssingh: [C: 03+2] hiera: removed dns1002 and dns1003 [puppet] - 10https://gerrit.wikimedia.org/r/936756 (https://phabricator.wikimedia.org/T326685) (owner: 10Fabfur) [16:50:50] (03CR) 10Jsn.sherman: log additional events on Special:Diff|MobileDiff (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936748 (https://phabricator.wikimedia.org/T326212) (owner: 10Jsn.sherman) [16:52:04] 10SRE, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10akosiaris) 
`Wikiwand/0.1 (https://www.wikiwand.com; admin@wikiwand.com)` added to the list of user-agents. Please advise if it... [16:52:53] !log rolling restart of ntp.service on A:dns-rec [16:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230710T1700) [17:00:05] ryankemper: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230710T1700). [17:11:07] (03PS2) 10Krinkle: webperf: Set XHGUI_PDO_INITSCHEMA=0 to avoid 'CREATE TABLE' fatal [puppet] - 10https://gerrit.wikimedia.org/r/936767 (https://phabricator.wikimedia.org/T341499) [17:11:26] (03PS3) 10Krinkle: webperf: Set XHGUI_PDO_INITSCHEMA=0 to avoid 'CREATE TABLE' fatal [puppet] - 10https://gerrit.wikimedia.org/r/936767 (https://phabricator.wikimedia.org/T341499) [17:11:33] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/936767 (https://phabricator.wikimedia.org/T341499) (owner: 10Krinkle) [17:14:24] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T341503 (10phaultfinder) [17:15:00] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/936329/42383/planet1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/936329 (owner: 10Dzahn) [17:15:28] (03Abandoned) 10Dzahn: planet: remove buster support [puppet] - 10https://gerrit.wikimedia.org/r/936331 (owner: 10Dzahn) [17:16:08] PROBLEM - Check systemd state on doc2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-host-data-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:18:48] (03CR) 10Dzahn: [C: 03+2] miscweb: add statictendril release to miscweb staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/930886 
(https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [17:19:36] (03Merged) 10jenkins-bot: miscweb: add statictendril release to miscweb staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/930886 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [17:21:14] 10SRE, 10Add-Link, 10GrowthExperiments-NewcomerTasks, 10serviceops, 10Growth-Team (Current Sprint): linkrecommendation kubernetes service is down with HTTP 504: "upstream request timeout" - https://phabricator.wikimedia.org/T340780 (10Urbanecm_WMF) [17:24:49] !log dzahn@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [17:26:22] !log dzahn@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [17:32:10] 10SRE-tools, 10Infrastructure-Foundations, 10Goal, 10cloud-services-team (FY2022/2023-Q4): Improve how we run WMCS cookbooks - https://phabricator.wikimedia.org/T319401 (10fnegri) [17:32:17] 10SRE-tools, 10Infrastructure-Foundations, 10Goal, 10cloud-services-team (FY2022/2023-Q4): Improve how we run WMCS cookbooks - https://phabricator.wikimedia.org/T319401 (10fnegri) [17:33:26] (03CR) 10Dzahn: "curl against staging cluster looks good: https://phabricator.wikimedia.org/T340182#9002597" [deployment-charts] - 10https://gerrit.wikimedia.org/r/930886 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [17:33:41] (03PS1) 10Jbond: puppet: drop PuppetHosts.get_ca_servers [software/spicerack] - 10https://gerrit.wikimedia.org/r/936774 (https://phabricator.wikimedia.org/T341496) [17:47:08] (03PS1) 10Dzahn: miscweb: add statictendril to eqiad and codfw k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/936775 (https://phabricator.wikimedia.org/T340182) [17:47:46] (03PS1) 10Ssingh: common.yaml: remove dns1002 and dns1003 from ntp_peers [homer/public] - 10https://gerrit.wikimedia.org/r/936776 (https://phabricator.wikimedia.org/T326685) [17:48:59] (03CR) 10Ssingh: [C: 03+2] common.yaml: remove dns1002 and dns1003 from ntp_peers [homer/public] - 
10https://gerrit.wikimedia.org/r/936776 (https://phabricator.wikimedia.org/T326685) (owner: 10Ssingh) [17:51:35] !log homer "mr*" commit "update ntp_servers (remove dns100[2-3], add dns100[5-6])" [17:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:12] (03CR) 10Dzahn: [C: 03+2] miscweb: add statictendril to eqiad and codfw k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/936775 (https://phabricator.wikimedia.org/T340182) (owner: 10Dzahn) [17:55:13] (03Merged) 10jenkins-bot: miscweb: add statictendril to eqiad and codfw k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/936775 (https://phabricator.wikimedia.org/T340182) (owner: 10Dzahn) [17:55:40] !log dzahn@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [17:59:30] (03PS2) 10Ladsgroup: sre.mysql.clone: Only encrypt data transfers between DCs [cookbooks] - 10https://gerrit.wikimedia.org/r/936287 [18:00:33] (03CR) 10Ladsgroup: sre.mysql.clone: Only encrypt data transfers between DCs (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/936287 (owner: 10Ladsgroup) [18:02:17] (03CR) 10CI reject: [V: 04-1] sre.mysql.clone: Only encrypt data transfers between DCs [cookbooks] - 10https://gerrit.wikimedia.org/r/936287 (owner: 10Ladsgroup) [18:03:13] !log dzahn@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [18:06:54] (03CR) 10Michael Große: Beta-Wikidata: Always show mul on desktop Termbox (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/935770 (https://phabricator.wikimedia.org/T339104) (owner: 10Michael Große) [18:10:20] (03PS3) 10Ladsgroup: sre.mysql.clone: Only encrypt data transfers between DCs [cookbooks] - 10https://gerrit.wikimedia.org/r/936287 [18:13:40] RECOVERY - Check systemd state on doc2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:14:04] 10ops-eqiad, 10Traffic: eqiad dns100[1-3] unified decommission task 
- https://phabricator.wikimedia.org/T341507 (10ssingh) [18:14:30] (03PS1) 10Jbond: puppet: Add versions method which will return the version of the agnts [software/spicerack] - 10https://gerrit.wikimedia.org/r/936781 (https://phabricator.wikimedia.org/T341496) [18:14:32] (03PS1) 10Jbond: WIP:puppet: Add support for puppetserver v7 [software/spicerack] - 10https://gerrit.wikimedia.org/r/936782 [18:17:58] (03CR) 10CI reject: [V: 04-1] WIP:puppet: Add support for puppetserver v7 [software/spicerack] - 10https://gerrit.wikimedia.org/r/936782 (owner: 10Jbond) [18:18:21] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:18:23] (03CR) 10CI reject: [V: 04-1] puppet: Add versions method which will return the version of the agnts [software/spicerack] - 10https://gerrit.wikimedia.org/r/936781 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond) [18:25:44] (03CR) 10Fabfur: hiera: add silent-drop directives for http frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur) [18:26:14] !log dzahn@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [18:29:06] !log dzahn@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [18:31:32] !log dzahn@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [18:32:38] !log dzahn@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [18:35:49] 10SRE-tools, 10Infrastructure-Foundations: Add --depool-sleep runtime argument when using SRELBBatchRunner class - https://phabricator.wikimedia.org/T339151 (10BCornwall) 05Open→03Stalled a:03BBlack This was under the request of @BBlack - I believe the intention was that this would be "good enough" for t... 
[18:37:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.decommission for hosts dbproxy1012.eqiad.wmnet [18:38:49] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts dns[1002-1003].wikimedia.org [18:40:24] (03PS1) 10Dzahn: trafficserver: switch dbtree/tendril to k8s backend [puppet] - 10https://gerrit.wikimedia.org/r/936785 (https://phabricator.wikimedia.org/T340182) [18:40:44] (03CR) 10CI reject: [V: 04-1] trafficserver: switch dbtree/tendril to k8s backend [puppet] - 10https://gerrit.wikimedia.org/r/936785 (https://phabricator.wikimedia.org/T340182) (owner: 10Dzahn) [18:41:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:42:56] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2103 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/935885 (https://phabricator.wikimedia.org/T341511) [18:43:39] !log ladsgroup@cumin1001 START - Cookbook sre.dns.netbox [18:44:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 37 hosts with reason: Primary switchover s1 T341511 [18:44:50] T341511: Switchover s1 master (db2112 -> db2103) - https://phabricator.wikimedia.org/T341511 [18:45:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 37 hosts with reason: Primary switchover s1 T341511 [18:45:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db2103 with weight 0 T341511', diff saved to https://phabricator.wikimedia.org/P49535 and previous config saved to /var/cache/conftool/dbconfig/20230710-184521-ladsgroup.json [18:46:12] !log sukhe@cumin2002 START - Cookbook 
sre.dns.netbox [18:46:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:46:20] 10SRE, 10ops-eqiad, 10Traffic: eqiad dns100[1-3] unified decommission task - https://phabricator.wikimedia.org/T341507 (10ssingh) [18:47:06] 10SRE, 10ops-eqiad, 10Traffic: eqiad dns100[1-3] unified decommission task - https://phabricator.wikimedia.org/T341507 (10ssingh) The hosts have been decomissioned and ready for the hardware part. [18:47:54] Amir1: ok to remove dbproxy entries? [18:47:57] -134 1H IN PTR dbproxy1012.eqiad.wmnet. [18:48:20] 14:37:38 <+logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.decommission for hosts dbproxy1012.eqiad.wmnet [18:48:23] DNS changes from here [18:48:52] 18:39 < Amir1> Hey, I'm decommissioning dbproxy10[12-17] and they are mentioned in two helm charts: [18:49:16] yeah should be fine, plus the cookbook already ran by now! [18:49:17] thanks [18:49:53] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns[1002-1003].wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [18:49:54] !log ladsgroup@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:49:55] !log ladsgroup@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts dbproxy1012.eqiad.wmnet [18:49:58] unless something is unhappy when names dont resolve at all.. 
vs host being just unreachable [18:50:03] uh oh [18:50:18] which I guess is expected [18:50:25] the uh oh was for the failure above :) [18:50:27] sorry I missed this [18:50:31] np, resolved [18:50:46] I am going to remove the rest of the dbproxy stuff too :> [18:50:54] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns[1002-1003].wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [18:50:54] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:50:55] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dns[1002-1003].wikimedia.org [18:51:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.decommission for hosts dbproxy1012.eqiad.wmnet [18:51:03] 10SRE, 10Traffic: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `dns[1002-1003].wikimedia.org` - dns1002.wikimedia.org (**WARN**) - Downtimed host on Icinga/Alertmanager - Found ph... [18:51:38] Amir1: did I break your cookbook? [18:51:41] sorry if I did [18:51:44] what was the error you got? [18:54:41] 10SRE, 10Traffic: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (10ssingh) [18:55:08] 10SRE, 10Traffic: Reduce toil in provisioning and decommissioning of DNS/NTP servers by automating generation of resolv.conf and NTP peers - https://phabricator.wikimedia.org/T340479 (10ssingh) [18:55:29] 10SRE, 10Traffic: Q4:rack/setup/install dns100[456] - https://phabricator.wikimedia.org/T326685 (10ssingh) 05In progress→03Resolved Traffic has commissioned these boxes. Many thanks to dc-ops! [18:55:49] !log ladsgroup@cumin1001 START - Cookbook sre.dns.netbox [18:56:04] !log finished commissioning new DNS hosts in eqiad: dns100[4-6]. decommissioned dns100[1-3]. 
[18:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:57:01] (03PS1) 10Ssingh: templates: dummy commit to test new DNS boxes [dns] - 10https://gerrit.wikimedia.org/r/936787 [18:57:02] !log ladsgroup@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts dbproxy1012.eqiad.wmnet [18:58:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [18:59:30] (03CR) 10Ssingh: [C: 03+2] templates: dummy commit to test new DNS boxes [dns] - 10https://gerrit.wikimedia.org/r/936787 (owner: 10Ssingh) [18:59:42] !log running authdns-update [18:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:00] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [19:09:28] (03PS2) 10Dzahn: trafficserver: switch dbtree/tendril to k8s backend [puppet] - 10https://gerrit.wikimedia.org/r/936785 (https://phabricator.wikimedia.org/T340182) [19:10:13] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Promote db2103 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/935885 (https://phabricator.wikimedia.org/T341511) (owner: 10Gerrit maintenance bot) [19:11:00] (03CR) 10Dzahn: [C: 03+2] trafficserver: switch dbtree/tendril to k8s backend [puppet] - 
10https://gerrit.wikimedia.org/r/936785 (https://phabricator.wikimedia.org/T340182) (owner: 10Dzahn) [19:12:06] !log Starting s1 codfw failover from db2112 to db2103 - T341511 [19:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:10] T341511: Switchover s1 master (db2112 -> db2103) - https://phabricator.wikimedia.org/T341511 [19:13:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db2103 to s1 primary T341511', diff saved to https://phabricator.wikimedia.org/P49536 and previous config saved to /var/cache/conftool/dbconfig/20230710-191259-ladsgroup.json [19:15:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db2112 T341511', diff saved to https://phabricator.wikimedia.org/P49537 and previous config saved to /var/cache/conftool/dbconfig/20230710-191511-ladsgroup.json [19:17:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: Maintenance [19:17:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: Maintenance [19:18:34] (03PS1) 10Ladsgroup: ExternalLinks: Make oneWildcard avoid adding wildcard to domain [core] (wmf/1.41.0-wmf.16) - 10https://gerrit.wikimedia.org/r/936733 (https://phabricator.wikimedia.org/T326251) [19:20:41] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission dbproxy1012.eqiad.wmnet - https://phabricator.wikimedia.org/T341510 (10Ladsgroup) a:05Ladsgroup→03wiki_willy [19:21:05] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission dbproxy1012.eqiad.wmnet - https://phabricator.wikimedia.org/T341510 (10Ladsgroup) The cookbook was a bit messy but it should be done now [19:21:13] (03Abandoned) 10Dzahn: miscweb: add release statictendril to eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/930887 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [19:21:57] (03PS2) 10Dzahn: miscweb: remove static_tendril 
classes and files [puppet] - 10https://gerrit.wikimedia.org/r/932337 (https://phabricator.wikimedia.org/T300171) [19:23:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2112.codfw.wmnet with reason: Maintenance [19:23:47] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission dbproxy1012.eqiad.wmnet - https://phabricator.wikimedia.org/T341510 (10wiki_willy) a:05wiki_willy→03Jclark-ctr [19:23:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2112.codfw.wmnet with reason: Maintenance [19:33:56] (03CR) 10Hashar: [C: 03+1] ci/zuul: set contint2002 as the active ci::manager_host [puppet] - 10https://gerrit.wikimedia.org/r/935919 (https://phabricator.wikimedia.org/T324659) (owner: 10Jelto) [19:35:20] (03PS2) 10Samtar: log additional events on Special:Diff|MobileDiff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936748 (https://phabricator.wikimedia.org/T326212) (owner: 10Jsn.sherman) [19:40:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2112 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P49538 and previous config saved to /var/cache/conftool/dbconfig/20230710-194022-ladsgroup.json [19:48:30] (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [19:52:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1119.eqiad.wmnet with reason: Maintenance [19:52:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1119.eqiad.wmnet with reason: Maintenance [19:55:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2112 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P49540 and previous config saved to 
/var/cache/conftool/dbconfig/20230710-195527-ladsgroup.json [19:59:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1124.eqiad.wmnet with reason: Reboot [20:00:06] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: My dear minions, it's time we take the moon! Just kidding. Time for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230710T2000). [20:00:06] JSherman: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1124.eqiad.wmnet with reason: Reboot [20:00:19] * TheresNoTime can deploy [20:00:34] 10SRE-swift-storage, 10Observability-Metrics, 10User-fgiunchedi: Split Thanos components from thanos-fe hosts - https://phabricator.wikimedia.org/T341488 (10Eevans) >>! In T341488#9001995, @fgiunchedi wrote: > @MatthewVernon @Eevans please let me know what you think of the above proposal. I was imagining the... 
[20:01:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936748 (https://phabricator.wikimedia.org/T326212) (owner: 10Jsn.sherman) [20:01:57] (03Merged) 10jenkins-bot: log additional events on Special:Diff|MobileDiff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936748 (https://phabricator.wikimedia.org/T326212) (owner: 10Jsn.sherman) [20:02:13] !log samtar@deploy1002 Started scap: Backport for [[gerrit:936748|log additional events on Special:Diff|MobileDiff (T326212)]] [20:02:18] T326212: Improve data logging on Special:Diff and Special:MobileDiff - https://phabricator.wikimedia.org/T326212 [20:03:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [20:03:35] !log samtar@deploy1002 samtar and jsn: Backport for [[gerrit:936748|log additional events on Special:Diff|MobileDiff (T326212)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [20:03:47] JSherman: can you test this change on mwdebug? [20:03:59] wilco [20:08:38] TheresNoTime: So I'm navigating diffs with the debug extension, and then checking https://stream.wikimedia.org/v1/stream/mediawiki.special_diff_interactions but I'm not seeing anything. Maybe I don't know how to access production events? [20:09:52] I'm not seeing any events on https://logstash.wikimedia.org/app/dashboards#/view/mwdebug1002?_g=h@8daf61d&_a=h@7f0701a, are you sure you're using a mwdebug server via https://wikitech.wikimedia.org/wiki/WikimediaDebug ? 
[20:09:54] oh, helps to use the right url: https://stream.wikimedia.org/v2/stream/mediawiki.special_diff_interactions but I'm getting stream not found [20:10:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2112 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P49541 and previous config saved to /var/cache/conftool/dbconfig/20230710-201031-ladsgroup.json [20:11:23] JSherman: how does stream pick it up? [20:11:43] I'm sure this isn't the first time new events haven't been noticed from debug [20:12:21] (03CR) 10Dzahn: [C: 03+2] miscweb: remove static_tendril classes and files [puppet] - 10https://gerrit.wikimedia.org/r/932337 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [20:13:32] (03PS4) 10Krinkle: webperf: Set XHGUI_PDO_INITSCHEMA=0 to avoid 'CREATE TABLE' fatal [puppet] - 10https://gerrit.wikimedia.org/r/936767 [20:13:52] (03PS5) 10Krinkle: webperf: Set XHGUI_PDO_INITSCHEMA=false to avoid 'CREATE TABLE' fatal [puppet] - 10https://gerrit.wikimedia.org/r/936767 [20:14:00] RhinosF1: that is a good question that I don't know the answer to. I'm realizing that I may be coming into this too naively. When I deployed this to beta, I was just able to curl https://stream-beta.wmflabs.org/v2/stream/mediawiki.special_diff_interactions and get log events [20:14:33] !log miscweb1003/miscweb2003 - rm -rf /srv/org/wikimedia/static-tendril [20:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:08] JSherman: is there anyone online or can we verify it won't break anything else if we were to sync and wait a few minutes [20:15:20] Assuming TheresNoTime is comfortable [20:16:00] JSherman: I'm not seeing any obvious errors, what's the risks of syncing? 
I do note that https://stream-beta.wmflabs.org/?doc#/streams lists the stream, whereas https://stream.wikimedia.org/?doc#/streams does not [20:16:46] yeah, with beta, the stream wasn't created/available until there were events in the topic [20:17:15] PROBLEM - Host parse1012 is DOWN: PING CRITICAL - Packet loss = 100% [20:17:19] okay, makes sense — I'm happy to sync this and revert if needed [20:18:00] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [20:18:02] TheresNoTime: I appreciate that; I'm ready to test [20:18:11] syncing [20:18:15] RECOVERY - Host parse1012 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [20:19:18] !log syncing https://gerrit.wikimedia.org/r/c/936748 untested (T326212) for test after sync [20:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:22] T326212: Improve data logging on Special:Diff and Special:MobileDiff - https://phabricator.wikimedia.org/T326212 [20:23:53] !log bking@wdqs1006 Restart wdqs-blazegraph to hopefully clear the free allocators alerts [20:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:56] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:936748|log additional events on Special:Diff|MobileDiff (T326212)]] (duration: 21m 42s) [20:24:14] JSherman: okay, please test [20:25:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2112 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P49544 and previous config saved to /var/cache/conftool/dbconfig/20230710-202536-ladsgroup.json [20:26:10] TheresNoTime: hmm, still not seeing anything, though it's my understanding that 
there can be some lag [20:26:41] okay, I'll keep an eye for errors but let's leave it 15 minutes? [20:27:07] sounds good; I'll be clicking around and curling in the mean time. [20:28:25] (03PS1) 10Btullis: Configure datahub staging to use the new karapace instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/936791 (https://phabricator.wikimedia.org/T329514) [20:31:40] TheresNoTime: It looks like my instrument isn't posting, though I can see readers instrument is posting just fine. [20:31:59] JSherman: hm, would you like to revert? [20:32:25] yeah, let's do that; I'll go back and try to sort out why that's happening. [20:32:51] (03PS1) 10Samtar: Revert "log additional events on Special:Diff|MobileDiff" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936735 [20:33:01] (03CR) 10Btullis: [C: 03+2] Configure datahub staging to use the new karapace instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/936791 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [20:33:41] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936735 (owner: 10Samtar) [20:33:45] (03Merged) 10jenkins-bot: Configure datahub staging to use the new karapace instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/936791 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [20:34:28] (03Merged) 10jenkins-bot: Revert "log additional events on Special:Diff|MobileDiff" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936735 (owner: 10Samtar) [20:34:46] !log samtar@deploy1002 Started scap: Backport for [[gerrit:936735|Revert "log additional events on Special:Diff|MobileDiff"]] [20:36:11] !log samtar@deploy1002 samtar: Backport for [[gerrit:936735|Revert "log additional events on Special:Diff|MobileDiff"]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [20:36:24] (syncing, forgot 
to bypass that) [20:37:37] TheresNoTime: thanks for your deployment & reversion efforts! [20:37:48] No worries, sorry it didn't work out! :D [20:39:39] (03CR) 10BCornwall: [V: 03+1 C: 03+2] fifo_log_demux: Fix systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/927989 (https://phabricator.wikimedia.org/T284555) (owner: 10Vgutierrez) [20:40:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:40:21] (03CR) 10Jforrester: [DNM][WIP] Initial configuration for Wikifunctions.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [20:40:27] (03PS3) 10Jforrester: Follow-up ca3aa70754: Drop 30x30px Notifications icons, unused for 7 years [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934630 (https://phabricator.wikimedia.org/T147219) [20:40:29] (03PS5) 10Jforrester: Add wikifunctions.org to wgCentralNoticeContentSecurityPolicy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771622 (https://phabricator.wikimedia.org/T275945) [20:40:31] (03PS5) 10Jforrester: [DNM] Add wikifunctions.org to prod wgLocalVirtualHosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771623 (https://phabricator.wikimedia.org/T275945) [20:40:33] (03PS4) 10Jforrester: [DNM][WIP] Initial configuration for Wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945) [20:41:00] Thanks for trying TheresNoTime [20:41:11] ^^ [20:41:28] (03PS1) 10Btullis: Configure the test datahub jobs to use the staging schema registry [puppet] - 
10https://gerrit.wikimedia.org/r/936792 (https://phabricator.wikimedia.org/T329514) [20:42:10] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [20:42:13] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:936735|Revert "log additional events on Special:Diff|MobileDiff"]] (duration: 07m 27s) [20:43:12] !log close UTC late backport window [20:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:46:09] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [20:53:10] (03CR) 10Clare Ming: "sorry i missed this -- just noticed you had to revert -- i think it's because you didn't define a sampling rate in your production stream" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936748 (https://phabricator.wikimedia.org/T326212) (owner: 10Jsn.sherman) [20:54:18] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:59:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor 
Q:Why did functions stop calling each other? A:They had arguments. Rise for Weekly Security deployment window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230710T2100). [21:00:30] oh that one is baaaad [21:02:39] (PS2) Jdlrobson: Logos: Fixes grantswiki and idwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/936097 [21:14:19] (CR) BCornwall: [C: +1] hieradata: labweb: update lvs pool to reference the ssl service [puppet] - https://gerrit.wikimedia.org/r/831173 (https://phabricator.wikimedia.org/T317463) (owner: Majavah) [21:14:29] (CR) BCornwall: [C: +1] service: remove plaintext labweb service (I) [puppet] - https://gerrit.wikimedia.org/r/831174 (https://phabricator.wikimedia.org/T317463) (owner: Majavah) [21:14:35] (CR) BCornwall: [C: +1] service: remove plaintext labweb service (II) [puppet] - https://gerrit.wikimedia.org/r/831175 (https://phabricator.wikimedia.org/T317463) (owner: Majavah) [21:15:42] (CR) BCornwall: [C: +1] "LGTM, nit inline" [puppet] - https://gerrit.wikimedia.org/r/831176 (https://phabricator.wikimedia.org/T317463) (owner: Majavah) [21:18:14] SRE, Infrastructure Security: Research improvements to Pwstore process - https://phabricator.wikimedia.org/T298194 (Aklapper) [21:22:39] SRE, Traffic-Icebox, Patch-For-Review: Remove unused plain HTTP services from LVS - https://phabricator.wikimedia.org/T236065 (BCornwall) [21:25:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:30:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve
- https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:33:08] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:33:18] (PS1) Btullis: Permit staging datahub to access karapace1002 [deployment-charts] - https://gerrit.wikimedia.org/r/936793 (https://phabricator.wikimedia.org/T329514) [21:34:57] (CR) Btullis: [C: +2] Permit staging datahub to access karapace1002 [deployment-charts] - https://gerrit.wikimedia.org/r/936793 (https://phabricator.wikimedia.org/T329514) (owner: Btullis) [21:35:33] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:35:44] (Merged) jenkins-bot: Permit staging datahub to access karapace1002 [deployment-charts] - https://gerrit.wikimedia.org/r/936793 (https://phabricator.wikimedia.org/T329514) (owner: Btullis) [21:35:47] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:36:53] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124 [21:37:05] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.628 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:37:09] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50276 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:37:46] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 00m 52s) [21:38:05] SRE, ops-knams, DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (wiki_willy) [21:38:08] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST services)
on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:38:19] SRE, ops-knams, DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (wiki_willy) [21:39:33] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [21:42:05] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [22:10:22] (PS2) Fabfur: hiera: add silent-drop directives for http frontend [puppet] - https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) [22:12:06] !log Deployed security patch for T340200 [22:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:00] (CR) Fabfur: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42384/console" [puppet] - https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: Fabfur) [22:19:19] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:21:44] (PS3) Fabfur: hiera: add silent-drop directives for http frontend [puppet] - https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) [22:33:54] (PS1) Ladsgroup: Override liftwing hostname [mediawiki-config] - https://gerrit.wikimedia.org/r/936796 (https://phabricator.wikimedia.org/T319170) [22:34:50] (CR) Majavah: "Shouldn't this be using a service proxy?"
[mediawiki-config] - https://gerrit.wikimedia.org/r/936796 (https://phabricator.wikimedia.org/T319170) (owner: Ladsgroup) [22:45:55] (CR) Fabfur: hiera: add silent-drop directives for http frontend (1 comment) [puppet] - https://gerrit.wikimedia.org/r/936701 (https://phabricator.wikimedia.org/T340983) (owner: Fabfur) [22:58:14] (PS2) Bartosz Dziewoński: Remove unneeded $wgDefaultUserOptions['visualeditor-enable'] settings [mediawiki-config] - https://gerrit.wikimedia.org/r/933998 (https://phabricator.wikimedia.org/T340696) [23:07:16] (PS1) BryanDavis: toolforge: Add more CORS headers to docker registry [puppet] - https://gerrit.wikimedia.org/r/936797 (https://phabricator.wikimedia.org/T232135) [23:11:30] !log krinkle@xhgui1001$ Define new `xhgui.watches` table via xhguiadmin@m2-master.eqiad.wmnet database, ref T341499 [23:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:34] T341499: Upgrade XHGui from 0.14.0 to latest (0.21.3) - https://phabricator.wikimedia.org/T341499 [23:13:36] (CR) Krinkle: [V: +1] "I've tested this in Beta Cluster first, both on the version currently in production via performance/docroot.git (xhgui 0.14.0), and with t" [puppet] - https://gerrit.wikimedia.org/r/936767 (owner: Krinkle) [23:13:59] (CR) BryanDavis: "PCC output: https://puppet-compiler.wmflabs.org/output/936797/42386/" [puppet] - https://gerrit.wikimedia.org/r/936797 (https://phabricator.wikimedia.org/T232135) (owner: BryanDavis) [23:48:30] (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer