[00:03:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1153.eqiad.wmnet with reason: host reimage
[00:05:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:10:04] (CR) Cwhite: [C: +2] logstash: remove k8s stats-exporter cloning [puppet] - https://gerrit.wikimedia.org/r/937603 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[00:12:10] SRE, LDAP-Access-Requests: Grant Access to Turnilo for Mpossoupe - https://phabricator.wikimedia.org/T342335 (andrea.denisse) Hi @Mpossoupe , I've added you to the 'wmf' LDAP group. ` denisse@mwmaint1002:~$ ldapsearch -A -x member=uid=mpossoupe,ou=people,dc=wikimedia,dc=org dn # extended LDIF # # LDAPv...
[00:13:24] SRE, LDAP-Access-Requests: Request for Turnilo Access - https://phabricator.wikimedia.org/T342132 (andrea.denisse) Open→Invalid Closing in favor of T342335.
[00:18:54] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[00:25:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[00:25:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1153.eqiad.wmnet with OS bullseye
[00:25:26] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1153.eqiad.wmnet with OS bullseye completed: - an-worker1153 (*...
[00:26:08] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (Jhancock.wm)
[00:26:38] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (Jhancock.wm) finished 53-56. should have time to finish the last 4 tomorrow afternoon
[00:33:03] (PS1) Andrea Denisse: groups: Add taavi to the ops group [puppet] - https://gerrit.wikimedia.org/r/940269 (https://phabricator.wikimedia.org/T342307)
[00:36:32] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to ops for taavi - https://phabricator.wikimedia.org/T342307 (andrea.denisse)
[00:37:02] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to ops for taavi - https://phabricator.wikimedia.org/T342307 (andrea.denisse) Open→In progress
[00:38:43] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/940207
[00:38:49] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/940207 (owner: TrainBranchBot)
[01:10:57] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/940207 (owner: TrainBranchBot)
[01:33:41] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (Papaul) Thank you
[01:43:14] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:43:22] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:18:14] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:19:28] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:30:06] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:18] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:33] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:49:28] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:50:26] !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=active; selector: name=wdqs222([0-1])\.codfw\.wmnet
[03:50:50] oops, slightly wrong syntax
[03:51:13] !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=yes; selector: name=wdqs202([1-2])\.codfw\.wmnet
[03:53:25] (PS1) Ryan Kemper: wdqs: re-enable alerting on last 2 new hosts [puppet] - https://gerrit.wikimedia.org/r/940272 (https://phabricator.wikimedia.org/T332314)
[03:55:06] (CR) Ryan Kemper: [C: +2] wdqs: re-enable alerting on last 2 new hosts [puppet] - https://gerrit.wikimedia.org/r/940272 (https://phabricator.wikimedia.org/T332314) (owner: Ryan Kemper)
[03:55:23] (CR) Ryan Kemper: [C: +2] "Self-merging since (a) change is simple and (b) these hosts have been brought into service now so we want alerts firing." [puppet] - https://gerrit.wikimedia.org/r/940272 (https://phabricator.wikimedia.org/T332314) (owner: Ryan Kemper)
[04:00:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:00:22] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 0:01:00 on 10 hosts with reason: trying to remove downtime on these new hosts
[04:00:30] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:01:00 on 10 hosts with reason: trying to remove downtime on these new hosts
[05:08:30] (PS1) Marostegui: db1208: Install MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/940273 (https://phabricator.wikimedia.org/T334055)
[05:09:54] (CR) Marostegui: [C: +2] db1208: Install MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/940273 (https://phabricator.wikimedia.org/T334055) (owner: Marostegui)
[05:40:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:45:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:46:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:56:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230721T0600)
[06:02:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:07:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:08:04] (CR) Tim Starling: [C: +1] Profiler: Remove "toobig" filter from Arc Lamp ingestion [mediawiki-config] - https://gerrit.wikimedia.org/r/939755 (https://phabricator.wikimedia.org/T337873) (owner: Krinkle)
[06:16:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:21:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:28:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:33:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:34:31] (PS1) Marostegui: db1208: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/940276 (https://phabricator.wikimedia.org/T334055)
[06:36:56] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 139148
[06:36:57] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'configure' for AS: 139148
[06:37:20] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 139418
[06:37:37] (CR) Marostegui: [C: +2] db1208: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/940276 (https://phabricator.wikimedia.org/T334055) (owner: Marostegui)
[06:38:11] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 139418
[06:39:55] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 398203
[06:40:34] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398203
[06:40:40] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 3209
[06:40:48] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 3209
[06:45:47] (CR) Marostegui: [C: +1] spicerack: Add config file for MySQL/MariaDB [puppet] - https://gerrit.wikimedia.org/r/939302 (owner: Ladsgroup)
[06:52:56] (PS2) Giuseppe Lavagetto: mediawiki: add ingress support [deployment-charts] - https://gerrit.wikimedia.org/r/940189 (https://phabricator.wikimedia.org/T342356)
[06:52:58] (PS2) Giuseppe Lavagetto: admin: add mw-misc namespace [deployment-charts] - https://gerrit.wikimedia.org/r/940198 (https://phabricator.wikimedia.org/T341859)
[06:53:00] (PS2) Giuseppe Lavagetto: mw-misc: add deployment with support for noc.wikimedia.org [deployment-charts] - https://gerrit.wikimedia.org/r/940199
[06:54:16] (CR) CI reject: [V: -1] mw-misc: add deployment with support for noc.wikimedia.org [deployment-charts] - https://gerrit.wikimedia.org/r/940199 (owner: Giuseppe Lavagetto)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230721T0700)
[07:01:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2171 (s5 and s6)', diff saved to https://phabricator.wikimedia.org/P49620 and previous config saved to /var/cache/conftool/dbconfig/20230721-070110-root.json
[07:02:27] (PS1) Marostegui: db2171: Migrate to mariadb 10.6 [puppet] - https://gerrit.wikimedia.org/r/940316 (https://phabricator.wikimedia.org/T334650)
[07:03:15] ACKNOWLEDGEMENT - Check systemd state on db1108 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter@analytics_meta.service,wmf_auto_restart_prometheus-mysqld-exporter@matomo.service Marostegui Host will be decommissioned https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:03:42] (CR) Marostegui: [C: +2] db2171: Migrate to mariadb 10.6 [puppet] - https://gerrit.wikimedia.org/r/940316 (https://phabricator.wikimedia.org/T334650) (owner: Marostegui)
[07:09:14] (CR) Giuseppe Lavagetto: [C: +1] "LGTM; do we want to merge it?" [puppet] - https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: Giuseppe Lavagetto)
[07:10:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:10:59] (CR) Giuseppe Lavagetto: [V: +1 C: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42618/console" [puppet] - https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: Giuseppe Lavagetto)
[07:11:17] (PS1) Marostegui: dbstore1005: Migrate to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/940317 (https://phabricator.wikimedia.org/T334652)
[07:11:51] (CR) Marostegui: [C: +2] dbstore1005: Migrate to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/940317 (https://phabricator.wikimedia.org/T334652) (owner: Marostegui)
[07:11:56] (CR) Giuseppe Lavagetto: kubernetes::master: Add confd config writing all sa certs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[07:12:33] !log Upgrade dbstore1005 to mariadb 10.6 T334652
[07:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:37] T334652: Migrate dbstore1005 to MariaDB 10.6 - https://phabricator.wikimedia.org/T334652
[07:14:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2171:3315 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49621 and previous config saved to /var/cache/conftool/dbconfig/20230721-071422-root.json
[07:14:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2171:3316 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49622 and previous config saved to /var/cache/conftool/dbconfig/20230721-071430-root.json
[07:16:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1201', diff saved to https://phabricator.wikimedia.org/P49623 and previous config saved to /var/cache/conftool/dbconfig/20230721-071623-root.json
[07:17:23] (PS1) Marostegui: db1201: Migrate to mariadb 10.6 [puppet] - https://gerrit.wikimedia.org/r/940318 (https://phabricator.wikimedia.org/T334650)
[07:17:56] (CR) Marostegui: [C: +2] db1201: Migrate to mariadb 10.6 [puppet] - https://gerrit.wikimedia.org/r/940318 (https://phabricator.wikimedia.org/T334650) (owner: Marostegui)
[07:20:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49624 and previous config saved to /var/cache/conftool/dbconfig/20230721-072052-root.json
[07:25:18] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install pc201[56] - https://phabricator.wikimedia.org/T342163 (Marostegui) partman recipe assigned
[07:26:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:27:30] (PS1) Marostegui: pc2015,pc2016: New hosts [puppet] - https://gerrit.wikimedia.org/r/940319 (https://phabricator.wikimedia.org/T342163)
[07:28:48] (PS2) Marostegui: pc2015,pc2016: New hosts [puppet] - https://gerrit.wikimedia.org/r/940319 (https://phabricator.wikimedia.org/T342163)
[07:29:25] (CR) Marostegui: [C: +2] pc2015,pc2016: New hosts [puppet] - https://gerrit.wikimedia.org/r/940319 (https://phabricator.wikimedia.org/T342163) (owner: Marostegui)
[07:29:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2171:3315 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49625 and previous config saved to /var/cache/conftool/dbconfig/20230721-072927-root.json
[07:29:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2171:3316 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49626 and previous config saved to /var/cache/conftool/dbconfig/20230721-072935-root.json
[07:31:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:33:31] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[07:33:36] (ProbeDown) firing: (2) Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:33:42] mmmm gerrit down?
[07:33:45] PROBLEM - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[07:33:47] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[07:33:49] SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (Marostegui) For db1131, it is a master. When do you plan to do this? I'd need a couple of days to remove its master role.
[07:34:53] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 74583 bytes in 0.040 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[07:35:07] RECOVERY - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is OK: OK - Certificate gerrit.wikimedia.org will expire on Sun 08 Oct 2023 09:52:13 PM GMT +0000. https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[07:35:09] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 973 bytes in 0.028 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[07:35:52] didn't do anything, auto-recovered
[07:35:56] cc: hashar: --^
[07:35:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49627 and previous config saved to /var/cache/conftool/dbconfig/20230721-073557-root.json
[07:38:36] (ProbeDown) resolved: (2) Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:40:28] (CR) Elukey: "Very nice job 😊" [deployment-charts] - https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266) (owner: Ilias Sarantopoulos)
[07:43:59] jouncebot: nowandnext
[07:43:59] For the next 23 hour(s) and 16 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230721T0700)
[07:43:59] In 23 hour(s) and 16 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230722T0700)
[07:44:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2171:3315 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49628 and previous config saved to /var/cache/conftool/dbconfig/20230721-074431-root.json
[07:44:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2171:3316 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49629 and previous config saved to /var/cache/conftool/dbconfig/20230721-074440-root.json
[07:46:16] (CR) Filippo Giunchedi: [C: +1] "I'm happy to merge or let Jeff do it, whichever is easiest" [puppet] - https://gerrit.wikimedia.org/r/940202 (https://phabricator.wikimedia.org/T342064) (owner: Dwisehaupt)
[07:46:26] (CR) Filippo Giunchedi: [C: +1] logstash: remove thanos log cloning [puppet] - https://gerrit.wikimedia.org/r/937604 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[07:47:09] SRE, LDAP-Access-Requests: Grant Access to Turnilo for Mpossoupe - https://phabricator.wikimedia.org/T342335 (Mpossoupe) Hi @andrea.denisse, Confirming that I can now have access to Turnilo. Many thanks for your support.
[07:50:38] !log zabe@deploy1002 Started scap: T342405
[07:50:43] T342405: $wgCampaignEventsProgramsAndEventsDashboardAPISecret no longer set in PrivateSettings.php - https://phabricator.wikimedia.org/T342405
[07:51:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49630 and previous config saved to /var/cache/conftool/dbconfig/20230721-075101-root.json
[07:51:30] (CR) Filippo Giunchedi: [C: +1] traffic: Filter cp|dns instances on HAProxy alerts [alerts] - https://gerrit.wikimedia.org/r/918471 (owner: Vgutierrez)
[07:51:39] (CR) Vgutierrez: [C: +2] traffic: Filter cp|dns instances on HAProxy alerts [alerts] - https://gerrit.wikimedia.org/r/918471 (owner: Vgutierrez)
[07:54:57] (PS1) Marostegui: report_users: Remove decommissioned IPs [software] - https://gerrit.wikimedia.org/r/940320
[07:55:43] (CR) Marostegui: [C: +2] report_users: Remove decommissioned IPs [software] - https://gerrit.wikimedia.org/r/940320 (owner: Marostegui)
[07:56:18] (Merged) jenkins-bot: report_users: Remove decommissioned IPs [software] - https://gerrit.wikimedia.org/r/940320 (owner: Marostegui)
[07:57:41] !log zabe@deploy1002 Finished scap: T342405 (duration: 07m 03s)
[07:57:47] T342405: $wgCampaignEventsProgramsAndEventsDashboardAPISecret no longer set in PrivateSettings.php - https://phabricator.wikimedia.org/T342405
[07:59:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2171:3315 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49631 and previous config saved to /var/cache/conftool/dbconfig/20230721-075936-root.json
[07:59:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2171:3316 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49632 and previous config saved to /var/cache/conftool/dbconfig/20230721-075944-root.json
[08:06:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49633 and previous config saved to /var/cache/conftool/dbconfig/20230721-080606-root.json
[08:10:06] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol1005.eqiad.wmnet with OS bullseye
[08:14:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2171:3315 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49634 and previous config saved to /var/cache/conftool/dbconfig/20230721-081441-root.json
[08:14:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2171:3316 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49635 and previous config saved to /var/cache/conftool/dbconfig/20230721-081449-root.json
[08:16:29] (CR) Giuseppe Lavagetto: "I generally like the idea, I have one doubt about the solution we chose." [deployment-charts] - https://gerrit.wikimedia.org/r/939315 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[08:19:42] (PS8) JMeybohm: kubernetes::master: Add confd config writing all sa certs [puppet] - https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826)
[08:19:58] (CR) JMeybohm: kubernetes::master: Add confd config writing all sa certs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[08:21:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49636 and previous config saved to /var/cache/conftool/dbconfig/20230721-082111-root.json
[08:27:13] (PS1) Arturo Borrero Gonzalez: acme_chief: openstack-codf1dev: drop cloudcontrol access [puppet] - https://gerrit.wikimedia.org/r/940321 (https://phabricator.wikimedia.org/T324992)
[08:29:33] (PS1) Arturo Borrero Gonzalez: openstack: eqiad1: don't deploy haproxy to cloudcontrol1005 [puppet] - https://gerrit.wikimedia.org/r/940322 (https://phabricator.wikimedia.org/T341495)
[08:29:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2171:3315 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49637 and previous config saved to /var/cache/conftool/dbconfig/20230721-082946-root.json
[08:29:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2171:3316 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49638 and previous config saved to /var/cache/conftool/dbconfig/20230721-082954-root.json
[08:30:49] (CR) Arturo Borrero Gonzalez: [C: +2] acme_chief: openstack-codf1dev: drop cloudcontrol access [puppet] - https://gerrit.wikimedia.org/r/940321 (https://phabricator.wikimedia.org/T324992) (owner: Arturo Borrero Gonzalez)
[08:31:03] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: eqiad1: don't deploy haproxy to cloudcontrol1005 [puppet] - https://gerrit.wikimedia.org/r/940322 (https://phabricator.wikimedia.org/T341495) (owner: Arturo Borrero Gonzalez)
[08:31:31] (PS1) Alexandros Kosiaris: deployment: Support making k8s deploys db section aware [puppet] - https://gerrit.wikimedia.org/r/940323 (https://phabricator.wikimedia.org/T340843)
[08:36:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49639 and previous config saved to /var/cache/conftool/dbconfig/20230721-083616-root.json
[08:38:08] (PS1) Arturo Borrero Gonzalez: openstack: eqiad1: control: fix typo in cloud_private include [puppet] - https://gerrit.wikimedia.org/r/940324 (https://phabricator.wikimedia.org/T341495)
[08:38:41] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: eqiad1: control: fix typo in cloud_private include [puppet] - https://gerrit.wikimedia.org/r/940324 (https://phabricator.wikimedia.org/T341495) (owner: Arturo Borrero Gonzalez)
[08:41:10] (CR) Giuseppe Lavagetto: [C: +1] kubernetes::master: Publish service-account cert to etcd (1 comment) [puppet] - https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[08:43:02] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox
[08:43:04] (CR) JMeybohm: [C: +1] mediawiki: add ingress support [deployment-charts] - https://gerrit.wikimedia.org/r/940189 (https://phabricator.wikimedia.org/T342356) (owner: Giuseppe Lavagetto)
[08:43:37] (CR) JMeybohm: [C: +1] admin: add mw-misc namespace [deployment-charts] - https://gerrit.wikimedia.org/r/940198 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[08:44:17] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:44:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2171:3315 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49640 and previous config saved to /var/cache/conftool/dbconfig/20230721-084450-root.json
[08:44:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2171:3316 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49641 and previous config saved to /var/cache/conftool/dbconfig/20230721-084459-root.json
[08:45:01] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "cloudcontrol1005 - aborrero@cumin1001 - T341495"
[08:45:05] T341495: eqiad1: cloudlb: reimage cloudcontrol1005 into new network setup - https://phabricator.wikimedia.org/T341495
[08:45:15] SRE, ops-eqiad, Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (Fabfur)
[08:45:45] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "cloudcontrol1005 - aborrero@cumin1001 - T341495"
[08:47:10] !log disabling puppet in C:cumin - T341669
[08:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:14] T341669: Allow for multiple confd instances in pupper - https://phabricator.wikimedia.org/T341669
[08:47:25] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "cloudcontrol1005 - aborrero@cumin1001 - T341495"
[08:47:33] !log disabling puppet in C:confd - T341669
[08:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:08] (CR) Ilias Sarantopoulos: ml-services: revscoring template change .wiki to reflect wikiID (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266) (owner: Ilias Sarantopoulos)
[08:48:10] !log ignore "disabling puppet in C:cumin" - was a typo
[08:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:01] (CR) JMeybohm: [V: +1 C: +2] kubernetes::master: Publish service-account cert to etcd [puppet] - https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[08:49:05] (CR) JMeybohm: [C: +2] kubernetes: Add etcd srv names to clusterconfig structure [puppet] - https://gerrit.wikimedia.org/r/937793 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[08:49:14] (CR) JMeybohm: [V: +1 C: +2] confd: allow running multiple instances [puppet] - https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: Giuseppe Lavagetto)
[08:51:17] RECOVERY - MariaDB read only matomo on db1108 is OK: Version 10.4.22-MariaDB-log, Uptime 74s, read_only: True, event_scheduler: True, 11.67 QPS, connection latency: 0.003959s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[08:51:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49642 and previous config saved to /var/cache/conftool/dbconfig/20230721-085120-root.json
[08:53:19] RECOVERY - mysqld processes on db1108 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[08:53:46] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "cloudcontrol1005 - aborrero@cumin1001 - T341495"
[08:53:50] T341495: eqiad1: cloudlb: reimage cloudcontrol1005 into new network setup - https://phabricator.wikimedia.org/T341495
[08:54:09] RECOVERY - MariaDB read only analytics_meta on db1108 is OK: Version 10.4.22-MariaDB-log, Uptime 84s, read_only: True, event_scheduler: True, 17.60 QPS, connection latency: 0.008605s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[08:56:06] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 7
days, 0:00:00 on db1108.eqiad.wmnet with reason: db1108 has been replaced with db1208 - leaving for a few days before decom [08:56:20] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1108.eqiad.wmnet with reason: db1108 has been replaced with db1208 - leaving for a few days before decom [08:58:01] (03PS2) 10Giuseppe Lavagetto: kubernetes: add mw-misc "service" [puppet] - 10https://gerrit.wikimedia.org/r/940186 (https://phabricator.wikimedia.org/T341859) [08:58:03] (03PS1) 10Giuseppe Lavagetto: conftool::state: use confd::default_instance [puppet] - 10https://gerrit.wikimedia.org/r/940325 [08:58:16] (03PS2) 10Giuseppe Lavagetto: conftool::state: use confd::default_instance [puppet] - 10https://gerrit.wikimedia.org/r/940325 [08:59:33] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1014.eqiad.wmnet with OS bookworm [08:59:45] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42620/console" [puppet] - 10https://gerrit.wikimedia.org/r/940325 (owner: 10Giuseppe Lavagetto) [08:59:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2171:3315 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49643 and previous config saved to /var/cache/conftool/dbconfig/20230721-085955-root.json [09:00:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2171:3316 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49644 and previous config saved to /var/cache/conftool/dbconfig/20230721-090003-root.json [09:01:06] (03PS3) 10JMeybohm: conftool::state: use confd::default_instance [puppet] - 10https://gerrit.wikimedia.org/r/940325 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [09:02:11] (03PS2) 10Alexandros Kosiaris: deployment: Support making k8s deploys db section aware [puppet] - 
10https://gerrit.wikimedia.org/r/940323 (https://phabricator.wikimedia.org/T340843) [09:04:01] (03CR) 10JMeybohm: [C: 03+2] conftool::state: use confd::default_instance [puppet] - 10https://gerrit.wikimedia.org/r/940325 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [09:06:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49645 and previous config saved to /var/cache/conftool/dbconfig/20230721-090625-root.json [09:08:55] (03CR) 10Jcrespo: "Hello, Btullis and other reviewers." [puppet] - 10https://gerrit.wikimedia.org/r/939654 (https://phabricator.wikimedia.org/T334055) (owner: 10Btullis) [09:09:49] !log enable puppet on C:confd - T341669 [09:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:53] T341669: Allow for multiple confd instances in pupper - https://phabricator.wikimedia.org/T341669 [09:10:03] ^ arturo [09:10:17] jayme: thanks! 
[09:13:02] (03CR) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [09:13:12] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42621/console" [puppet] - 10https://gerrit.wikimedia.org/r/940323 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [09:18:22] (03PS6) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) [09:18:24] (03PS8) 10Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) [09:18:26] (03PS8) 10Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) [09:18:35] (03PS9) 10JMeybohm: kubernetes::master: Add confd config writing all sa certs [puppet] - 10https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826) [09:19:08] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host lvs1014.eqiad.wmnet with OS bookworm [09:19:22] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1014.eqiad.wmnet with OS bookworm [09:20:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:23:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:26:34] (03PS1) 10Btullis: Fix the cephosd fsid mismatch check [puppet] - 10https://gerrit.wikimedia.org/r/940326 (https://phabricator.wikimedia.org/T330151) [09:26:38] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:28:32] 10SRE-OnFire, 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, and 3 others: Review alerting around Wikidata Query Service update pipeline - https://phabricator.wikimedia.org/T336574 (10Gehel) p:05Triage→03Medium [09:30:13] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1014.eqiad.wmnet with reason: host reimage [09:31:30] (03PS1) 10Jbond: admin: update approveres for ops [puppet] - 10https://gerrit.wikimedia.org/r/940327 [09:31:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:32:10] (03CR) 10Jbond: [C: 03+2] admin: update approveres for ops [puppet] - 10https://gerrit.wikimedia.org/r/940327 (owner: 10Jbond) [09:33:15] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 
on lvs1014.eqiad.wmnet with reason: host reimage [09:36:43] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol1005.eqiad.wmnet with OS bullseye [09:38:13] RECOVERY - cinder-api http on cloudcontrol1005 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 663 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:45:04] (03CR) 10Btullis: [C: 03+2] Fix the cephosd fsid mismatch check [puppet] - 10https://gerrit.wikimedia.org/r/940326 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [09:47:49] !log fabfur@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001" [09:49:16] (03PS1) 10Giuseppe Lavagetto: confd::instance: support prefix in non-main instances [puppet] - 10https://gerrit.wikimedia.org/r/940329 [09:50:22] !log fabfur@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001" [09:50:27] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1014.eqiad.wmnet with OS bookworm [09:52:13] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10Fabfur) [09:52:22] (03PS1) 10Arturo Borrero Gonzalez: openstack: admin_scripts: install mariadb-client from the galera repo [puppet] - 10https://gerrit.wikimedia.org/r/940330 [09:52:41] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1005.eqiad.wmnet with reason: host reimage [09:53:26] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1015.eqiad.wmnet with OS bookworm [09:55:17] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1005.eqiad.wmnet with reason: host reimage [09:57:14] (03CR) 10Arturo 
Borrero Gonzalez: [V: 03+1 C: 03+2] "PCC as expected: https://puppet-compiler.wmflabs.org/output/940330/42622/" [puppet] - 10https://gerrit.wikimedia.org/r/940330 (owner: 10Arturo Borrero Gonzalez) [09:57:56] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host lvs1015.eqiad.wmnet with OS bookworm [09:58:12] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1015.eqiad.wmnet with OS bookworm [09:58:31] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol1005.eqiad.wmnet with OS bullseye [09:58:43] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol1005.eqiad.wmnet with OS bullseye [10:02:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:04:12] (03PS2) 10JMeybohm: confd::instance: support prefix in non-main instances [puppet] - 10https://gerrit.wikimedia.org/r/940329 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [10:04:14] (03PS10) 10JMeybohm: kubernetes::master: Add confd config writing all sa certs [puppet] - 10https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826) [10:04:56] (03PS3) 10JMeybohm: confd::instance: support prefix in non-main instances [puppet] - 10https://gerrit.wikimedia.org/r/940329 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [10:04:58] (03PS11) 10JMeybohm: kubernetes::master: Add confd config writing all sa certs [puppet] - 10https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826) [10:07:12] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42623/console" [puppet] - 10https://gerrit.wikimedia.org/r/940329 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [10:07:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:08:56] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1015.eqiad.wmnet with reason: host reimage [10:11:27] (03PS4) 10JMeybohm: confd::instance: support prefix in non-main instances [puppet] - 10https://gerrit.wikimedia.org/r/940329 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [10:11:29] (03PS12) 10JMeybohm: kubernetes::master: Add confd config writing all sa certs [puppet] - 10https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826) [10:12:02] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1015.eqiad.wmnet with reason: host reimage [10:13:10] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1005.eqiad.wmnet with reason: host reimage [10:13:25] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42624/console" [puppet] - 10https://gerrit.wikimedia.org/r/940329 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [10:14:01] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42625/console" [puppet] - 10https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [10:16:17] !log aborrero@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1005.eqiad.wmnet with reason: host reimage [10:19:48] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] kubernetes::master: Add confd config writing all sa certs [puppet] - 10https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [10:19:51] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] confd::instance: support prefix in non-main instances [puppet] - 10https://gerrit.wikimedia.org/r/940329 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [10:24:06] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10Fabfur) [10:24:42] (03PS1) 10Btullis: Disable the ceph fsid mismatch check [puppet] - 10https://gerrit.wikimedia.org/r/940332 (https://phabricator.wikimedia.org/T330151) [10:25:14] (03CR) 10CI reject: [V: 04-1] Disable the ceph fsid mismatch check [puppet] - 10https://gerrit.wikimedia.org/r/940332 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [10:26:16] !log fabfur@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001" [10:27:10] !log fabfur@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001" [10:27:15] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1015.eqiad.wmnet with OS bookworm [10:27:46] (ConfdResourceFailed) firing: (2) confd resource _etc_kubernetes_pki_kube-apiserver-sa-certs.pem.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [10:29:15] my fault [10:30:18] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1016.eqiad.wmnet with OS bookworm [10:33:26] 
PROBLEM - confd-k8s service on kubemaster2001 is CRITICAL: CRITICAL - Expecting active but unit confd-k8s is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:33:54] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs1016.eqiad.wmnet with OS bookworm [10:34:07] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1016.eqiad.wmnet with OS bookworm [10:34:46] (03PS2) 10Btullis: Disable the ceph fsid mismatch check [puppet] - 10https://gerrit.wikimedia.org/r/940332 (https://phabricator.wikimedia.org/T330151) [10:35:18] PROBLEM - confd-k8s service on kubestagemaster2001 is CRITICAL: CRITICAL - Expecting active but unit confd-k8s is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:38:35] (03CR) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [10:39:57] (03PS1) 10JMeybohm: Add _etcd-client-ssl._tcp SRV records for k8s etcd clusters [dns] - 10https://gerrit.wikimedia.org/r/940334 (https://phabricator.wikimedia.org/T329826) [10:40:03] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42626/console" [puppet] - 10https://gerrit.wikimedia.org/r/940332 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [10:40:37] (03CR) 10Btullis: [V: 03+1 C: 03+2] Disable the ceph fsid mismatch check [puppet] - 10https://gerrit.wikimedia.org/r/940332 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [10:44:09] (03CR) 10JMeybohm: [C: 03+2] Add _etcd-client-ssl._tcp SRV records for k8s etcd clusters [dns] - 10https://gerrit.wikimedia.org/r/940334 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [10:46:46] PROBLEM - confd-k8s service on dse-k8s-ctrl1001 is CRITICAL: CRITICAL - Expecting 
active but unit confd-k8s is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:49:17] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host lvs1016.eqiad.wmnet with OS bookworm [10:49:32] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1016.eqiad.wmnet with OS bookworm [10:58:24] RECOVERY - confd-k8s service on kubestagemaster2001 is OK: OK - confd-k8s is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:59:48] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:01:00] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:01:18] PROBLEM - confd-k8s service on kubestagemaster1001 is CRITICAL: CRITICAL - Expecting active but unit confd-k8s is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:04:30] PROBLEM - confd-k8s service on kubestagemaster2001 is CRITICAL: CRITICAL - Expecting active but unit confd-k8s is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:05:56] (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: add initial settings for cloudcontrol1005 as functional node [puppet] - 10https://gerrit.wikimedia.org/r/940336 (https://phabricator.wikimedia.org/T341495) [11:06:24] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [11:09:30] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:10:01] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] "PCC as expected: https://puppet-compiler.wmflabs.org/output/940336/42627/" [puppet] - 
10https://gerrit.wikimedia.org/r/940336 (https://phabricator.wikimedia.org/T341495) (owner: 10Arturo Borrero Gonzalez) [11:16:04] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50278 bytes in 7.042 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:16:20] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.275 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:16:48] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:21:30] RECOVERY - confd-k8s service on kubemaster2001 is OK: OK - confd-k8s is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:22:22] RECOVERY - confd-k8s service on dse-k8s-ctrl1001 is OK: OK - confd-k8s is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:22:22] RECOVERY - confd-k8s service on kubestagemaster1001 is OK: OK - confd-k8s is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:30:55] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol1005.eqiad.wmnet with OS bullseye [11:30:57] (03PS3) 10Ladsgroup: spicerack: Add config file for MySQL/MariaDB [puppet] - 10https://gerrit.wikimedia.org/r/939302 [11:31:02] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] spicerack: Add config file for MySQL/MariaDB [puppet] - 10https://gerrit.wikimedia.org/r/939302 (owner: 10Ladsgroup) [11:31:31] (03PS1) 10JMeybohm: Add dummy etcd_srv_name to kubernetes::clusetrs in CI [puppet] - 10https://gerrit.wikimedia.org/r/940339 (https://phabricator.wikimedia.org/T329826) [11:34:54] (03PS2) 10JMeybohm: Add dummy etcd_srv_name to kubernetes::clusetrs in CI [puppet] - 10https://gerrit.wikimedia.org/r/940339 (https://phabricator.wikimedia.org/T329826) [11:36:02] (03CR) 
10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42629/console" [puppet] - 10https://gerrit.wikimedia.org/r/940339 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [11:39:19] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] Add dummy etcd_srv_name to kubernetes::clusetrs in CI [puppet] - 10https://gerrit.wikimedia.org/r/940339 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [11:42:10] PROBLEM - Check systemd state on kubestagemaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:44:57] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: 1 VM request for WMDE Airflow - https://phabricator.wikimedia.org/T342424 (10Stevemunene) [11:45:36] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: 1 VM request for WMDE Airflow - https://phabricator.wikimedia.org/T342424 (10Stevemunene) [11:46:18] (03PS1) 10Urbanecm: Add reassignMentees.php maintenance script [extensions/GrowthExperiments] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/940142 (https://phabricator.wikimedia.org/T330071) [11:47:16] !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host lvs1016.eqiad.wmnet with OS bookworm [11:51:47] (03PS1) 10Gmodena: data-engineering: lower server for flink enrcihment app [alerts] - 10https://gerrit.wikimedia.org/r/940341 (https://phabricator.wikimedia.org/T340666) [11:53:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (10Jclark-ctr) Service Request 172470150 was successfully submitted. 
[11:58:51] (03PS1) 10Arturo Borrero Gonzalez: openstack: cloudcontrol1005: allow haproxy backend access by cloudlb nodes [puppet] - 10https://gerrit.wikimedia.org/r/940342 (https://phabricator.wikimedia.org/T341495) [12:03:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: cloudcontrol1005: allow haproxy backend access by cloudlb nodes [puppet] - 10https://gerrit.wikimedia.org/r/940342 (https://phabricator.wikimedia.org/T341495) (owner: 10Arturo Borrero Gonzalez) [12:03:14] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1016.eqiad.wmnet with OS bookworm [12:09:23] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10RobH) [12:09:38] (03PS1) 10Arturo Borrero Gonzalez: eqiad1: depool cloudcontrol1005 [puppet] - 10https://gerrit.wikimedia.org/r/940344 (https://phabricator.wikimedia.org/T341495) [12:09:44] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10RobH) >>! In T308339#9033372, @Marostegui wrote: > For db1131, it is a master. When do you plan to do this? I'd need a couple of days to remove its master role. So when I checked the list of... [12:10:01] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10RobH) [12:10:17] (03CR) 10Arturo Borrero Gonzalez: "Merge this patch if you think the cloud control plane @ eqiad1 is broken." 
[puppet] - 10https://gerrit.wikimedia.org/r/940344 (https://phabricator.wikimedia.org/T341495) (owner: 10Arturo Borrero Gonzalez) [12:11:03] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10Marostegui) Sounds good to me Rob :-) If for any other reason we end up switching that host's role before Q2, I'll comment on this task [12:14:46] !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs1016.eqiad.wmnet with OS bookworm [12:16:25] (03PS1) 10Ladsgroup: Add hiera profile::spicerack::mysql_config_data merge options [puppet] - 10https://gerrit.wikimedia.org/r/940345 [12:16:30] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:17:00] (03PS2) 10Ladsgroup: Add hiera profile::spicerack::mysql_config_data merge options [puppet] - 10https://gerrit.wikimedia.org/r/940345 [12:17:04] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Add hiera profile::spicerack::mysql_config_data merge options [puppet] - 10https://gerrit.wikimedia.org/r/940345 (owner: 10Ladsgroup) [12:17:06] PROBLEM - cinder-volume process on cloudcontrol1005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-volume https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:17:30] PROBLEM - glance-api http on cloudcontrol1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:20:44] (03PS1) 10Ladsgroup: sre.mysql.clone: Read replication user and pass from spicerack config [cookbooks] - 10https://gerrit.wikimedia.org/r/940346 [12:22:59] (03CR) 10CI reject: [V: 04-1] sre.mysql.clone: Read replication user and pass from spicerack config [cookbooks] - 10https://gerrit.wikimedia.org/r/940346 (owner: 10Ladsgroup) [12:23:50] (03PS2) 10Ladsgroup: 
sre.mysql.clone: Read replication user and pass from spicerack config [cookbooks] - 10https://gerrit.wikimedia.org/r/940346 [12:24:26] (03PS1) 10Sergio Gimeno: GrowthExperiments: enable AddLink task frontend in 10th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940347 (https://phabricator.wikimedia.org/T308135) [12:25:54] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10RobH) [12:33:31] (03CR) 10Marostegui: [C: 03+1] sre.mysql.clone: Read replication user and pass from spicerack config [cookbooks] - 10https://gerrit.wikimedia.org/r/940346 (owner: 10Ladsgroup) [12:33:52] (03PS1) 10Jbond: sre.hardware.upgrade-firmware: fix error messages [cookbooks] - 10https://gerrit.wikimedia.org/r/940349 (https://phabricator.wikimedia.org/T329722) [12:35:00] !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts analytics1075.eqiad.wmnet [12:35:01] (03PS1) 10JMeybohm: confd::instance: Allow to specify multiple backend hosts [puppet] - 10https://gerrit.wikimedia.org/r/940350 (https://phabricator.wikimedia.org/T329826) [12:35:18] !log jbond@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts analytics1075.eqiad.wmnet [12:35:39] (03CR) 10Ladsgroup: [C: 03+2] sre.mysql.clone: Read replication user and pass from spicerack config [cookbooks] - 10https://gerrit.wikimedia.org/r/940346 (owner: 10Ladsgroup) [12:37:20] (03PS2) 10JMeybohm: confd::instance: Allow to specify multiple backend hosts [puppet] - 10https://gerrit.wikimedia.org/r/940350 (https://phabricator.wikimedia.org/T329826) [12:37:56] !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts analytics1075.eqiad.wmnet [12:37:57] (03Merged) 10jenkins-bot: sre.mysql.clone: Read replication user and pass from spicerack config [cookbooks] - 10https://gerrit.wikimedia.org/r/940346 (owner: 10Ladsgroup) [12:38:26] !log 
jbond@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts analytics1075.eqiad.wmnet [12:38:56] !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts analytics1075.eqiad.wmnet [12:39:18] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts analytics1075.eqiad.wmnet [12:39:42] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: fix error messages [cookbooks] - 10https://gerrit.wikimedia.org/r/940349 (https://phabricator.wikimedia.org/T329722) (owner: 10Jbond) [12:39:44] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42630/console" [puppet] - 10https://gerrit.wikimedia.org/r/940350 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [12:41:12] (03PS3) 10JMeybohm: confd::instance: Allow to specify multiple backend hosts [puppet] - 10https://gerrit.wikimedia.org/r/940350 (https://phabricator.wikimedia.org/T329826) [12:41:56] (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: fix error messages [cookbooks] - 10https://gerrit.wikimedia.org/r/940349 (https://phabricator.wikimedia.org/T329722) (owner: 10Jbond) [12:42:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:44:37] (03PS4) 10JMeybohm: confd::instance: Allow to specify multiple backend hosts [puppet] - 10https://gerrit.wikimedia.org/r/940350 (https://phabricator.wikimedia.org/T329826) [12:46:21] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [12:46:35] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 3): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42632/console" [puppet] - 10https://gerrit.wikimedia.org/r/940350 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [12:47:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:48:20] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt rdb101[34] - jclark@cumin1001" [12:48:41] (03CR) 10Jgreen: [C: 03+1] Remove frav1002 monitoring, add it for frav1003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/940202 (https://phabricator.wikimedia.org/T342064) (owner: 10Dwisehaupt) [12:49:04] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt rdb101[34] - jclark@cumin1001" [12:49:04] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:49:21] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host rdb1013 [12:49:33] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Please also modify confd::default_instance to still accept a node name and port and scheme and build an url from it to pass to confd::inst" [puppet] - 10https://gerrit.wikimedia.org/r/940350 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [12:50:37] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host rdb1013 [12:50:43] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host rdb1014 [12:52:06] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces 
(exit_code=0) for host rdb1014 [12:52:20] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host rdb1013.mgmt.eqiad.wmnet with reboot policy FORCED [12:52:22] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host rdb1014.mgmt.eqiad.wmnet with reboot policy FORCED [12:54:52] (03CR) 10Giuseppe Lavagetto: [C: 03+1] confd::instance: Allow to specify multiple backend hosts [puppet] - 10https://gerrit.wikimedia.org/r/940350 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [12:59:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10Jclark-ctr) [13:02:29] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: improvements to firmware upgrade cookbook - https://phabricator.wikimedia.org/T329722 (10jbond) @BTullis currently the idrac 8 is not supported by the firmware cookbook, you should have got a better error message and have sent a fix for that. We... [13:05:54] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [13:07:00] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: improvements to firmware upgrade cookbook - https://phabricator.wikimedia.org/T329722 (10BTullis) >>! In T329722#9034108, @jbond wrote: > When we discussed this previously it was decided it wasn't worth the effort to maintain support for idrac 8... 
[13:09:09] (03CR) 10Jbond: [C: 03+1] "lgtm possible optimisation inline" [puppet] - 10https://gerrit.wikimedia.org/r/940201 (https://phabricator.wikimedia.org/T326657) (owner: 10Herron) [13:13:58] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: improvements to firmware upgrade cookbook - https://phabricator.wikimedia.org/T329722 (10jbond) Thanks @BTullis and if things do start becoming more painful we can re-evaluate @Papaul is there still an action on this task or can it be closed? [13:23:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:23:37] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: improvements to firmware upgrade cookbook - https://phabricator.wikimedia.org/T329722 (10Papaul) @jbond thank you we are all good. [13:32:18] (03PS5) 10Jbond: puppetserver: make notifying configurable [puppet] - 10https://gerrit.wikimedia.org/r/939643 (https://phabricator.wikimedia.org/T330490) [13:34:41] (03CR) 10CI reject: [V: 04-1] puppetserver: make notifying configurable [puppet] - 10https://gerrit.wikimedia.org/r/939643 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [13:35:29] (03PS1) 10Jclark-ctr: add rdb101[3-4] site.pp [puppet] - 10https://gerrit.wikimedia.org/r/940356 (https://phabricator.wikimedia.org/T326170) [13:36:09] Q: Is smokeping.wikimedia.org still a thing? 
Going through old docs; and cannot find a Phab ticket about killing it or such [13:36:47] (03CR) 10Jclark-ctr: [C: 03+2] add rdb101[3-4] site.pp [puppet] - 10https://gerrit.wikimedia.org/r/940356 (https://phabricator.wikimedia.org/T326170) (owner: 10Jclark-ctr) [13:36:56] Q: Also, https://config-master.wikimedia.org/ssh-fingerprints.txt is a 404 suddenly. Is that intended? [13:37:40] andre: https://phabricator.wikimedia.org/T169860 [13:38:05] whoah thanks [13:38:56] and https://config-master.wikimedia.org/known_hosts [13:39:19] that URL works, but the fingerprints one doesn't anymore [13:42:48] andre: https://gerrit.wikimedia.org/r/c/operations/puppet/+/936692 suggests it is intentional [13:43:00] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host rdb1014.mgmt.eqiad.wmnet with reboot policy FORCED [13:43:03] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host rdb1013.mgmt.eqiad.wmnet with reboot policy FORCED [13:44:20] eh, thanks, going to update my local bookmarks. Thanks everyone! [13:45:05] (03PS3) 10Giuseppe Lavagetto: mediawiki: add ingress support [deployment-charts] - 10https://gerrit.wikimedia.org/r/940189 (https://phabricator.wikimedia.org/T342356) [13:45:07] (03PS3) 10Giuseppe Lavagetto: admin: add mw-misc namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/940198 (https://phabricator.wikimedia.org/T341859) [13:45:09] (03PS3) 10Giuseppe Lavagetto: mw-misc: add deployment with support for noc.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/940199 [13:45:55] 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10bking) This is complete; moving to 'needs review'. 
[13:47:21] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] confd::instance: Allow to specify multiple backend hosts [puppet] - 10https://gerrit.wikimedia.org/r/940350 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [13:50:54] RECOVERY - confd-k8s service on kubestagemaster2001 is OK: OK - confd-k8s is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:52:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:54:47] (03PS1) 10JMeybohm: kubernetes::master: etcd_servers list being created to late [puppet] - 10https://gerrit.wikimedia.org/r/940357 (https://phabricator.wikimedia.org/T329826) [13:57:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:59:30] (03CR) 10JMeybohm: [C: 03+2] kubernetes::master: etcd_servers list being created to late [puppet] - 10https://gerrit.wikimedia.org/r/940357 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [14:01:53] (03PS2) 10Gmodena: data-engineering: lower serverity for flink enrcihment app [alerts] - 10https://gerrit.wikimedia.org/r/940341 (https://phabricator.wikimedia.org/T340666) [14:02:02] RECOVERY - Check systemd state on kubestagemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:02:47] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs1016.eqiad.wmnet with OS bookworm [14:04:07] ^^ hope you are luckier than me 
[14:04:10] (03CR) 10Filippo Giunchedi: [C: 03+2] Remove frav1002 monitoring, add it for frav1003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/940202 (https://phabricator.wikimedia.org/T342064) (owner: 10Dwisehaupt) [14:04:38] (03PS4) 10Giuseppe Lavagetto: admin: add mw-misc namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/940198 (https://phabricator.wikimedia.org/T341859) [14:04:40] (03PS4) 10Giuseppe Lavagetto: mw-misc: add deployment with support for noc.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/940199 [14:05:52] fabfur: we will see! [14:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:07:27] (03PS1) 10Ssingh: team-traffic: add service restart alert for bird [alerts] - 10https://gerrit.wikimedia.org/r/940359 [14:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:11:38] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:14:13] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs1016.eqiad.wmnet with OS bookworm [14:14:22] (03PS5) 10Alexandros Kosiaris: Kubernetes: add support for deployment apparmor profiles [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [14:15:06] (03PS4) 10Herron: prometheus: add 
prometheus.svc.site.wmnet SANs to cfssl cert [puppet] - 10https://gerrit.wikimedia.org/r/940201 (https://phabricator.wikimedia.org/T326657) [14:16:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:16:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:17:47] !log sudo ipmitool -I lanplus -H "lvs1016.mgmt.eqiad.wmnet" -U root -E chassis power cycle [14:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:21] (03PS1) 10Elukey: knative-serving: add more selectors to the Istio Gateway resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/940361 [14:19:55] (03CR) 10CI reject: [V: 04-1] knative-serving: add more selectors to the Istio Gateway resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/940361 (owner: 10Elukey) [14:20:47] (03PS6) 10Alexandros Kosiaris: Kubernetes: add support for deployment apparmor profiles [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [14:20:49] (03PS2) 10Elukey: knative-serving: add more selectors to the Istio Gateway resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/940361 [14:21:24] (03CR) 10CI reject: [V: 04-1] knative-serving: add more selectors to the Istio Gateway resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/940361 (owner: 10Elukey) [14:22:04] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42635/console" 
[puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [14:23:21] (03PS7) 10Alexandros Kosiaris: Kubernetes: add support for deployment apparmor profiles [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [14:23:35] (03PS1) 10JMeybohm: confd: Use the unit name as syslog identifier [puppet] - 10https://gerrit.wikimedia.org/r/940362 (https://phabricator.wikimedia.org/T341669) [14:24:00] (03CR) 10CI reject: [V: 04-1] confd: Use the unit name as syslog identifier [puppet] - 10https://gerrit.wikimedia.org/r/940362 (https://phabricator.wikimedia.org/T341669) (owner: 10JMeybohm) [14:24:37] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42636/console" [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [14:25:04] (03CR) 10Filippo Giunchedi: [C: 03+1] team-traffic: add service restart alert for bird [alerts] - 10https://gerrit.wikimedia.org/r/940359 (owner: 10Ssingh) [14:25:49] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC pretty happy at https://puppet-compiler.wmflabs.org/output/940152/42636/kubernetes1007.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [14:25:51] (03PS2) 10JMeybohm: confd: Use the unit name as syslog identifier [puppet] - 10https://gerrit.wikimedia.org/r/940362 (https://phabricator.wikimedia.org/T341669) [14:26:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:27:45] (03CR) 10Herron: prometheus: add 
prometheus.svc.site.wmnet SANs to cfssl cert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/940201 (https://phabricator.wikimedia.org/T326657) (owner: 10Herron) [14:27:47] (ConfdResourceFailed) firing: (2) confd resource _etc_kubernetes_pki_kube-apiserver-sa-certs.pem.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [14:27:48] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42637/console" [puppet] - 10https://gerrit.wikimedia.org/r/940362 (https://phabricator.wikimedia.org/T341669) (owner: 10JMeybohm) [14:28:00] (03CR) 10Alexandros Kosiaris: "I 'd suggest abandoning this and working on https://gerrit.wikimedia.org/r/c/operations/puppet/+/940152" [puppet] - 10https://gerrit.wikimedia.org/r/938349 (owner: 10Cory Massaro) [14:30:04] (03CR) 10Giuseppe Lavagetto: [C: 03+1] confd: Use the unit name as syslog identifier [puppet] - 10https://gerrit.wikimedia.org/r/940362 (https://phabricator.wikimedia.org/T341669) (owner: 10JMeybohm) [14:30:23] (03PS3) 10Elukey: knative-serving: add more selectors to the Istio Gateway resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/940361 [14:30:29] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] confd: Use the unit name as syslog identifier [puppet] - 10https://gerrit.wikimedia.org/r/940362 (https://phabricator.wikimedia.org/T341669) (owner: 10JMeybohm) [14:31:56] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: improvements to firmware upgrade cookbook - https://phabricator.wikimedia.org/T329722 (10jbond) 05Open→03Resolved [14:32:54] (03PS6) 10Jbond: puppetserver: make notifying configurable [puppet] - 10https://gerrit.wikimedia.org/r/939643 (https://phabricator.wikimedia.org/T330490) [14:32:56] (03PS1) 10Jbond: puppetserver: Add a file to track when a service restart or reload is 
[puppet] - 10https://gerrit.wikimedia.org/r/940365 (https://phabricator.wikimedia.org/T330490) [14:32:58] (03PS1) 10Jbond: motd: Add motd indicating services which need restarting [puppet] - 10https://gerrit.wikimedia.org/r/940366 (https://phabricator.wikimedia.org/T330490) [14:33:46] (03CR) 10CI reject: [V: 04-1] puppetserver: Add a file to track when a service restart or reload is [puppet] - 10https://gerrit.wikimedia.org/r/940365 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [14:33:48] (03CR) 10TChin: [C: 03+2] data-engineering: lower serverity for flink enrcihment app [alerts] - 10https://gerrit.wikimedia.org/r/940341 (https://phabricator.wikimedia.org/T340666) (owner: 10Gmodena) [14:33:50] (03CR) 10Jbond: puppetserver: make notifying configurable (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/939643 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [14:34:31] (03PS2) 10Jbond: puppetserver: Add a file to track when a service restart or reload is [puppet] - 10https://gerrit.wikimedia.org/r/940365 (https://phabricator.wikimedia.org/T330490) [14:34:45] (03PS2) 10Jbond: motd: Add motd indicating services which need restarting [puppet] - 10https://gerrit.wikimedia.org/r/940366 (https://phabricator.wikimedia.org/T330490) [14:34:56] (03Merged) 10jenkins-bot: data-engineering: lower serverity for flink enrcihment app [alerts] - 10https://gerrit.wikimedia.org/r/940341 (https://phabricator.wikimedia.org/T340666) (owner: 10Gmodena) [14:35:01] (03CR) 10CI reject: [V: 04-1] puppetserver: Add a file to track when a service restart or reload is [puppet] - 10https://gerrit.wikimedia.org/r/940365 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [14:35:07] PROBLEM - Check systemd state on mx1001 is CRITICAL: CRITICAL - degraded: The following units failed: generate_otrs_aliases.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:35:15] (03PS2) 10AOkoth: vrts: add test VM to site [puppet] - 
10https://gerrit.wikimedia.org/r/939349 (https://phabricator.wikimedia.org/T340027) [14:35:17] (03PS1) 10AOkoth: vrts: change blackbox http check [puppet] - 10https://gerrit.wikimedia.org/r/940367 (https://phabricator.wikimedia.org/T342366) [14:35:34] (03CR) 10CI reject: [V: 04-1] motd: Add motd indicating services which need restarting [puppet] - 10https://gerrit.wikimedia.org/r/940366 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [14:36:01] (03CR) 10Ssingh: "Thanks for the review! Note: don't merge before Monday." [alerts] - 10https://gerrit.wikimedia.org/r/940359 (owner: 10Ssingh) [14:36:09] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/940201 (https://phabricator.wikimedia.org/T326657) (owner: 10Herron) [14:36:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:36:49] (03CR) 10Herron: [C: 03+2] prometheus: add prometheus.svc.site.wmnet SANs to cfssl cert [puppet] - 10https://gerrit.wikimedia.org/r/940201 (https://phabricator.wikimedia.org/T326657) (owner: 10Herron) [14:36:51] (03CR) 10jenkins-bot: motd: Add motd indicating services which need restarting [puppet] - 10https://gerrit.wikimedia.org/r/940366 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [14:37:14] (03CR) 10Jforrester: [C: 03+1] Kubernetes: add support for deployment apparmor profiles [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [14:37:30] !log sudo ipmitool -I lanplus -H "lvs1016.mgmt.eqiad.wmnet" -U root -E chassis power off [14:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:25] (03PS1) 10Btullis: Fix the wwn of sas drives used for the 
cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/940368 (https://phabricator.wikimedia.org/T330151) [14:38:36] (03PS3) 10Alexandros Kosiaris: wikifunctions: Add AppArmor profile usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/879282 (https://phabricator.wikimedia.org/T326785) [14:38:52] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs1016.eqiad.wmnet with OS bookworm [14:39:27] (03CR) 10Alexandros Kosiaris: wikifunctions: Add AppArmor profile usage (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/879282 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris) [14:39:48] (03CR) 10Alexandros Kosiaris: wikifunctions: Add AppArmor profile usage (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/879282 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris) [14:39:50] (03PS7) 10Jbond: puppetserver: make notifying configurable [puppet] - 10https://gerrit.wikimedia.org/r/939643 (https://phabricator.wikimedia.org/T330490) [14:39:52] (03PS3) 10Jbond: puppetserver: Add a file to track when a service restart or reload is [puppet] - 10https://gerrit.wikimedia.org/r/940365 (https://phabricator.wikimedia.org/T330490) [14:39:54] (03PS3) 10Jbond: motd: Add motd indicating services which need restarting [puppet] - 10https://gerrit.wikimedia.org/r/940366 (https://phabricator.wikimedia.org/T330490) [14:39:56] (03PS1) 10JMeybohm: kubernetes::master: Fix confd template as keys start with / [puppet] - 10https://gerrit.wikimedia.org/r/940369 (https://phabricator.wikimedia.org/T329826) [14:41:07] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42641/console" [puppet] - 10https://gerrit.wikimedia.org/r/940369 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [14:41:11] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42640/console" [puppet] - 10https://gerrit.wikimedia.org/r/940366 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [14:43:18] (03CR) 10Klausman: [C: 03+1] knative-serving: add more selectors to the Istio Gateway resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/940361 (owner: 10Elukey) [14:44:01] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/output/940367/42639/vrts1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/940367 (https://phabricator.wikimedia.org/T342366) (owner: 10AOkoth) [14:45:13] (03CR) 10Jbond: [V: 03+1] motd: Add motd indicating services which need restarting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/940366 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [14:45:15] (03CR) 10Btullis: [C: 03+2] Fix the wwn of sas drives used for the cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/940368 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [14:46:17] (03PS6) 10Jbond: install_server: drop Bashisms [puppet] - 10https://gerrit.wikimedia.org/r/938898 (https://phabricator.wikimedia.org/T95064) [14:46:26] (03PS4) 10Jbond: kubeadm: the use of read -p suggest this should be using bash [puppet] - 10https://gerrit.wikimedia.org/r/938899 (https://phabricator.wikimedia.org/T95064) [14:46:31] (03CR) 10Jbond: [C: 03+2] monitoring: fix bashisms and other minor lint issues [puppet] - 10https://gerrit.wikimedia.org/r/938897 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond) [14:46:46] (03CR) 10Jbond: [C: 03+2] install_server: drop Bashisms (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/938898 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond) [14:46:49] (03CR) 10Elukey: [C: 03+2] knative-serving: add more selectors to the Istio Gateway resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/940361 (owner: 10Elukey) [14:46:53] (03CR) 10Jbond: [C: 03+2] 
kubeadm: the use of read -p suggest this should be using bash [puppet] - 10https://gerrit.wikimedia.org/r/938899 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond) [14:48:24] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] kubernetes::master: Fix confd template as keys start with / [puppet] - 10https://gerrit.wikimedia.org/r/940369 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [14:50:04] (03CR) 10AOkoth: [C: 03+2] vrts: change blackbox http check [puppet] - 10https://gerrit.wikimedia.org/r/940367 (https://phabricator.wikimedia.org/T342366) (owner: 10AOkoth) [14:50:16] (03PS2) 10AOkoth: vrts: change blackbox http check [puppet] - 10https://gerrit.wikimedia.org/r/940367 (https://phabricator.wikimedia.org/T342366) [14:50:20] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:50:47] 10SRE-tools, 10Infrastructure-Foundations: sre.hosts.reimage: fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (10jbond) [14:53:14] (03PS4) 10Alexandros Kosiaris: wikifunctions: Add AppArmor profile usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/879282 (https://phabricator.wikimedia.org/T326785) [14:53:17] (03PS1) 10Alexandros Kosiaris: admin: Add wikifunctions apparmor profiles to PSP [deployment-charts] - 10https://gerrit.wikimedia.org/r/940371 (https://phabricator.wikimedia.org/T326785) [14:54:32] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:54:43] 10SRE, 10LDAP-Access-Requests: Grant Access to Turnilo for Mpossoupe - https://phabricator.wikimedia.org/T342335 (10andrea.denisse) 05In progress→03Resolved Glad to read. I'll close this as resolved but feel free to reach out if there's anything else I can help you with. [14:55:34] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:55:41] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[14:57:03] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PUT podautoscalers) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:57:47] (ConfdResourceFailed) resolved: (2) confd resource _etc_kubernetes_pki_kube-apiserver-sa-certs.pem.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [14:58:20] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host lvs1016.eqiad.wmnet with OS bookworm [15:01:07] PROBLEM - Check systemd state on mx2001 is CRITICAL: CRITICAL - degraded: The following units failed: generate_otrs_aliases.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:02:03] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PUT podautoscalers) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:02:53] (HelmReleaseBadStatus) firing: Helm release knative-serving/knative-serving on k8s-mlstaging@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:04:42] !log apine@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:04:46] !log apine@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:04:51] (03CR) 10Cory Massaro: [C: 03+2] Redeploy with new version of function-ochestrator. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/940196 (owner: 10Cory Massaro) [15:05:36] (03Merged) 10jenkins-bot: Redeploy with new version of function-ochestrator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/940196 (owner: 10Cory Massaro) [15:05:59] !log apine@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:06:50] !log apine@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:09:36] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:10:05] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:10:55] RECOVERY - Check systemd state on vrts2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:01] !log ayounsi@cumin1001 START - Cookbook sre.network.sonic-ssh for network device lsw1-e8-eqiad [15:11:56] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.sonic-ssh (exit_code=0) for network device lsw1-e8-eqiad [15:12:53] (HelmReleaseBadStatus) resolved: Helm release knative-serving/knative-serving on k8s-mlstaging@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:14:03] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:18:25] (ProbeDown) firing: Service vrts1001.eqiad.wmnet:1443 has failed probes (http_vrts1001_eqiad_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#vrts1001.eqiad.wmnet:1443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:19:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:19:36] (03PS1) 10Cory Massaro: Remove comments from ORCHESTRATOR_CONFIG; ensure all JSON strings are quoted. [deployment-charts] - 10https://gerrit.wikimedia.org/r/940375 [15:20:41] (03PS1) 10Jbond: vrts: drop bashisms and fix other CI issues [puppet] - 10https://gerrit.wikimedia.org/r/940376 (https://phabricator.wikimedia.org/T95064) [15:27:41] (03PS1) 10AOkoth: Revert "vrts: change blackbox http check" [puppet] - 10https://gerrit.wikimedia.org/r/940144 [15:31:31] (03PS2) 10Jbond: vrts: drop bashisms and fix other CI issues [puppet] - 10https://gerrit.wikimedia.org/r/940376 (https://phabricator.wikimedia.org/T95064) [15:31:46] (03Abandoned) 10Jbond: vrts: drop bashisms and fix other CI issues [puppet] - 10https://gerrit.wikimedia.org/r/940376 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond) [15:32:38] (03PS1) 10Jbond: vtrs: drop bashisms and fix other CI issues [puppet] - 10https://gerrit.wikimedia.org/r/940379 (https://phabricator.wikimedia.org/T95064) [15:33:40] (03CR) 10AOkoth: [C: 03+2] Revert "vrts: change blackbox http check" [puppet] - 10https://gerrit.wikimedia.org/r/940144 (owner: 10AOkoth) [15:35:10] !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 90 days, 0:00:00 on vrts2001.codfw.wmnet with reason: Read-only DB [15:35:24] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 90 days, 0:00:00 on vrts2001.codfw.wmnet with reason: Read-only DB [15:35:51] (03CR) 10Sergio Gimeno: [C: 04-1] "Pending decision for kgwiki and klwiki in 
T308135#9034014" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940347 (https://phabricator.wikimedia.org/T308135) (owner: 10Sergio Gimeno) [15:43:26] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10jbond) [15:48:10] (03PS1) 10Jbond: puppetdb-api: swap the production and next environments [puppet] - 10https://gerrit.wikimedia.org/r/940384 (https://phabricator.wikimedia.org/T342214) [15:48:38] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:49:26] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42642/console" [puppet] - 10https://gerrit.wikimedia.org/r/940384 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [15:58:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:59:30] (03PS1) 10Elukey: custom_deploy.d: add an extra selector to the istio ingress gw [deployment-charts] - 10https://gerrit.wikimedia.org/r/940386 [16:00:41] (03CR) 10Elukey: [C: 03+2] custom_deploy.d: add an extra selector to the istio ingress gw [deployment-charts] - 10https://gerrit.wikimedia.org/r/940386 (owner: 10Elukey) [16:02:20] (03PS1) 10Btullis: Use different WWN values for SAS HDDs and SSDs for cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/940388 (https://phabricator.wikimedia.org/T330151) [16:03:12] 10SRE-tools, 10Infrastructure-Foundations: sre.hosts.reimage: 
fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (10ssingh) @Fabfur and I observed the same issue on trying to reimage `lvs1016`. The cookbook starts the Debian installer, which completes successfully. The cookboo... [16:06:32] (03CR) 10Jdlrobson: [C: 03+1] "Mo: Please backport this using the backport calendar: https://wikitech.wikimedia.org/wiki/Deployments" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939312 (https://phabricator.wikimedia.org/T336527) (owner: 10Mabualruz) [16:07:50] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42643/console" [puppet] - 10https://gerrit.wikimedia.org/r/940388 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [16:09:17] (03CR) 10Btullis: [V: 03+1 C: 03+2] Use different WWN values for SAS HDDs and SSDs for cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/940388 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [16:23:45] (03CR) 10Klausman: "This change is ready for review." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/940391 (owner: 10Klausman) [16:37:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:42:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:19:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:20:21] (03PS1) 10Jbond: cumin::unprivmaster: Test using puppetdbapi-next [puppet] - 10https://gerrit.wikimedia.org/r/940396 (https://phabricator.wikimedia.org/T342214) [17:21:37] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42644/console" [puppet] - 10https://gerrit.wikimedia.org/r/940396 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [17:23:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:24:03] 
(ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:26:48] (03PS2) 10Jbond: cumin::unprivmaster: Test using puppetdbapi-next [puppet] - 10https://gerrit.wikimedia.org/r/940396 (https://phabricator.wikimedia.org/T342214) [17:27:56] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42645/console" [puppet] - 10https://gerrit.wikimedia.org/r/940396 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [17:28:52] (03CR) 10Jbond: [V: 03+1] "Will self self merge as this machine is not used" [puppet] - 10https://gerrit.wikimedia.org/r/940396 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [17:30:52] (03PS1) 10Daimona Eaytoy: beta: Add missing override for CampaignEvents [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940398 (https://phabricator.wikimedia.org/T342452) [17:32:38] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host an-worker1152.eqiad.wmnet with OS bullseye [17:32:39] (03CR) 10Jbond: [V: 03+1 C: 03+2] cumin::unprivmaster: Test using puppetdbapi-next [puppet] - 10https://gerrit.wikimedia.org/r/940396 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [17:32:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1152.eqiad.wmnet with OS bullseye [17:36:11] (03PS1) 10Daimona Eaytoy: Enable the CampaignEvents extension before loading CommonSettings-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940399 (https://phabricator.wikimedia.org/T342452) [17:36:50] 
(03PS2) 10Daimona Eaytoy: Enable the CampaignEvents extension before loading CommonSettings-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940399 (https://phabricator.wikimedia.org/T342452) [17:37:21] 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10RobH) [17:37:29] 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10RobH) [17:39:34] Hi, I know it's Friday, but would it be possible to make a beta-only deployment? Or does Deployments/Emergencies apply to beta changes, too? [17:39:44] (Context: T342452) [17:39:45] T342452: Beta overrides for CampaignEvents settings are being ignored - https://phabricator.wikimedia.org/T342452 [17:41:15] Daimona: I think a beta-only change would be fine as long as you're around to deal with any consequences. [17:41:39] I'll be around for the next 7 hours :) [17:41:45] Perfect [17:42:07] But I should also add, I'm not a deployer myself. [17:42:46] I can deploy for you. [17:44:36] Thanks, that'd be much appreciated :) [17:46:10] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10] cloudnet100[7-8] - https://phabricator.wikimedia.org/T342455 (10RobH) [17:46:54] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10] cloudnet100[7-8] - https://phabricator.wikimedia.org/T342455 (10RobH) [17:46:58] OK. 
Let me know when you're ready and what the gerrit change number is [17:47:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:51:08] It's the two patches linked to T342452. The first one is really a no-op, but it's needed for the second one not to break anything [17:51:09] T342452: Beta overrides for CampaignEvents settings are being ignored - https://phabricator.wikimedia.org/T342452 [17:51:17] (And I'm ready whenever you are) [17:52:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:54:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1152.eqiad.wmnet with reason: host reimage [17:54:21] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dancy@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940398 (https://phabricator.wikimedia.org/T342452) (owner: 10Daimona Eaytoy) [17:56:06] PROBLEM - Check systemd state on puppetdb1002 is CRITICAL: CRITICAL - degraded: The following units failed: uwsgi-puppetdb-microservice.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:56:46] (03Merged) 10jenkins-bot: beta: Add missing override for CampaignEvents [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940398 (https://phabricator.wikimedia.org/T342452) (owner: 10Daimona Eaytoy) [17:57:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1152.eqiad.wmnet with 
reason: host reimage [17:58:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dancy@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940399 (https://phabricator.wikimedia.org/T342452) (owner: 10Daimona Eaytoy) [17:58:53] (03Merged) 10jenkins-bot: Enable the CampaignEvents extension before loading CommonSettings-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940399 (https://phabricator.wikimedia.org/T342452) (owner: 10Daimona Eaytoy) [17:59:08] !log dancy@deploy1002 Started scap: Backport for [[gerrit:940399|Enable the CampaignEvents extension before loading CommonSettings-labs (T342452)]] [17:59:11] T342452: Beta overrides for CampaignEvents settings are being ignored - https://phabricator.wikimedia.org/T342452 [17:59:19] RECOVERY - Check systemd state on puppetdb1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:59:25] 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol200[6-8]-dev, cloudnet200[7-8]-dev - https://phabricator.wikimedia.org/T342456 (10RobH) [17:59:34] 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol200[6-8]-dev, cloudnet200[7-8]-dev - https://phabricator.wikimedia.org/T342456 (10RobH) [18:00:44] !log dancy@deploy1002 daimona and dancy: Backport for [[gerrit:940399|Enable the CampaignEvents extension before loading CommonSettings-labs (T342452)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [18:01:43] Daimona: Can you exercise one of the test servers? [18:06:17] Yup, will do immediately [18:10:29] So, things are working now and looking good, but one of the side effects of this change is that the campaignevents-beta-tester user group now also exists in beta. 
But it's not causing any harm, it's just useless [18:10:57] OK. Moving on [18:11:47] 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt200[4-6]-dev - https://phabricator.wikimedia.org/T342459 (10RobH) [18:12:23] 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt200[4-6]-dev - https://phabricator.wikimedia.org/T342459 (10RobH) [18:12:52] (And TBH, I'm not even sure if the group didn't exist before) [18:12:57] Thank you! [18:14:15] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:16:39] !log dancy@deploy1002 Finished scap: Backport for [[gerrit:940399|Enable the CampaignEvents extension before loading CommonSettings-labs (T342452)]] (duration: 17m 31s) [18:16:43] T342452: Beta overrides for CampaignEvents settings are being ignored - https://phabricator.wikimedia.org/T342452 [18:16:49] Daimona: All set! 
[18:17:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:18:32] (03PS1) 10Daimona Eaytoy: beta: Remove unneeded campaignevents-beta-tester user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940400 (https://phabricator.wikimedia.org/T342452) [18:20:19] PROBLEM - Check systemd state on puppetdb1002 is CRITICAL: CRITICAL - degraded: The following units failed: uwsgi-puppetdb-microservice.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:21:43] RECOVERY - Check systemd state on puppetdb1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:22:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:30:44] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host an-worker1152.eqiad.wmnet with OS bullseye [18:30:55] PROBLEM - Check systemd state on puppetdb1003 is CRITICAL: CRITICAL - degraded: The following units failed: uwsgi-puppetdb-microservice.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:30:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1152.eqiad.wmnet with OS bullseye [18:43:18] (KubernetesAPILatency) firing: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:44:31] RECOVERY - Check systemd state on puppetdb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:45:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1152.eqiad.wmnet with reason: host reimage [18:48:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:48:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1152.eqiad.wmnet with reason: host reimage [18:57:08] (03PS17) 10Ahmon Dancy: Scap: scap_source Use the "group" consistently [puppet] - 10https://gerrit.wikimedia.org/r/361796 (owner: 10Thcipriani) [18:58:33] (03PS18) 10Ahmon Dancy: Scap: scap_source Use the "group" consistently [puppet] - 10https://gerrit.wikimedia.org/r/361796 (https://phabricator.wikimedia.org/T342320) (owner: 10Thcipriani) [19:04:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1152.eqiad.wmnet with OS bullseye [19:04:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1152.eqiad.wmnet with OS bullseye completed: - an-worker1152 (*... 
[19:06:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jhancock.wm) [19:07:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host an-worker1151.eqiad.wmnet with OS bullseye [19:07:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1151.eqiad.wmnet with OS bullseye [19:09:39] (03PS1) 10Jbond: (WIP) puppetdb-microservice: update puppetdb micro service so it streams data [puppet] - 10https://gerrit.wikimedia.org/r/940403 (https://phabricator.wikimedia.org/T342458) [19:15:07] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:17:41] PROBLEM - Check systemd state on puppetdb1003 is CRITICAL: CRITICAL - degraded: The following units failed: uwsgi-puppetdb-microservice.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:17:44] (03PS1) 10Ahmon Dancy: Remove unreferenced profile::kubernetes::deployment_server::git_{owner,group} from hiera data [puppet] - 10https://gerrit.wikimedia.org/r/940406 [19:18:07] (03CR) 10CI reject: [V: 04-1] Remove unreferenced profile::kubernetes::deployment_server::git_{owner,group} from hiera data [puppet] - 10https://gerrit.wikimedia.org/r/940406 (owner: 10Ahmon Dancy) [19:18:43] (03PS2) 10Ahmon Dancy: Remove unreferenced hiera data [puppet] - 10https://gerrit.wikimedia.org/r/940406 [19:21:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1151.eqiad.wmnet with reason: host reimage [19:23:37] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:24:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1151.eqiad.wmnet with reason: host reimage [19:40:01] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [19:41:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [19:41:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1151.eqiad.wmnet with OS bullseye [19:41:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1151.eqiad.wmnet with OS bullseye completed: - an-worker1151 (*... 
[19:42:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jhancock.wm) [19:43:28] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host an-worker1150.eqiad.wmnet with OS bullseye [19:43:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1150.eqiad.wmnet with OS bullseye [19:57:31] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1150.eqiad.wmnet with reason: host reimage [20:02:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1150.eqiad.wmnet with reason: host reimage [20:04:11] !log apine@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [20:04:13] !log apine@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [20:04:35] (03CR) 10Cory Massaro: [C: 03+2] Remove comments from ORCHESTRATOR_CONFIG; ensure all JSON strings are quoted. [deployment-charts] - 10https://gerrit.wikimedia.org/r/940375 (owner: 10Cory Massaro) [20:05:24] (03Merged) 10jenkins-bot: Remove comments from ORCHESTRATOR_CONFIG; ensure all JSON strings are quoted. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/940375 (owner: 10Cory Massaro) [20:14:31] !log apine@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [20:15:02] !log apine@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [20:17:54] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:19:13] 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review, 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10cmassaro) Calls to the orchestrator now work! W... [20:19:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:19:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1150.eqiad.wmnet with OS bullseye [20:19:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1150.eqiad.wmnet with OS bullseye completed: - an-worker1150 (*... 
[20:20:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jhancock.wm) [20:21:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host an-worker1149.eqiad.wmnet with OS bullseye [20:21:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1149.eqiad.wmnet with OS bullseye [20:24:24] 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review, 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10cmassaro) @JMeybohm, it appears that there's a... [20:35:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1149.eqiad.wmnet with reason: host reimage [20:38:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1149.eqiad.wmnet with reason: host reimage [20:40:48] RECOVERY - Check systemd state on puppetdb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:54:04] (03PS2) 10Jbond: (WIP) puppetdb-microservice: update puppetdb micro service so it streams data [puppet] - 10https://gerrit.wikimedia.org/r/940403 (https://phabricator.wikimedia.org/T342458) [20:54:39] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:54:41] (03CR) 10Jbond: [C: 04-1] "this doesn't work" [puppet] - 10https://gerrit.wikimedia.org/r/940403 (https://phabricator.wikimedia.org/T342458) (owner: 10Jbond) [21:00:02] !log jhancock@cumin2002 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:00:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1149.eqiad.wmnet with OS bullseye [21:00:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1149.eqiad.wmnet with OS bullseye completed: - an-worker1149 (*... [21:01:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jhancock.wm) [21:02:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jhancock.wm) 05Open→03Resolved @BTullis finally finished. thanks for your patience. 
[21:23:00] 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10KOfori) [21:23:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:43:45] (03CR) 10Cwhite: [C: 03+2] logstash: remove thanos log cloning [puppet] - 10https://gerrit.wikimedia.org/r/937604 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [22:37:36] 10SRE, 10Commons, 10Infrastructure-Foundations, 10Traffic, 10netops: $wgUseInstantCommons throws an SSL error - https://phabricator.wikimedia.org/T342473 (10Tgr) Usually this means some kind of man-in-the-middle scenario (or, less likely, misconfiguration at the target server) - you are getting a certifi... [23:01:14] 10SRE, 10Commons, 10Infrastructure-Foundations, 10Traffic, 10netops: $wgUseInstantCommons throws an SSL error - https://phabricator.wikimedia.org/T342473 (10Platonides) No problem connecting to commons.wikimedia.org from Germany. Note: connection from Germany = DigiCert wildcard signed by DigiCert TLS H... [23:15:26] 10SRE, 10Commons, 10Infrastructure-Foundations, 10Traffic, 10netops: $wgUseInstantCommons throws an SSL error - https://phabricator.wikimedia.org/T342473 (10taavi) This looks like https://bugs.launchpad.net/ubuntu/+source/curl/+bug/2028170. [23:34:42] 10SRE, 10Commons, 10Infrastructure-Foundations, 10Traffic, 10netops: $wgUseInstantCommons throws an SSL error - https://phabricator.wikimedia.org/T342473 (10OriginalAuthority) Looks like shutting down the server and booting it back up again fixed the issue. But yes, it probably is the issue above, taavi. 
[23:49:43] 10SRE, 10Commons, 10Infrastructure-Foundations, 10Traffic, 10netops: $wgUseInstantCommons throws an SSL error - https://phabricator.wikimedia.org/T342473 (10Platonides) 05Open→03Resolved a:03Platonides