[00:00:08] (PS1) HMonroy: wikidiff2: set maxSplitSize = 10 on group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/950049 (https://phabricator.wikimedia.org/T341754)
[00:01:44] (PS11) Krinkle: webperf,arclamp: Rename to clarify as separate roles [puppet] - https://gerrit.wikimedia.org/r/935523
[00:06:14] (PS1) Krinkle: webperf: Remove unused xhgui hierdata [labs/private] - https://gerrit.wikimedia.org/r/950050 (https://phabricator.wikimedia.org/T342724)
[00:06:20] (PS1) Krinkle: webperf: Add hieradata/role/common/webperf.yaml [labs/private] - https://gerrit.wikimedia.org/r/950051
[00:06:26] (PS1) Krinkle: webperf: Remove processors_and_site.yaml [labs/private] - https://gerrit.wikimedia.org/r/950052
[00:06:50] (CR) Krinkle: "Awaiting for:" [puppet] - https://gerrit.wikimedia.org/r/935523 (owner: Krinkle)
[00:07:36] (CR) Krinkle: "Please perform the same change in the real private repo as well!" [labs/private] - https://gerrit.wikimedia.org/r/950051 (owner: Krinkle)
[00:10:06] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:14:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[00:14:24] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:15:04] (ConfdResourceFailed) firing: (144) confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring -
https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:19:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[00:38:26] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/949224
[00:38:28] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/949224 (owner: TrainBranchBot)
[00:46:09] (CR) Andrea Denisse: [C: +1] "LGTM, thank you!" [labs/private] - https://gerrit.wikimedia.org/r/950051 (owner: Krinkle)
[00:46:50] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:47:00] (CR) Andrea Denisse: [C: +1] "LGTM, thank you!" [labs/private] - https://gerrit.wikimedia.org/r/950050 (https://phabricator.wikimedia.org/T342724) (owner: Krinkle)
[00:54:44] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/949224 (owner: TrainBranchBot)
[01:01:00] (CR) Andrea Denisse: [C: +2] webperf: Remove unused xhgui hierdata [labs/private] - https://gerrit.wikimedia.org/r/950050 (https://phabricator.wikimedia.org/T342724) (owner: Krinkle)
[01:01:02] (CR) Andrea Denisse: [V: +2 C: +2] webperf: Remove unused xhgui hierdata [labs/private] - https://gerrit.wikimedia.org/r/950050 (https://phabricator.wikimedia.org/T342724) (owner: Krinkle)
[01:12:16] (CR) Andrea Denisse: [C: +2] "I added this change to the private repository."
[labs/private] - https://gerrit.wikimedia.org/r/950051 (owner: Krinkle)
[01:12:18] (CR) Andrea Denisse: [V: +2 C: +2] webperf: Add hieradata/role/common/webperf.yaml [labs/private] - https://gerrit.wikimedia.org/r/950051 (owner: Krinkle)
[01:15:36] (CR) Andrea Denisse: [V: +2 C: +2] "Change applied in the private repository as well." [labs/private] - https://gerrit.wikimedia.org/r/950050 (https://phabricator.wikimedia.org/T342724) (owner: Krinkle)
[01:29:02] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:33:09] (PS2) Majavah: Set WRITE_BOTH for OAuth multiple devices to checkuserwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/949629 (https://phabricator.wikimedia.org/T242031)
[01:33:20] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:33:30] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/949629 (https://phabricator.wikimedia.org/T242031) (owner: Majavah)
[01:34:10] (Merged) jenkins-bot: Set WRITE_BOTH for OAuth multiple devices to checkuserwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/949629 (https://phabricator.wikimedia.org/T242031) (owner: Majavah)
[01:34:32] !log taavi@deploy1002 Started scap: Backport for [[gerrit:949629|Set WRITE_BOTH for OAuth multiple devices to checkuserwiki (T242031)]]
[01:34:41] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031
[01:34:45] If anything breaks it's taavi's fault
[01:34:53] amir said it's fine
[01:35:03] [citation needed]
[01:35:18] {{cite personal experience}}
[01:35:33] WP:NOR
[01:36:13] !log taavi@deploy1002 taavi: Backport for [[gerrit:949629|Set
WRITE_BOTH for OAuth multiple devices to checkuserwiki (T242031)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[01:36:18] !log taavi@deploy1002 taavi: Continuing with sync
[01:38:08] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:42:21] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:949629|Set WRITE_BOTH for OAuth multiple devices to checkuserwiki (T242031)]] (duration: 07m 48s)
[01:42:25] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031
[01:42:28] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_esams_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:57:20] (CR) Krinkle: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/935523 (owner: Krinkle)
[02:06:39] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:31:39] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:35:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:39:24] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed:
prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:48:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:50:10] (PS2) Tim Starling: ResourceLoader: Forwards-compatible mw.loader.impl() [core] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/947914 (https://phabricator.wikimedia.org/T343407)
[02:52:24] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:54:33] (CR) Tim Starling: [C: +2] ResourceLoader: Forwards-compatible mw.loader.impl() [core] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/947914 (https://phabricator.wikimedia.org/T343407) (owner: Tim Starling)
[02:54:36] (PS1) KartikMistry: Update MinT to 2023-08-14-091403-production [deployment-charts] - https://gerrit.wikimedia.org/r/950063 (https://phabricator.wikimedia.org/T336683)
[02:55:48] (CR) Tim Starling: ResourceLoader: Forwards-compatible mw.loader.impl() [core] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/947914 (https://phabricator.wikimedia.org/T343407) (owner: Tim Starling)
[03:46:05] (CR) Tim Starling: [C: +1] wikidiff2: set maxSplitSize = 10 on group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/950049 (https://phabricator.wikimedia.org/T341754) (owner: HMonroy)
[04:02:06] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:06:14] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:09:02] RECOVERY - Check systemd state on
maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:13:12] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:15:05] (ConfdResourceFailed) firing: (144) confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[04:16:02] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:20:16] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:26:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:30:26] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:31:10] SRE, Lingua Libre, Traffic: Network issue between LinguaLibre and Wikimedia Commons - https://phabricator.wikimedia.org/T344421 (mickeybarber) Open→Resolved a:mickeybarber Thx all for your help. It's fixed. It was an old network conf the problem: Commons name resolution was hard-fixed fo...
[04:48:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:49:33] (Abandoned) Tim Starling: ResourceLoader: Forwards-compatible mw.loader.impl() [core] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/947914 (https://phabricator.wikimedia.org/T343407) (owner: Tim Starling)
[04:53:06] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:53:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:57:20] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_esams_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:59:40] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (install3003), Fresh: 127 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:10:02] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:14:24] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:26:02] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:30:26] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:31:19] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:34:58] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/948686 (https://phabricator.wikimedia.org/T344199) (owner: Cwhite)
[05:41:19] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:42:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:43:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host bast3007.wikimedia.org
[05:43:47] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[05:46:26] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:48:53] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[05:48:58] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host bast3007.wikimedia.org
[05:58:02] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:00:02] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 128 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230818T0600)
[06:02:14] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:12:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:13:17] SRE, serviceops-radar, Patch-For-Review, Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (Jdforrester-WMF) >>! In T210704#9089606, @Aklapper wrote: > * Is `3d2png` superseded by T267327 (per T225678...
[06:15:01] SRE, serviceops-radar, Patch-For-Review, Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (Jdforrester-WMF)
[06:16:16] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:28:09] (PS2) Majavah: admin: New SSH key for zabe [puppet] - https://gerrit.wikimedia.org/r/949999 (owner: Zabe)
[06:28:32] (CR) Majavah: [C: +2] "also verified in-person" [puppet] - https://gerrit.wikimedia.org/r/949999 (owner: Zabe)
[06:51:57] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[06:53:02] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:56:42] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[06:57:14] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_esams_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:57:14] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[06:57:43] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[06:58:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230818T0700)
[07:03:02] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:07:15] (PS1) JMeybohm: Revert "Revert "Remove limits in ResourceQuota and container limitanges for mediawiki"" [deployment-charts] - https://gerrit.wikimedia.org/r/950068
[07:07:20] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:07:43] (PS2) JMeybohm: Revert "Revert "Remove limits in ResourceQuota and container limitanges for mediawiki"" [deployment-charts] - https://gerrit.wikimedia.org/r/950068 (https://phabricator.wikimedia.org/T343978)
[07:08:06] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[07:14:07] (PS3) JMeybohm: Revert "Revert "Remove limits in ResourceQuota and container limitanges for mediawiki"" [deployment-charts] - https://gerrit.wikimedia.org/r/950068 (https://phabricator.wikimedia.org/T343978)
[07:16:00] (CR) CI reject: [V: -1] Revert "Revert "Remove limits in ResourceQuota and container limitanges for mediawiki"" [deployment-charts] - https://gerrit.wikimedia.org/r/950068 (https://phabricator.wikimedia.org/T343978) (owner: JMeybohm)
[07:22:31] (PS4) JMeybohm: Revert "Revert "Remove limits in ResourceQuota and container limitanges for mediawiki"" [deployment-charts] - https://gerrit.wikimedia.org/r/950068 (https://phabricator.wikimedia.org/T343978)
[07:34:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet
[07:41:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host
sretest1001.eqiad.wmnet
[07:42:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:45:25] (PS1) Zabe: manage-dblist: Add lang to langlist if not present [mediawiki-config] - https://gerrit.wikimedia.org/r/950132
[07:46:16] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[07:46:25] (PS1) Muehlenhoff: standard_packages: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/950133
[07:46:26] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:47:33] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:47:43] (PS3) Cathal Mooney: Reverse includes for new esams range [dns] - https://gerrit.wikimedia.org/r/950045
[07:48:35] (CR) CI reject: [V: -1] Reverse includes for new esams range [dns] - https://gerrit.wikimedia.org/r/950045 (owner: Cathal Mooney)
[07:50:48] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/950133 (owner: Muehlenhoff)
[07:54:05] (PS1) Gehel: Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361)
[07:54:39] (CR) CI reject: [V: -1] Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361) (owner: Gehel)
[08:00:23] (CR) Klausman: prometheus: Add recording rules for istio traffic on k8s (1 comment) [puppet] - https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620) (owner: Klausman)
[08:00:50] (PS4) Cathal Mooney: Reverse includes for new esams range [dns] - https://gerrit.wikimedia.org/r/950045
[08:01:01] (CR)
Gehel: [C: -1] "Previous version of this patch (I02f5bacfa36c985969b2ddbeef4257caf92dddb1 and I384839223b29621d34d9cd5fe095ab845f37b3a3) have been reverte" [puppet] - https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361) (owner: Gehel)
[08:01:47] (CR) CI reject: [V: -1] Reverse includes for new esams range [dns] - https://gerrit.wikimedia.org/r/950045 (owner: Cathal Mooney)
[08:07:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test1002.wikimedia.org
[08:08:06] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:08:18] (CR) Clément Goubert: [C: +1] Revert "Revert "Remove limits in ResourceQuota and container limitanges for mediawiki"" [deployment-charts] - https://gerrit.wikimedia.org/r/950068 (https://phabricator.wikimedia.org/T343978) (owner: JMeybohm)
[08:10:30] (CR) JMeybohm: [C: +2] Revert "Revert "Remove limits in ResourceQuota and container limitanges for mediawiki"" [deployment-charts] - https://gerrit.wikimedia.org/r/950068 (https://phabricator.wikimedia.org/T343978) (owner: JMeybohm)
[08:11:47] (PS1) Clément Goubert: mediawiki: Add exporter limits and requests [deployment-charts] - https://gerrit.wikimedia.org/r/950138 (https://phabricator.wikimedia.org/T342748)
[08:12:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test1002.wikimedia.org
[08:12:26] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:12:55] (Merged) jenkins-bot: Revert "Revert "Remove limits in ResourceQuota and container limitanges for mediawiki"" [deployment-charts] - https://gerrit.wikimedia.org/r/950068 (https://phabricator.wikimedia.org/T343978) (owner:
JMeybohm)
[08:13:10] (CR) Muehlenhoff: Start Blazegraph from systemd unit, without runBlazegraph.sh (1 comment) [puppet] - https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361) (owner: Gehel)
[08:15:06] (ConfdResourceFailed) firing: (144) confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[08:15:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet
[08:17:27] (PS2) Clément Goubert: mediawiki: Add exporter limits and requests [deployment-charts] - https://gerrit.wikimedia.org/r/950138 (https://phabricator.wikimedia.org/T342748)
[08:21:03] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[08:21:35] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[08:22:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet
[08:23:21] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[08:24:40] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[08:24:49] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[08:24:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet
[08:25:36] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[08:26:24] (CR) David Caro: [C: +1] "LGTM, just to verify what I saw on PCC, no real changes right?"
[puppet] - https://gerrit.wikimedia.org/r/944937 (owner: Muehlenhoff)
[08:33:44] (CR) Muehlenhoff: cloudcephosd: Remove Ferm-specific syntax (1 comment) [puppet] - https://gerrit.wikimedia.org/r/944937 (owner: Muehlenhoff)
[08:36:13] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti-test2002.codfw.wmnet
[08:36:46] PROBLEM - Check systemd state on ganeti-test2002 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:40:36] (PS3) Clément Goubert: mediawiki: Add exporter limits and requests [deployment-charts] - https://gerrit.wikimedia.org/r/950138 (https://phabricator.wikimedia.org/T342748)
[08:42:34] RECOVERY - Check systemd state on ganeti-test2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:43:42] (PS1) Muehlenhoff: Also run spec tests on Bookworm [puppet] - https://gerrit.wikimedia.org/r/950141
[08:45:15] (PS4) Clément Goubert: mediawiki: Add exporter limits and requests [deployment-charts] - https://gerrit.wikimedia.org/r/950138 (https://phabricator.wikimedia.org/T342748)
[08:47:31] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/950133 (owner: Muehlenhoff)
[08:47:46] (PS1) Muehlenhoff: autoinstall: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/950143
[08:48:00] (PS5) Clément Goubert: mediawiki: Add exporter limits and requests [deployment-charts] - https://gerrit.wikimedia.org/r/950138 (https://phabricator.wikimedia.org/T342748)
[08:50:00] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/950143 (owner: Muehlenhoff)
[08:50:20] (PS6) Clément Goubert: mediawiki: Add exporter limits and requests [deployment-charts] - https://gerrit.wikimedia.org/r/950138
(https://phabricator.wikimedia.org/T342748)
[08:54:26] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/950141 (owner: Muehlenhoff)
[08:58:23] (PS1) Clément Goubert: admin: Add mareikeheuer to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/950144 (https://phabricator.wikimedia.org/T344341)
[08:59:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:03:26] (CR) MVernon: [V: +2 C: +2] hiera: add fake credential for swift user search_update_pipeline [labs/private] - https://gerrit.wikimedia.org/r/949944 (https://phabricator.wikimedia.org/T342620) (owner: MVernon)
[09:03:37] (CR) MVernon: [C: +2] hiera: add swift user search_update_pipeline [puppet] - https://gerrit.wikimedia.org/r/949943 (https://phabricator.wikimedia.org/T342620) (owner: MVernon)
[09:04:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:04:22] SRE, SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (Clement_Goubert) >>! In T344257#9099818, @mpopov wrote: > @Clement_Goubert: Thank you! For my own future reference and @OSefu-...
[09:04:33] SRE, SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (Clement_Goubert)
[09:05:06] (PS1) Muehlenhoff: Blacklist exfat [puppet] - https://gerrit.wikimedia.org/r/950145
[09:05:24] (PS2) Muehlenhoff: Blacklist exfat [puppet] - https://gerrit.wikimedia.org/r/950145
[09:06:48] SRE, SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (Clement_Goubert) Resolved→Open This new key has been added to WMCS apparently: ` darthmon uses the same SSH key(s) in WMCS and production: {'AAAAC3NzaC1lZDI1NTE5AAAAI...
[09:07:21] (PS18) David Caro: WIP: replica_cnf_api: add envvars backend [puppet] - https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691)
[09:09:53] (CR) CI reject: [V: -1] WIP: replica_cnf_api: add envvars backend [puppet] - https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: David Caro)
[09:11:02] (CR) JMeybohm: [C: +2] mediawiki: Add exporter limits and requests [deployment-charts] - https://gerrit.wikimedia.org/r/950138 (https://phabricator.wikimedia.org/T342748) (owner: Clément Goubert)
[09:12:13] (Merged) jenkins-bot: mediawiki: Add exporter limits and requests [deployment-charts] - https://gerrit.wikimedia.org/r/950138 (https://phabricator.wikimedia.org/T342748) (owner: Clément Goubert)
[09:13:21] !log roll-restart thanos swift frontends to add user T342620
[09:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:25] T342620: Storage request: swift s3 bucket for flink search-update-pipeline checkpointing - https://phabricator.wikimedia.org/T342620
[09:13:31] !log mvernon@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe
[09:15:12] PROBLEM
- Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_search-update-pipeline:prod.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:15:53] SRE-swift-storage, Commons: HTTP 503 error on action=revert for specific Commons file - https://phabricator.wikimedia.org/T344480 (Aklapper) [09:18:06] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:19:30] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [09:20:10] !log mvernon@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe [09:20:17] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [09:20:33] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [09:20:58] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [09:21:06] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:22:52] SRE-swift-storage, Commons: HTTP 503 error on action=revert for specific Commons file - https://phabricator.wikimedia.org/T344480 (MatthewVernon) If I visit the link in the bug report, the error message I get is "Unexpected value: "oldimage"="".
[09:23:22] (PS19) David Caro: replica_cnf_api: add envvars backend [puppet] - https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [09:23:37] SRE-swift-storage, Data-Persistence, Data-Platform-SRE, Discovery-Search (Current work): Storage request: swift s3 bucket for flink search-update-pipeline checkpointing - https://phabricator.wikimedia.org/T342620 (MatthewVernon) Open→Resolved a:MatthewVernon This is done now. [09:25:22] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:25:56] (CR) CI reject: [V: -1] replica_cnf_api: add envvars backend [puppet] - https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: David Caro) [09:27:43] (PS20) David Caro: replica_cnf_api: add envvars backend [puppet] - https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [09:30:18] (CR) CI reject: [V: -1] replica_cnf_api: add envvars backend [puppet] - https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: David Caro) [09:31:03] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [09:31:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:31:53] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/950144 (https://phabricator.wikimedia.org/T344341) (owner: Clément Goubert) [09:32:38] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to LDAP/WMDE and LDAP/NDA for mareikeheuer - https://phabricator.wikimedia.org/T344341 (MoritzMuehlenhoff) >>! In T344341#9099886, @KFrancis wrote: > The NDA has been signed. Please proceed with next steps. Thank you! Thanks! Can you please...
[09:32:44] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add reverses for new link addressing esams - cmooney@cumin1001" [09:34:12] (PS5) Cathal Mooney: Reverse includes for new esams ranges [dns] - https://gerrit.wikimedia.org/r/950045 [09:34:58] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/950145 (owner: Muehlenhoff) [09:35:02] (PS2) Clément Goubert: mw-api-int: Set requests based on php.workers [deployment-charts] - https://gerrit.wikimedia.org/r/949957 (https://phabricator.wikimedia.org/T342748) [09:35:18] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:35:31] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/950143 (owner: Muehlenhoff) [09:35:57] (CR) Cathal Mooney: [C: +2] Reverse includes for new esams ranges [dns] - https://gerrit.wikimedia.org/r/950045 (owner: Cathal Mooney) [09:36:31] (CR) Clément Goubert: [C: +2] admin: Add mareikeheuer to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/950144 (https://phabricator.wikimedia.org/T344341) (owner: Clément Goubert) [09:38:06] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:42:18] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:42:22] SRE-swift-storage, Commons, MediaWiki-Revision-deletion: HTTP 503 error on action=revert for specific Commons file - https://phabricator.wikimedia.org/T344480 (Aklapper) Same here, reminds me of {T328112} [09:44:50] !log cmooney@cumin1001
END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add reverses for new link addressing esams - cmooney@cumin1001" [09:44:50] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:44:57] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to LDAP/WMDE and LDAP/NDA for mareikeheuer - https://phabricator.wikimedia.org/T344341 (Clement_Goubert) In progress→Resolved a:Clement_Goubert ` cgoubert@mwmaint1002:~$ ldapsearch -x cn=wmde | grep mareikeheuer... [09:44:58] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [09:48:02] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:48:33] (CR) JMeybohm: [C: +1] mw-api-int: Set requests based on php.workers [deployment-charts] - https://gerrit.wikimedia.org/r/949957 (https://phabricator.wikimedia.org/T342748) (owner: Clément Goubert) [09:48:46] (CR) Clément Goubert: [C: +2] mw-api-int: Set requests based on php.workers [deployment-charts] - https://gerrit.wikimedia.org/r/949957 (https://phabricator.wikimedia.org/T342748) (owner: Clément Goubert) [09:49:32] (Merged) jenkins-bot: mw-api-int: Set requests based on php.workers [deployment-charts] - https://gerrit.wikimedia.org/r/949957 (https://phabricator.wikimedia.org/T342748) (owner: Clément Goubert) [09:50:27] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add reverses for new link addressing esams - cmooney@cumin1001" [09:50:29] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [09:51:34] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [09:51:48] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:52:22] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:52:26] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add reverses for new link addressing esams - cmooney@cumin1001" [09:52:26] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:53:22] !log upgrade idp-test to OpenJDK 11.0.20 [09:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:44] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [09:54:51] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [09:55:16] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [09:55:31] !log cmooney@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [09:56:34] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [10:01:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:01:27] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add reverses for new link addressing esams - cmooney@cumin1001" [10:02:29] (CR) Hnowlan: [C: +2] aqs: enable geo_analytics user [puppet] - https://gerrit.wikimedia.org/r/949947 (https://phabricator.wikimedia.org/T336400) (owner: Hnowlan) [10:03:10] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add reverses for new link addressing esams - cmooney@cumin1001" [10:03:11] !log cmooney@cumin1001 END (PASS)
- Cookbook sre.dns.netbox (exit_code=0) [10:03:44] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host bast3007.wikimedia.org [10:03:45] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:05:24] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:05:46] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast3007.wikimedia.org - jmm@cumin2002" [10:06:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast3007.wikimedia.org - jmm@cumin2002" [10:06:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:06:31] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache bast3007.wikimedia.org on all recursors [10:06:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast3007.wikimedia.org on all recursors [10:07:03] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast3007.wikimedia.org - jmm@cumin2002" [10:07:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast3007.wikimedia.org - jmm@cumin2002" [10:11:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast3007.wikimedia.org with OS bookworm [10:22:09] (PS1) Muehlenhoff: Add component/wmf-laptop for bookworm [puppet] - https://gerrit.wikimedia.org/r/950152 [10:32:41] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on bast3007.wikimedia.org with reason: host reimage [10:35:38] (PS1) Jgiannelos: Enable aws-sdk
(s3) debug logging [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/950155 (https://phabricator.wikimedia.org/T344324) [10:35:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast3007.wikimedia.org with reason: host reimage [10:44:46] (CR) MVernon: [C: +1] "LGTM :)" [puppet] - https://gerrit.wikimedia.org/r/950146 (https://phabricator.wikimedia.org/T344257) (owner: Clément Goubert) [10:45:58] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/950146 (https://phabricator.wikimedia.org/T344257) (owner: Clément Goubert) [10:46:27] (CR) Clément Goubert: [C: +2] admin: Add osefu to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/950146 (https://phabricator.wikimedia.org/T344257) (owner: Clément Goubert) [10:50:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast3007.wikimedia.org with OS bookworm [10:50:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast3007.wikimedia.org [10:53:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test2002.wikimedia.org [10:53:47] (PS1) Btullis: Update the default user role in Superset to be 'WMF Analyst' [puppet] - https://gerrit.wikimedia.org/r/950157 (https://phabricator.wikimedia.org/T328457) [10:55:53] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "new bastion - jmm@cumin2002" [10:56:04] (PS1) Clément Goubert: admin: Add kerberos acces to osefu [puppet] - https://gerrit.wikimedia.org/r/950158 (https://phabricator.wikimedia.org/T344257) [10:57:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test2002.wikimedia.org [10:57:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "new bastion - jmm@cumin2002" [10:58:38] (CR)
Btullis: [C: +1] admin: Add kerberos acces to osefu [puppet] - https://gerrit.wikimedia.org/r/950158 (https://phabricator.wikimedia.org/T344257) (owner: Clément Goubert) [10:58:52] (CR) Clément Goubert: [C: +2] admin: Add kerberos acces to osefu [puppet] - https://gerrit.wikimedia.org/r/950158 (https://phabricator.wikimedia.org/T344257) (owner: Clément Goubert) [11:00:02] (PS1) Muehlenhoff: Make bast3007 a bastion [puppet] - https://gerrit.wikimedia.org/r/950159 (https://phabricator.wikimedia.org/T344355) [11:00:34] SRE, ops-knams, DC-Ops, Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (MoritzMuehlenhoff) [11:02:00] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/950152 (owner: Muehlenhoff) [11:04:44] SRE, SRE-Access-Requests, Patch-For-Review: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (Clement_Goubert) a:OSefu-WMF→BTullis The shell access has been created, as well as the kerberos...
[11:05:34] (CR) Effie Mouzeli: [C: +1] Enable aws-sdk (s3) debug logging [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/950155 (https://phabricator.wikimedia.org/T344324) (owner: Jgiannelos) [11:06:01] (CR) Muehlenhoff: [C: +2] Add component/wmf-laptop for bookworm [puppet] - https://gerrit.wikimedia.org/r/950152 (owner: Muehlenhoff) [11:06:37] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:06:40] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:07:42] (CR) Jgiannelos: [C: +2] Enable aws-sdk (s3) debug logging [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/950155 (https://phabricator.wikimedia.org/T344324) (owner: Jgiannelos) [11:08:23] (Merged) jenkins-bot: Enable aws-sdk (s3) debug logging [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/950155 (https://phabricator.wikimedia.org/T344324) (owner: Jgiannelos) [11:11:40] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:14:43] (PS1) Jgiannelos: tegola-vector-tiles: Bump to latest image version [deployment-charts] - https://gerrit.wikimedia.org/r/950160 [11:15:26] (PS1) Urbanecm: SpecialGlobalGroupMembership: Normalize usernames [extensions/CentralAuth] (wmf/1.41.0-wmf.22) - https://gerrit.wikimedia.org/r/950071 (https://phabricator.wikimedia.org/T344495) [11:15:34] (CR) Joal: [C: +1] "Ok
for me - I wonder if we should automate the creation of the role or if it's ok to have it manually created." [puppet] - https://gerrit.wikimedia.org/r/950157 (https://phabricator.wikimedia.org/T328457) (owner: Btullis) [11:17:18] (PS2) Effie Mouzeli: tegola-vector-tiles: Bump to latest image version [deployment-charts] - https://gerrit.wikimedia.org/r/950160 (owner: Jgiannelos) [11:23:36] (PS3) Effie Mouzeli: tegola-vector-tiles: use tegola image with debug enabled on codfw [deployment-charts] - https://gerrit.wikimedia.org/r/950160 (https://phabricator.wikimedia.org/T344324) (owner: Jgiannelos) [11:24:03] (CR) Effie Mouzeli: [C: +1] tegola-vector-tiles: use tegola image with debug enabled on codfw [deployment-charts] - https://gerrit.wikimedia.org/r/950160 (https://phabricator.wikimedia.org/T344324) (owner: Jgiannelos) [11:25:05] (CR) Jgiannelos: [C: +2] tegola-vector-tiles: use tegola image with debug enabled on codfw [deployment-charts] - https://gerrit.wikimedia.org/r/950160 (https://phabricator.wikimedia.org/T344324) (owner: Jgiannelos) [11:25:47] (Merged) jenkins-bot: tegola-vector-tiles: use tegola image with debug enabled on codfw [deployment-charts] - https://gerrit.wikimedia.org/r/950160 (https://phabricator.wikimedia.org/T344324) (owner: Jgiannelos) [11:27:24] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/950159 (https://phabricator.wikimedia.org/T344355) (owner: Muehlenhoff) [11:29:35] (CR) Muehlenhoff: [C: +2] Make bast3007 a bastion [puppet] - https://gerrit.wikimedia.org/r/950159 (https://phabricator.wikimedia.org/T344355) (owner: Muehlenhoff) [11:30:07] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [11:31:05] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply [11:39:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:42:49] PROBLEM - Check systemd state on kubernetes1006 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:22] (PS8) Hnowlan: thumbor: remove thumbor server configuration [puppet] - https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488) [11:46:00] (CR) CI reject: [V: -1] thumbor: remove thumbor server configuration [puppet] - https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488) (owner: Hnowlan) [11:47:15] (PS9) Hnowlan: thumbor: remove thumbor server configuration [puppet] - https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488) [11:49:57] RECOVERY - Check systemd state on kubernetes1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:53:07] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:25] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_esams_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:30] (CR) Muehlenhoff: [C: +1] thumbor: remove thumbor server configuration [puppet] - https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488) (owner: Hnowlan) [12:10:28] (CR) Btullis: Update the default user role in Superset to be 'WMF Analyst' (1 comment) [puppet] - https://gerrit.wikimedia.org/r/950157 (https://phabricator.wikimedia.org/T328457) (owner: Btullis) [12:14:46] (PS1) ArielGlenn: Update scatter.red's dump mirrors hostname [puppet] - https://gerrit.wikimedia.org/r/950164 [12:15:08] (ConfdResourceFailed) firing: (144) confd resource
_srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:22:40] (CR) ArielGlenn: [C: +2] Update scatter.red's dump mirrors hostname [puppet] - https://gerrit.wikimedia.org/r/950164 (owner: ArielGlenn) [12:23:07] SRE-swift-storage, collaboration-services: Investigate object storage for Gitlab - https://phabricator.wikimedia.org/T336234 (eoghan) Testing is going well so far. Next week, we plan to test failing over to a different datacentre to confirm that we can read objects from Swift in codfw that were written i... [12:25:43] (PS1) Btullis: Add the members package to base::standard_packages [puppet] - https://gerrit.wikimedia.org/r/950166 [12:27:02] SRE-swift-storage, Commons, MediaWiki-Revision-deletion: HTTP 503 error on action=revert for specific Commons file - https://phabricator.wikimedia.org/T344480 (C.Suthorn) [12:28:30] SRE-swift-storage, Commons, MediaWiki-Revision-deletion: HTTP 503 error on action=revert for specific Commons file - https://phabricator.wikimedia.org/T344480 (C.Suthorn) Oldimage is a needed parameter and obviously hidden in a post parameter. I have changed the URL ins this ticket, but of course the...
[12:29:27] SRE-swift-storage, Commons, MediaWiki-Revision-deletion: HTTP 503 error on action=revert for specific Commons file - https://phabricator.wikimedia.org/T344480 (C.Suthorn) [12:34:04] (PS1) Muehlenhoff: netbox: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/950167 [12:36:05] (PS1) Sergio Gimeno: GrowthExperiments: turn off AddLink in aswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/950168 (https://phabricator.wikimedia.org/T344319) [12:36:46] (CR) Sergio Gimeno: [C: -1] "Waiting for ambassador to inform aswiki community about the temporary disabling of AddLink task." [mediawiki-config] - https://gerrit.wikimedia.org/r/950168 (https://phabricator.wikimedia.org/T344319) (owner: Sergio Gimeno) [12:43:03] (CR) Muehlenhoff: "No objections, the tool is small enough." [puppet] - https://gerrit.wikimedia.org/r/950166 (owner: Btullis) [12:52:56] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/950167 (owner: Muehlenhoff) [13:12:58] (CR) Btullis: Add the members package to base::standard_packages (1 comment) [puppet] - https://gerrit.wikimedia.org/r/950166 (owner: Btullis) [13:13:12] (Abandoned) Btullis: Add the members package to base::standard_packages [puppet] - https://gerrit.wikimedia.org/r/950166 (owner: Btullis) [13:22:40] (CR) Joal: [C: +1] "Thank you for the comment @btullis - this all makes sense :)" [puppet] - https://gerrit.wikimedia.org/r/950157 (https://phabricator.wikimedia.org/T328457) (owner: Btullis) [13:32:46] (CR) JHathaway: [C: +1] "+1, is it worth trying to pare down the list of supported file systems more?"
[puppet] - https://gerrit.wikimedia.org/r/950145 (owner: Muehlenhoff) [13:33:31] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/950167 (owner: Muehlenhoff) [13:34:26] (PS1) Andrew Bogott: wmcs-backups: correct commandline for volume backup [puppet] - https://gerrit.wikimedia.org/r/950172 [13:43:46] (PS2) Urbanecm: [beta] Growth: Enable user research opt-in checkbox on few wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/949849 (https://phabricator.wikimedia.org/T342353) [13:49:49] (ConfdResourceFailed) firing: (144) confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [13:53:06] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:57:20] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_esams_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:02:51] (CR) Andrew Bogott: [C: +2] wmcs-backups: correct commandline for volume backup [puppet] - https://gerrit.wikimedia.org/r/950172 (owner: Andrew Bogott) [14:03:08] (PS1) Andrew Bogott: wmcs-backups: When cleaning unhandled vm backups, don't delete volume backups [puppet] - https://gerrit.wikimedia.org/r/950174 [14:03:53] (PS21) David Caro: replica_cnf_api: add envvars backend [puppet] - https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [14:04:20] RECOVERY - PyBal IPVS diff check on lvs3010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:04:56] (ConfdResourceFailed) firing: (144) confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors -
https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [14:05:35] (PS5) David Caro: dns::dotls: expose and gather haproxy internal metrics [puppet] - https://gerrit.wikimedia.org/r/948087 (https://phabricator.wikimedia.org/T343885) [14:05:37] (CR) David Caro: dns::dotls: expose and gather haproxy internal metrics (1 comment) [puppet] - https://gerrit.wikimedia.org/r/948087 (https://phabricator.wikimedia.org/T343885) (owner: David Caro) [14:05:50] PROBLEM - PyBal backends health check on lvs3010 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb_80: Servers cp3077.esams.wmnet, cp3075.esams.wmnet, cp3079.esams.wmnet, cp3081.esams.wmnet are marked down but pooled: testlb_80: Servers cp3067.esams.wmnet, cp3073.esams.wmnet, cp3069.esams.wmnet, cp3071.esams.wmnet are marked down but pooled: testlb_443: Servers cp3067.esams.wmnet, cp3073.esams.wmnet, cp3069.esams.wmnet, cp3071.esams.wmnet [14:05:50] ked down but pooled: uploadlb_443: Servers cp3077.esams.wmnet, cp3075.esams.wmnet, cp3079.esams.wmnet, cp3081.esams.wmnet are marked down but pooled: textlb_80: Servers cp3067.esams.wmnet, cp3073.esams.wmnet, cp3069.esams.wmnet, cp3071.esams.wmnet are marked down but pooled: textlb_443: Servers cp3067.esams.wmnet, cp3073.esams.wmnet, cp3069.esams.wmnet, cp3071.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBa [14:05:56] RECOVERY - PyBal IPVS diff check on lvs3009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:06:12] PROBLEM - PyBal backends health check on lvs3008 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_80: Servers cp3067.esams.wmnet, cp3073.esams.wmnet, cp3069.esams.wmnet, cp3071.esams.wmnet are marked down but pooled: testlb_443: Servers cp3067.esams.wmnet, cp3073.esams.wmnet, cp3069.esams.wmnet, cp3071.esams.wmnet are marked down but pooled: textlb_443:
Servers cp3067.esams.wmnet, cp3073.esams.wmnet, cp3069.esams.wmnet, cp3071.esams.wmnet [14:06:12] ed down but pooled: textlb_80: Servers cp3067.esams.wmnet, cp3073.esams.wmnet, cp3069.esams.wmnet, cp3071.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:06:38] (CR) CI reject: [V: -1] replica_cnf_api: add envvars backend [puppet] - https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: David Caro) [14:07:18] RECOVERY - PyBal IPVS diff check on lvs3008 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:09:55] (ConfdResourceFailed) firing: (136) confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [14:10:15] (PS1) Herron: thanos-fe: switch to cfssl [puppet] - https://gerrit.wikimedia.org/r/950072 (https://phabricator.wikimedia.org/T343987) [14:11:40] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:12:41] (PS11) JMeybohm: k8s: Reserve system resources on k8s workers [puppet] - https://gerrit.wikimedia.org/r/949843 (https://phabricator.wikimedia.org/T277876) [14:12:45] (CR) Herron: [C: +2] "proceeding with a partial deploy to codfw (puppet agents disabled in eqiad) for tegola debugging" [puppet] - https://gerrit.wikimedia.org/r/950072 (https://phabricator.wikimedia.org/T343987) (owner: Herron) [14:12:57] (PS22) David Caro: replica_cnf_api: add envvars backend [puppet] - https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [14:14:39] (CR) JMeybohm: [V: +1] "PCC
SUCCESS (CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42926/console" [puppet] - https://gerrit.wikimedia.org/r/949843 (https://phabricator.wikimedia.org/T277876) (owner: JMeybohm) [14:15:21] (PS1) Zabe: noc: Disclose langlist-labs to noc.wm.o [mediawiki-config] - https://gerrit.wikimedia.org/r/950175 [14:15:55] (CR) CI reject: [V: -1] replica_cnf_api: add envvars backend [puppet] - https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: David Caro) [14:15:59] (PS23) David Caro: replica_cnf_api: add envvars backend [puppet] - https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [14:16:28] (PS1) Ssingh: Repool esams after knams migration (merge on Monday Aug 21) [dns] - https://gerrit.wikimedia.org/r/950176 (https://phabricator.wikimedia.org/T329219) [14:16:34] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:16:40] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:17:13] (CR) Ssingh: "Depends-On added. Merge I64a2d9a86028f4f2b98265d94666f11d59666f71 before merging this."
[dns] - https://gerrit.wikimedia.org/r/950176 (https://phabricator.wikimedia.org/T329219) (owner: Ssingh) [14:18:36] (CR) CI reject: [V: -1] replica_cnf_api: add envvars backend [puppet] - https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: David Caro) [14:21:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:22:24] (PS1) Clément Goubert: mediawiki: Reduce memory request [deployment-charts] - https://gerrit.wikimedia.org/r/950177 (https://phabricator.wikimedia.org/T342748) [14:24:14] !log jiji@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw [14:25:42] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:25:57] (PS2) Clément Goubert: mediawiki: Reduce memory request [deployment-charts] - https://gerrit.wikimedia.org/r/950177 (https://phabricator.wikimedia.org/T342748) [14:29:37] !log jiji@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw [14:30:32] (CR) Clément Goubert: [C: +1] k8s: Reserve system resources on k8s workers [puppet] - https://gerrit.wikimedia.org/r/949843 (https://phabricator.wikimedia.org/T277876) (owner: JMeybohm) [14:30:52] PROBLEM - Maps HTTPS on maps2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:30:58] PROBLEM - Maps HTTPS on maps1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[14:31:38] (CR) Zabe: [C: +1] SiteMatrix config: Add actual (non-deprecated) language code for deprecated language codes [mediawiki-config] - https://gerrit.wikimedia.org/r/884494 (https://phabricator.wikimedia.org/T172035) (owner: Winston Sung) [14:32:02] PROBLEM - Maps HTTPS on maps2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:32:02] PROBLEM - Maps HTTPS on maps2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:32:02] PROBLEM - Maps HTTPS on maps2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:32:04] PROBLEM - Maps HTTPS on maps2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:37:34] RECOVERY - Maps HTTPS on maps2006 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.358 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:37:38] RECOVERY - Maps HTTPS on maps2005 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 4.900 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:37:50] RECOVERY - Maps HTTPS on maps2008 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.328 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:37:56] RECOVERY - Maps HTTPS on maps1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1342 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:38:12] (PS24) David Caro: replica_cnf_api: add envvars backend [puppet] - https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [14:38:14] (CR) David Caro: replica_cnf_api: add envvars backend (2 comments) [puppet] - https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: David Caro) [14:38:58] RECOVERY - Maps HTTPS on maps2007 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.171 second
response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:39:02] RECOVERY - Maps HTTPS on maps2010 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:40:44] (CR) CI reject: [V: -1] replica_cnf_api: add envvars backend [puppet] - https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: David Caro) [14:42:51] (PS25) David Caro: replica_cnf_api: add envvars backend [puppet] - https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [14:43:02] (CR) Andrew Bogott: [C: +2] wmcs-backups: When cleaning unhandled vm backups, don't delete volume backups [puppet] - https://gerrit.wikimedia.org/r/950174 (owner: Andrew Bogott) [14:45:21] (CR) CI reject: [V: -1] replica_cnf_api: add envvars backend [puppet] - https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: David Caro) [14:47:40] SRE, SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (mpopov) Lovely! Thanks so much @Clement_Goubert! @OSefu-WMF: I recommend using the SSH config template provided at: https://w...
[14:56:54] (CR) David Caro: "recheck" [puppet] - https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: David Caro) [14:58:22] (PS1) Phuedx: Disable EchoMail and EchoInteraction instruments [mediawiki-config] - https://gerrit.wikimedia.org/r/950180 (https://phabricator.wikimedia.org/T344167) [14:58:30] (CR) David Caro: "There's something weird going on with the toolforge-weld install, it seems to be pulling the wrong commit, will have to wait anyhow for th" [puppet] - https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: David Caro) [15:02:50] (CR) Stevemunene: [C: +1] Update the default user role in Superset to be 'WMF Analyst' [puppet] - https://gerrit.wikimedia.org/r/950157 (https://phabricator.wikimedia.org/T328457) (owner: Btullis) [15:08:34] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder) [15:09:52] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync [15:10:19] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync [15:13:58] SRE, Content-Transform-Team-WIP, Mobile-Content-Service, RESTbase Sunsetting, and 3 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (MSantos) @Kelson the allowed list policy will end by the end of September. I was assuming you already got access to it, is this...
[15:16:14] (PS3) Jbond: httpyaml: replace URI.escape [puppet] - https://gerrit.wikimedia.org/r/919291 (https://phabricator.wikimedia.org/T330490) [15:23:06] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:24:00] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T344353 (Jhancock.wm) a:Jhancock.wm refer to https://phabricator.wikimedia.org/T344110 [15:26:02] PROBLEM - BGP status on asw1-by27-esams.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:27:16] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_esams_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:28:17] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:28:50] !log ipvsadm -Dt IPs in 91.198.174.0/24 IPs from A:lvs and A:esams [15:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:34] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding new host kubernetes2025 to CODFW - jhancock@cumin2002" [15:31:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding new host kubernetes2025 to CODFW - jhancock@cumin2002" [15:31:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:31:41] (PS1) JMeybohm: modules/base: Copy networkpolicy_1.0.0 to networkpolicy_1.0.1 [deployment-charts] - https://gerrit.wikimedia.org/r/950186 (https://phabricator.wikimedia.org/T344177) [15:31:43] (PS1) JMeybohm: modules/base: networkpolicy_1.0.1 Add support for extraRules [deployment-charts] -
https://gerrit.wikimedia.org/r/950187 (https://phabricator.wikimedia.org/T344177) [15:31:45] (PS1) JMeybohm: wikifunctions: Fix networkpolicies [deployment-charts] - https://gerrit.wikimedia.org/r/950188 (https://phabricator.wikimedia.org/T344177) [15:31:47] (PS1) JMeybohm: admin_ng: Disable GlobalNetworkPolicy allow rules for wikifunctions [deployment-charts] - https://gerrit.wikimedia.org/r/950189 (https://phabricator.wikimedia.org/T344177) [15:32:38] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED [15:36:22] PROBLEM - OSPF status on cr1-esams is CRITICAL: OSPFv2: 3/6 UP : OSPFv3: 3/3 UP : 6 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:36:36] PROBLEM - BFD status on cr1-esams is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:36:51] topranks: ^ cr1-esams? [15:37:03] (CR) Herron: prometheus: Add recording rules for istio traffic on k8s (1 comment) [puppet] - https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620) (owner: Klausman) [15:42:07] * topranks looking [15:42:35] thanks!
[15:43:37] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes2025.mgmt.codfw.wmnet with reboot policy FORCED [16:07:53] (PS2) Hnowlan: rest-gateway: add varnish- and trafficserver-side mangling to rest-gateway [deployment-charts] - https://gerrit.wikimedia.org/r/949846 (https://phabricator.wikimedia.org/T344358) [16:21:39] (PS1) Btullis: Retain yarn logs for 60 days and compress with gzip [puppet] - https://gerrit.wikimedia.org/r/950191 (https://phabricator.wikimedia.org/T342923) [16:22:02] (CR) CI reject: [V: -1] Retain yarn logs for 60 days and compress with gzip [puppet] - https://gerrit.wikimedia.org/r/950191 (https://phabricator.wikimedia.org/T342923) (owner: Btullis) [16:22:54] (PS2) Btullis: Retain yarn logs for 60 days and compress with gzip [puppet] - https://gerrit.wikimedia.org/r/950191 (https://phabricator.wikimedia.org/T342923) [16:23:03] (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/950191 (https://phabricator.wikimedia.org/T342923) (owner: Btullis) [16:25:17] (PS3) Hnowlan: rest-gateway: add varnish- and trafficserver-side mangling to rest-gateway [deployment-charts] - https://gerrit.wikimedia.org/r/949846 (https://phabricator.wikimedia.org/T344358) [16:28:50] (PS3) Btullis: Retain yarn logs for 60 days and compress with gzip [puppet] - https://gerrit.wikimedia.org/r/950191 (https://phabricator.wikimedia.org/T342923) [16:29:02] (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/950191 (https://phabricator.wikimedia.org/T342923) (owner: Btullis) [16:30:47] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1010.eqiad.wmnet with OS bullseye [16:36:26] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1010.eqiad.wmnet with OS bullseye [16:36:45] SRE, SRE-Access-Requests: Requesting membership in analytics-privatedata-users group,
sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (OSefu-WMF) Thanks so much all and many thanks for the walkthrough @mpopov! [16:36:50] (PS4) Btullis: Retain yarn logs for 60 days and compress with gzip [puppet] - https://gerrit.wikimedia.org/r/950191 (https://phabricator.wikimedia.org/T342923) [16:37:36] (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/950191 (https://phabricator.wikimedia.org/T342923) (owner: Btullis) [16:38:06] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:39:38] (PS1) Andrew Bogott: cinder backups: exclude some projects with giant but ephemeral volumes [puppet] - https://gerrit.wikimedia.org/r/950193 [16:42:26] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_esams_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:45:04] RECOVERY - Check systemd state on wdqs2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:45:42] (SystemdUnitFailed) resolved: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:49:33] (CR) Andrew Bogott: [C: +2] cinder backups: exclude some projects with giant but ephemeral volumes [puppet] - https://gerrit.wikimedia.org/r/950193 (owner: Andrew Bogott) [16:51:35] !log bking@cumin1001 START - Cookbook sre.hosts.decommission for hosts flink-zk2002.codfw.wmnet [16:56:19] !log bking@cumin1001 START - Cookbook sre.dns.netbox [17:06:30] SRE, Infrastructure-Foundations, Traffic: LVS servers
using autoconf SLAAC IPv6 addresses - https://phabricator.wikimedia.org/T336505 (cmooney) I note also this means there are multiple default routes in place for LVS nodes, as they accept RAs from connected switches. This makes return IPv6 packets... [17:07:58] (PS1) Herron: Revert "thanos-fe: switch to cfssl" [puppet] - https://gerrit.wikimedia.org/r/950073 (https://phabricator.wikimedia.org/T343987) [17:09:08] (CR) Herron: [C: +2] Revert "thanos-fe: switch to cfssl" [puppet] - https://gerrit.wikimedia.org/r/950073 (https://phabricator.wikimedia.org/T343987) (owner: Herron) [17:09:21] (PS1) Btullis: Grant analytics-admins rights to run some git cmds as analytics-deploy [puppet] - https://gerrit.wikimedia.org/r/950194 (https://phabricator.wikimedia.org/T334493) [17:10:47] (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/950194 (https://phabricator.wikimedia.org/T334493) (owner: Btullis) [17:11:17] (PS2) Btullis: Grant analytics-admins rights to run some git cmds as analytics-deploy [puppet] - https://gerrit.wikimedia.org/r/950194 (https://phabricator.wikimedia.org/T334493) [17:11:27] (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/950194 (https://phabricator.wikimedia.org/T334493) (owner: Btullis) [17:12:02] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: flink-zk2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin1001" [17:13:45] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: flink-zk2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin1001" [17:13:45] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:13:45] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.decommission
(exit_code=0) for hosts flink-zk2002.codfw.wmnet [17:18:10] !log bking@cumin1001 temporarily enabling alerts for flink-zk hosts to see if they work T341792 [17:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:14] T341792: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 [17:18:49] !log bking@cumin1001 START - Cookbook sre.hosts.remove-downtime for flink-zk[2001,2003].codfw.wmnet,flink-zk[1001-1003].eqiad.wmnet [17:18:51] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for flink-zk[2001,2003].codfw.wmnet,flink-zk[1001-1003].eqiad.wmnet [17:25:06] !log bking@ganeti1024 shutting off flink-zk1001 to check alerting T341792 [17:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:11] T341792: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 [17:26:34] PROBLEM - Host flink-zk1001 is DOWN: PING CRITICAL - Packet loss = 100% [17:31:48] RECOVERY - Host flink-zk1001 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [17:32:48] RECOVERY - PyBal backends health check on lvs3008 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:32:59] :D [17:33:01] topranks: ^ [17:33:24] haha nice [17:33:29] I feared another alert :P [17:37:52] !log bouncing OSPF on cr1-esams to attempt to resolve BFD/OSPF glitch [17:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:54] !log reboot LVSes in esams to flush broken IPv6 routes [17:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:11] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on lvs[3008-3009].esams.wmnet with reason: rebooting to flush broken IPv6 routes [17:40:12] PROBLEM - Host asw1-bw27-esams is DOWN: PING CRITICAL - Packet loss = 100% [17:40:12] PROBLEM - Host asw1-by27-esams is DOWN: PING CRITICAL - 
Packet loss = 100% [17:40:18] oh?? [17:40:18] PROBLEM - Host ncredir3004 is DOWN: PING CRITICAL - Packet loss = 100% [17:40:22] RECOVERY - BFD status on cr2-eqiad is OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:40:27] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lvs[3008-3009].esams.wmnet with reason: rebooting to flush broken IPv6 routes [17:40:28] PROBLEM - Host prometheus3003 is DOWN: PING CRITICAL - Packet loss = 100% [17:40:36] PROBLEM - Host ncredir3003 is DOWN: PING CRITICAL - Packet loss = 100% [17:40:36] topranks: ^ [17:40:40] PROBLEM - Host netflow3003 is DOWN: PING CRITICAL - Packet loss = 100% [17:40:55] yeah expect more of that [17:40:59] * topranks on it [17:41:04] ok thank you [17:41:09] I will hold off on the reboots just in case then [17:41:39] PROBLEM - Host cr1-esams is DOWN: PING CRITICAL - Packet loss = 100% [17:42:18] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:42:24] PROBLEM - HTTP on install3003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Install_servers [17:42:32] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3066 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:42:38] RECOVERY - Host ncredir3004 is UP: PING WARNING - Packet loss = 33%, RTA = 79.96 ms [17:42:40] RECOVERY - Host asw1-by27-esams is UP: PING OK - Packet loss = 0%, RTA = 87.38 ms [17:42:40] RECOVERY - Host asw1-bw27-esams is UP: PING OK - Packet loss = 0%, RTA = 88.35 ms [17:42:43] RECOVERY - Host cr1-esams is UP: PING OK - Packet loss = 0%, RTA = 80.41 ms [17:42:46] PROBLEM - BFD status on cr1-esams is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:42:48] RECOVERY - Host 
netflow3003 is UP: PING OK - Packet loss = 0%, RTA = 78.94 ms [17:42:50] RECOVERY - Host prometheus3003 is UP: PING OK - Packet loss = 0%, RTA = 80.32 ms [17:42:54] RECOVERY - Host ncredir3003 is UP: PING OK - Packet loss = 0%, RTA = 79.95 ms [17:43:24] PROBLEM - PyBal backends health check on lvs3010 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb6_80: Servers cp3067.esams.wmnet, cp3071.esams.wmnet are marked down but pooled: testlb6_80: Servers cp3067.esams.wmnet, cp3071.esams.wmnet are marked down but pooled: uploadlb6_80: Servers cp3075.esams.wmnet, cp3079.esams.wmnet are marked down but pooled: uploadlb6_443: Servers cp3079.esams.wmnet, cp3081.esams.wmnet are marked down but pooled: [17:43:24] _443: Servers cp3073.esams.wmnet, cp3069.esams.wmnet are marked down but pooled: textlb6_443: Servers cp3069.esams.wmnet, cp3071.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:43:32] PROBLEM - OSPF status on cr1-esams is CRITICAL: OSPFv2: 3/6 UP : OSPFv3: 3/3 UP : 6 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:43:42] RECOVERY - HTTP on install3003 is OK: HTTP OK: HTTP/1.1 200 OK - 244 bytes in 0.166 second response time https://wikitech.wikimedia.org/wiki/Install_servers [17:43:46] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:43:50] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3066 is OK: HTTP OK: HTTP/1.1 200 Ok - 46738 bytes in 0.340 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:46:06] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:49:57] !log bking@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1010'] [17:50:11] !log bking@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1010'] [17:54:45] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs3010.esams.wmnet [17:58:36] PROBLEM - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:59:03] yeah [18:00:06] RECOVERY - BGP status on asw1-bw27-esams.mgmt is OK: BGP OK - up: 9, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:00:40] RECOVERY - PyBal backends health check on lvs3010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:00:45] nice [18:01:53] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs3010.esams.wmnet [18:02:34] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs3009.esams.wmnet [18:08:10] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
lvs3009.esams.wmnet [18:09:43] !log sukhe@cumin2002 START - Cookbook sre.hosts.remove-downtime for lvs[3008-3009].esams.wmnet [18:09:45] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs[3008-3009].esams.wmnet [18:10:06] (ConfdResourceFailed) firing: (120) confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [18:17:42] RECOVERY - Check for snapshots leaked by cinder backup agent on cloudcontrol1005 is OK: 2 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [18:17:54] RECOVERY - Check for snapshots leaked by cinder backup agent on cloudcontrol1007 is OK: 0 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [18:18:48] RECOVERY - Check for snapshots leaked by cinder backup agent on cloudcontrol1006 is OK: 0 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [18:33:56] SRE-Access-Requests, DBA: mariadb: grant user 'phstats' additional select on phabricator_repository DB - https://phabricator.wikimedia.org/T344513 (Aklapper) Open→Stalled p:Triage→Low [18:40:51] (CR) Xcollazo: Retain yarn logs for 60 days and compress with gzip (1 comment) [puppet] - https://gerrit.wikimedia.org/r/950191 (https://phabricator.wikimedia.org/T342923) (owner: Btullis) [19:04:56] SRE, SRE-OnFire, Incident Tooling, Sustainability (Incident Followup): Grant slightly broader access to Klaxon - https://phabricator.wikimedia.org/T343377 (RLazarus) Only two blockers were raised at the August 7 SRE meeting: * **Training/docs:** We should make sure that
anyone who can use Klaxon... [19:08:45] SRE, SRE-OnFire, Incident Tooling, Sustainability (Incident Followup): Grant slightly broader access to Klaxon - https://phabricator.wikimedia.org/T343377 (jhathaway) One issue that I raised, but perhaps was not captured anywhere is adding some guidance to the documentation on how the folks being... [19:11:17] ops-knams, Documentation: Update on-wiki documentation about esams - https://phabricator.wikimedia.org/T344129 (Aklapper) [19:14:00] SRE, SRE-OnFire, Incident Tooling, Sustainability (Incident Followup): Grant slightly broader access to Klaxon - https://phabricator.wikimedia.org/T343377 (taavi) Is keeping an LDAP group up-to-date with all the stewards something the new IDM could possibly do in the future? [19:24:54] SRE, serviceops, MediaWiki-Platform-Team (Radar): Evaluate using igbinary for MW php-apcu at WMF - https://phabricator.wikimedia.org/T225074 (Krinkle) [19:25:05] SRE, serviceops, Performance-Team (Radar): Evaluate using igbinary for MW php-apcu at WMF - https://phabricator.wikimedia.org/T225074 (Krinkle) [19:28:51] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (Jhancock.wm) [19:30:33] SRE, SRE-Access-Requests, DBA: mariadb: grant user 'phstats' additional select on phabricator_repository DB - https://phabricator.wikimedia.org/T344513 (Marostegui) Is this needed? Asking because I see you stalled it. [19:31:40] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:32:23] SRE, SRE-Access-Requests, DBA: mariadb: grant user 'phstats' additional select on phabricator_repository DB - https://phabricator.wikimedia.org/T344513 (Aklapper) I just need to write the patch, then I'll pass it to you(s).
:) [19:32:36] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:33:36] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:36:41] SRE, SRE-swift-storage, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (Jhancock.wm) [19:39:13] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (Jhancock.wm) [19:39:14] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:39:40] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.290 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:40:16] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.071 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:38:58] SRE, SRE-OnFire, Incident Tooling, Sustainability (Incident Followup): Grant slightly broader access to Klaxon - https://phabricator.wikimedia.org/T343377 (RLazarus) >>! In T343377#9102295, @jhathaway wrote: > One issue that I raised, but perhaps was not captured anywhere is adding some guidance... [20:45:36] SRE, SRE-OnFire, Incident Tooling, Sustainability (Incident Followup): Grant slightly broader access to Klaxon - https://phabricator.wikimedia.org/T343377 (jhathaway) >>! In T343377#9102709, @RLazarus wrote: >>>! In T343377#9102295, @jhathaway wrote: >> One issue that I raised, but perhaps was no...
[21:06:40] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:07:06] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:08:26] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 4.472 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:09:20] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.295 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:11:36] SRE, SRE-OnFire, Incident Tooling, Sustainability (Incident Followup): Grant slightly broader access to Klaxon - https://phabricator.wikimedia.org/T343377 (RLazarus) I'd want @CDanis to weigh in on that, since it's really a Klaxon design decision, but personally I don't think a required field is... [21:20:54] PROBLEM - OSPF status on cr1-esams is CRITICAL: OSPFv2: 3/6 UP : OSPFv3: 3/3 UP : 6 v2 P2P interfaces vs.
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:22:36] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 78, down: 20, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:28:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:33:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:10:06] (ConfdResourceFailed) firing: (120) confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [22:27:40] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (odimitrijevic) I approve [22:38:06] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:42:18] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_esams_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state