[00:04:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[00:05:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[00:09:44] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:10:56] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:16:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[00:20:14] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:21:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[00:22:34] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:38:46] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/999014
[00:38:52] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/999014 (owner: TrainBranchBot)
[00:46:38] (PS1) Jdlrobson: color-link-visited was not defined [skins/MinervaNeue] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/998974 (https://phabricator.wikimedia.org/T356928)
[00:47:44] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:03:03] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/999014 (owner: TrainBranchBot)
[01:22:28] (SystemdUnitFailed) firing: (2) update-tails-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:35:51] (ProbeDown) firing: Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://wikitech.wikimedia.org/wiki/Debian_Packaging#Upload_to_Wikimedia_Repo - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:45:20] SRE, SRE-swift-storage, Data-Persistence, Thumbor, and 2 others: Changing default image thumbnail size on English Wikipedia - https://phabricator.wikimedia.org/T355914 (SnowFire) At risk of scope creep... while we're here, it'd be nice to also re-examine gallery sizes.
See this conversation on...
[01:50:22] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:51:34] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:04:35] (PS1) DDesouza: miscweb(research-landing-page): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/999180 (https://phabricator.wikimedia.org/T352583)
[02:06:48] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:07:58] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:11:26] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:12:36] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:16:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[02:21:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[02:22:28] (SystemdUnitFailed) firing: (2) update-tails-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:35:41] SRE, SRE-swift-storage, Data-Persistence, Thumbor, and 2 others: Changing default image thumbnail size on English Wikipedia - https://phabricator.wikimedia.org/T355914 (Base) As a random rant, perhaps it is my myopia speaking, but I would really love to have bigger default sizes across Wikipedias...
[02:39:33] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:45:23] (CR) Dzahn: [C: +2] miscweb(research-landing-page): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/999180 (https://phabricator.wikimedia.org/T352583) (owner: DDesouza)
[02:46:30] (Merged) jenkins-bot: miscweb(research-landing-page): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/999180 (https://phabricator.wikimedia.org/T352583) (owner: DDesouza)
[02:59:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[02:59:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[02:59:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[03:00:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[03:00:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T352010)', diff saved to https://phabricator.wikimedia.org/P56572 and previous config saved to /var/cache/conftool/dbconfig/20240209-030028-ladsgroup.json
[03:00:47] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[03:09:33] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:21:27]
(PS1) Andrew Bogott: Add wmcs-empty-rbd-trash script [puppet] - https://gerrit.wikimedia.org/r/999218 (https://phabricator.wikimedia.org/T356904)
[03:22:46] (CR) CI reject: [V: -1] Add wmcs-empty-rbd-trash script [puppet] - https://gerrit.wikimedia.org/r/999218 (https://phabricator.wikimedia.org/T356904) (owner: Andrew Bogott)
[03:23:45] (PS2) Andrew Bogott: Add wmcs-empty-rbd-trash script [puppet] - https://gerrit.wikimedia.org/r/999218 (https://phabricator.wikimedia.org/T356904)
[03:25:01] (CR) CI reject: [V: -1] Add wmcs-empty-rbd-trash script [puppet] - https://gerrit.wikimedia.org/r/999218 (https://phabricator.wikimedia.org/T356904) (owner: Andrew Bogott)
[03:26:47] (PS3) Andrew Bogott: Add wmcs-empty-rbd-trash script [puppet] - https://gerrit.wikimedia.org/r/999218 (https://phabricator.wikimedia.org/T356904)
[04:16:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:21:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:27:00] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:31:45] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:49:50] !log ladsgroup@cumin1001 START -
Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[04:50:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[05:35:51] (ProbeDown) firing: Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://wikitech.wikimedia.org/wiki/Debian_Packaging#Upload_to_Wikimedia_Repo - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:48:07] !log dbmaint Schema change on s7@codfw T357067
[05:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:11] T357067: Update default values in globalblocks table - https://phabricator.wikimedia.org/T357067
[05:52:57] (PS1) Marostegui: es2030: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/999327
[05:57:32] (CR) Marostegui: [C: +2] es2030: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/999327 (owner: Marostegui)
[06:06:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T352010)', diff saved to https://phabricator.wikimedia.org/P56573 and previous config saved to /var/cache/conftool/dbconfig/20240209-060605-ladsgroup.json
[06:06:10] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[06:16:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:21:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P56574 and
previous config saved to /var/cache/conftool/dbconfig/20240209-062111-ladsgroup.json
[06:21:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:22:29] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:28:09] (PS1) Marostegui: mariadb: Decommission db1124 [puppet] - https://gerrit.wikimedia.org/r/999346 (https://phabricator.wikimedia.org/T334388)
[06:28:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:29:04] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts db1124.eqiad.wmnet
[06:33:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:34:57] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[06:36:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P56575 and previous config saved to /var/cache/conftool/dbconfig/20240209-063618-ladsgroup.json
[06:36:29] (CR) Marostegui: [C: +2] mariadb: Decommission db1124 [puppet] - https://gerrit.wikimedia.org/r/999346 (https://phabricator.wikimedia.org/T334388) (owner: Marostegui)
[06:36:59] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox:
db1124.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[06:38:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1124.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[06:38:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:38:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1124.eqiad.wmnet
[06:39:51] ops-eqiad, DBA, decommission-hardware: decommission db1124.eqiad.wmnet - https://phabricator.wikimedia.org/T334388 (Marostegui) This is ready for #dc-ops
[06:39:54] ops-eqiad, DBA, decommission-hardware: decommission db1124.eqiad.wmnet - https://phabricator.wikimedia.org/T334388 (Marostegui) a:Marostegui→None
[06:43:20] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:45:40] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:51:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T352010)', diff saved to https://phabricator.wikimedia.org/P56576 and previous config saved to /var/cache/conftool/dbconfig/20240209-065125-ladsgroup.json
[06:51:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance
[06:51:40] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[06:51:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance
[06:51:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1185 (T352010)', diff saved to https://phabricator.wikimedia.org/P56577 and previous config saved to /var/cache/conftool/dbconfig/20240209-065147-ladsgroup.json
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240209T0700)
[07:04:21] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:06:53] (PS1) Giuseppe Lavagetto: mobileapps:update modules [deployment-charts] - https://gerrit.wikimedia.org/r/999376
[07:08:55] (CR) Giuseppe Lavagetto: [C: +2] benthos: upgrade to base.meta:2.0 [deployment-charts] - https://gerrit.wikimedia.org/r/998852 (owner: Giuseppe Lavagetto)
[07:09:21] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production -
https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:09:46] (Merged) jenkins-bot: benthos: upgrade to base.meta:2.0 [deployment-charts] - https://gerrit.wikimedia.org/r/998852 (owner: Giuseppe Lavagetto)
[07:29:59] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:31:09] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:39:09] (CR) Filippo Giunchedi: [C: +1] Onboard the data-platform-sre team to Alertmanager [puppet] - https://gerrit.wikimedia.org/r/989900 (https://phabricator.wikimedia.org/T342578) (owner: Bking)
[07:41:24] (CR) Filippo Giunchedi: [C: +1] "Good to go" [puppet] - https://gerrit.wikimedia.org/r/994735 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240209T0800)
[08:00:31] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[08:07:51] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Puppet (Puppet 7.0): Decommission puppetmaster1002 - https://phabricator.wikimedia.org/T357093 (MoritzMuehlenhoff)
[08:08:28] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Puppet (Puppet 7.0): Decommission puppetmaster1002 - https://phabricator.wikimedia.org/T357093 (MoritzMuehlenhoff) p:Triage→Medium a:MoritzMuehlenhoff
[08:13:48] (PS1) Muehlenhoff: Remove puppetmaster2003 from puppetdb config [puppet] - https://gerrit.wikimedia.org/r/999463 (https://phabricator.wikimedia.org/T356991)
[08:15:25] (CR) Muehlenhoff: [C: +2] Remove puppetmaster2003 from puppetdb config [puppet] - https://gerrit.wikimedia.org/r/999463 (https://phabricator.wikimedia.org/T356991) (owner: Muehlenhoff)
[08:16:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:21:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:27:00] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:29:54] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts
puppetmaster2003.codfw.wmnet
[08:31:45] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:35:42] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:37:23] (PS1) Arnaudb: mariadb: revert db2194 [puppet] - https://gerrit.wikimedia.org/r/999015 (https://phabricator.wikimedia.org/T343674)
[08:37:59] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetmaster2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:39:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetmaster2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:39:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:39:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts puppetmaster2003.codfw.wmnet
[08:39:27] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Puppet (Puppet 7.0): Repurpose puppetmaster2003 as puppetserver2003 - https://phabricator.wikimedia.org/T356991 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `puppetmaster2003.codfw.wmnet` - puppe...
[08:39:56] (CR) Marostegui: [C: +1] mariadb: revert db2194 [puppet] - https://gerrit.wikimedia.org/r/999015 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb)
[08:45:21] SRE, SRE-Access-Requests, Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (Jelto) In progress→Resolved a:Jelto Great! I'll resolve this task, all access should be available again.
Feel free to reopen the ticket if...
[08:47:16] ops-codfw: Relabel puppetmaster2003 - https://phabricator.wikimedia.org/T357096 (MoritzMuehlenhoff)
[08:52:11] (CR) Arnaudb: [C: +2] mariadb: revert db2194 [puppet] - https://gerrit.wikimedia.org/r/999015 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb)
[08:56:12] (CR) Filippo Giunchedi: "Thank you Jesse, I agree re: not the best way, going forward perhaps we can devise a solution to create tasks instead. Simon, I believe th" [puppet] - https://gerrit.wikimedia.org/r/987431 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[08:57:20] (CR) Filippo Giunchedi: "Cleaning up review queue, add me again when the time comes!" [puppet] - https://gerrit.wikimedia.org/r/982086 (https://phabricator.wikimedia.org/T349626) (owner: LSobanski)
[08:58:08] (PS1) Slyngshede: P:package_builder: Limit TCP check to IPv4 [puppet] - https://gerrit.wikimedia.org/r/999515
[09:02:27] (CR) Filippo Giunchedi: [C: +1] "Prometheus part LGTM" [puppet] - https://gerrit.wikimedia.org/r/948087 (https://phabricator.wikimedia.org/T343885) (owner: David Caro)
[09:03:24] (CR) Filippo Giunchedi: [C: +1] P:package_builder: Limit TCP check to IPv4 [puppet] - https://gerrit.wikimedia.org/r/999515 (owner: Slyngshede)
[09:08:07] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2194.codfw.wmnet with OS bookworm
[09:17:27] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for ElineWMDE - https://phabricator.wikimedia.org/T357097 (ElineWMDE)
[09:28:24] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2194.codfw.wmnet with reason: host reimage
[09:30:16] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for JTanner - https://phabricator.wikimedia.org/T356917 (Jelto)
[09:32:07] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on
db2194.codfw.wmnet with reason: host reimage
[09:33:37] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for JTanner - https://phabricator.wikimedia.org/T356917 (Jelto) Great, thanks for signing the L3. And you are right, no SSH key is needed. Let's wait for a approval from a `analytics-privatedata-users` owner (@odimitrijevic , @WD...
[09:35:52] (ProbeDown) firing: Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://wikitech.wikimedia.org/wiki/Debian_Packaging#Upload_to_Wikimedia_Repo - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:37:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T352010)', diff saved to https://phabricator.wikimedia.org/P56578 and previous config saved to /var/cache/conftool/dbconfig/20240209-093754-ladsgroup.json
[09:38:02] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[09:39:00] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for ElineWMDE - https://phabricator.wikimedia.org/T357097 (Jelto) p:Triage→Medium
[09:44:50] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for ElineWMDE - https://phabricator.wikimedia.org/T357097 (Jelto) Thanks for opening the access request. We need additional approval from a analytics-privatedata-users owner (@odimitrijevic , @WDoranWMF, @Ahoelzl @Milimetric) a...
[09:45:43] SRE, ops-codfw: Degraded RAID on db2194 - https://phabricator.wikimedia.org/T357100 (ops-monitoring-bot)
[09:46:26] !log uploaded openjdk-8 8u402-ga-2~deb10u1 for buster-wikimedia (backport of latest Java 8 security updates)
[09:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:30] SRE, ops-codfw, DBA: Degraded RAID on db2194 - https://phabricator.wikimedia.org/T357100 (Peachey88)
[09:49:31] (ProbeDown) resolved: Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://wikitech.wikimedia.org/wiki/Debian_Packaging#Upload_to_Wikimedia_Repo - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:50:13] SRE, ops-codfw, DBA: Degraded RAID on db2194 - https://phabricator.wikimedia.org/T357100 (ABran-WMF) Open→Declined p:Triage→Medium
[09:53:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P56579 and previous config saved to /var/cache/conftool/dbconfig/20240209-095301-ladsgroup.json
[09:54:28] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2194.codfw.wmnet with OS bookworm
[09:54:39] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for ElineWMDE - https://phabricator.wikimedia.org/T357097 (Jelto) Open→In progress
[10:08:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P56580 and previous config saved to /var/cache/conftool/dbconfig/20240209-100808-ladsgroup.json
[10:16:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook -
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:21:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:22:45] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:23:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T352010)', diff saved to https://phabricator.wikimedia.org/P56581 and previous config saved to /var/cache/conftool/dbconfig/20240209-102314-ladsgroup.json
[10:23:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[10:23:20] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[10:23:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[10:23:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1200 (T352010)', diff saved to https://phabricator.wikimedia.org/P56582 and previous config saved to /var/cache/conftool/dbconfig/20240209-102336-ladsgroup.json
[10:24:26] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000).
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:25:32] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:27:00] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:31:45] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:32:00] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:36:45] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:46:30] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (Gehel)
[10:48:13] (PS10) Effie Mouzeli: mcrouter: add vanila chart [deployment-charts] - https://gerrit.wikimedia.org/r/979107 (https://phabricator.wikimedia.org/T346690)
[10:48:25] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Comm Error: Backplane 0 on cloudelastic1008 - https://phabricator.wikimedia.org/T356919 (Gehel)
[10:50:15] (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/999561 (https://phabricator.wikimedia.org/T346438) (owner: Btullis)
[10:54:54] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL -
Certificate lists.wikimedia.org expires in 5 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:56:00] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:02:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Comm Error: Backplane 0 on cloudelastic1008 - https://phabricator.wikimedia.org/T356919 (10BTullis) In case it helps, we saw the same hardware error recently on a server in codfw. T355830#9517443 @Jhancock.wm was able to fix it... [11:23:44] (03CR) 10Clément Goubert: [C: 03+2] eventstreams: Raise memory limit to 1100Mi [deployment-charts] - 10https://gerrit.wikimedia.org/r/998945 (https://phabricator.wikimedia.org/T357005) (owner: 10Clément Goubert) [11:24:38] (03Merged) 10jenkins-bot: eventstreams: Raise memory limit to 1100Mi [deployment-charts] - 10https://gerrit.wikimedia.org/r/998945 (https://phabricator.wikimedia.org/T357005) (owner: 10Clément Goubert) [11:25:34] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply [11:26:10] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [11:30:03] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [11:30:22] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [11:31:24] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [11:31:42] (03CR) 10Filippo Giunchedi: [C: 03+1] Change all role contacts for Data Engineering -> Data Platform [puppet] - 10https://gerrit.wikimedia.org/r/999561 (https://phabricator.wikimedia.org/T346438) (owner: 10Btullis) [11:32:06] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [11:32:42] PROBLEM - mailman list info ssl expiry on 
lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:35:02] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:36:25] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ElineWMDE - https://phabricator.wikimedia.org/T357097 (10Virginie.caplet) In my quality of approving party (Head of UX, manager of @ElineWMDE) I approve! :) [11:39:24] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams: apply [11:39:39] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [11:40:43] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams: apply [11:41:20] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [11:41:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [11:42:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [11:42:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T352010)', diff saved to https://phabricator.wikimedia.org/P56583 and previous config saved to /var/cache/conftool/dbconfig/20240209-114208-ladsgroup.json [11:42:13] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [11:51:41] (03CR) 10Kevin Bazira: [C: 03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/999570 (https://phabricator.wikimedia.org/T347551) (owner: 10Ilias Sarantopoulos) [11:57:07] (03PS1) 10STran: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/999638 (https://phabricator.wikimedia.org/T356736) [12:04:12] !log mvernon@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ms-be1044.eqiad.wmnet [12:09:02] (03CR) 10Btullis: [C: 03+2] Change all role contacts for Data Engineering -> Data Platform [puppet] - 10https://gerrit.wikimedia.org/r/999561 (https://phabricator.wikimedia.org/T346438) (owner: 10Btullis) [12:14:19] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ms-be1044.eqiad.wmnet [12:15:25] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:16:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:20:25] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:21:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:24:30] 
PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:24:40] !log mvernon@cumin2002 START - Cookbook sre.swift.convert-disks for host ms-be1044 [12:25:25] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:25:38] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:26:38] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host restbase1042.mgmt.eqiad.wmnet with reboot policy FORCED [12:26:39] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host restbase1041.mgmt.eqiad.wmnet with reboot policy FORCED [12:26:41] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host restbase1040.mgmt.eqiad.wmnet with reboot policy FORCED [12:26:43] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host restbase1039.mgmt.eqiad.wmnet with reboot policy FORCED [12:26:44] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host restbase1038.mgmt.eqiad.wmnet with reboot policy FORCED [12:26:45] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:26:46] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host restbase1037.mgmt.eqiad.wmnet with reboot policy FORCED [12:26:59] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host 
restbase1039.mgmt.eqiad.wmnet with reboot policy FORCED [12:27:05] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host restbase1037.mgmt.eqiad.wmnet with reboot policy FORCED [12:27:12] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host restbase1040.mgmt.eqiad.wmnet with reboot policy FORCED [12:27:13] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host restbase1041.mgmt.eqiad.wmnet with reboot policy FORCED [12:28:44] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host restbase1041.mgmt.eqiad.wmnet with reboot policy FORCED [12:30:09] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host restbase1040.mgmt.eqiad.wmnet with reboot policy FORCED [12:30:25] (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:30:44] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host restbase1039.mgmt.eqiad.wmnet with reboot policy FORCED [12:31:22] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host restbase1037.mgmt.eqiad.wmnet with reboot policy FORCED [12:31:42] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host restbase1037.mgmt.eqiad.wmnet with reboot policy FORCED [12:31:45] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:31:50] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/999638 (https://phabricator.wikimedia.org/T356736) (owner: 10STran) [12:32:24] (03CR) 10Ilias Sarantopoulos: [C: 03+2] 
ml-services: update multilingual revertrisk image [deployment-charts] - 10https://gerrit.wikimedia.org/r/999570 (https://phabricator.wikimedia.org/T347551) (owner: 10Ilias Sarantopoulos) [12:32:39] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host restbase1037.mgmt.eqiad.wmnet with reboot policy FORCED [12:32:47] (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/999638 (https://phabricator.wikimedia.org/T356736) (owner: 10STran) [12:33:20] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:33:26] (03Merged) 10jenkins-bot: ml-services: update multilingual revertrisk image [deployment-charts] - 10https://gerrit.wikimedia.org/r/999570 (https://phabricator.wikimedia.org/T347551) (owner: 10Ilias Sarantopoulos) [12:35:26] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host restbase1035.mgmt.eqiad.wmnet with reboot policy FORCED [12:35:35] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host restbase1034.mgmt.eqiad.wmnet with reboot policy FORCED [12:35:38] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetmaster2003 rename - jmm@cumin2002" [12:36:00] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:36:24] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:36:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetmaster2003 rename - jmm@cumin2002" [12:36:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:38:30] (03CR) 10Stevemunene: [C: 03+1] service: register superset and superset-next under ingress [puppet] - 10https://gerrit.wikimedia.org/r/997857 (https://phabricator.wikimedia.org/T356483) (owner: 10Brouberol) [12:39:04] (03CR) 10Alexandros Kosiaris: [C: 03+1] network: rename labs_networks as cloud_networks [puppet] - 10https://gerrit.wikimedia.org/r/998445 (owner: 10Majavah) [12:39:28] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:39:56] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:41:56] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor, and 2 others: Changing default image thumbnail size on English Wikipedia - https://phabricator.wikimedia.org/T355914 (10Ladsgroup) >>! In T355914#9527616, @SnowFire wrote: > At risk of scope creep... while we're here, it'd be nice to also re-exami... 
[12:42:02] (PS1) Muehlenhoff: Add puppetserver2003 to site.pp [puppet] - https://gerrit.wikimedia.org/r/999695 (https://phabricator.wikimedia.org/T356991)
[12:43:04] (PS1) Muehlenhoff: Remove site.pp entry for old Puppetboard hosts [puppet] - https://gerrit.wikimedia.org/r/999696 (https://phabricator.wikimedia.org/T347286)
[12:43:14] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.swift.convert-disks (exit_code=99) for host ms-be1044
[12:44:13] (CR) Stevemunene: [C: +1] superset: setup dyna mapping rules [puppet] - https://gerrit.wikimedia.org/r/997858 (https://phabricator.wikimedia.org/T356481) (owner: Brouberol)
[12:44:33] (PS1) Clément Goubert: linkrecommendation-internal: Raise memory requests and limits [deployment-charts] - https://gerrit.wikimedia.org/r/999698 (https://phabricator.wikimedia.org/T357122)
[12:44:55] !log stran@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply
[12:44:58] (CR) Majavah: [V: +1 C: +2] network: rename labs_networks as cloud_networks [puppet] - https://gerrit.wikimedia.org/r/998445 (owner: Majavah)
[12:45:02] !log stran@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[12:45:43] !log stran@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply
[12:47:12] !log stran@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[12:47:28] !log stran@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply
[12:47:31] !log stran@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[12:48:03] (CR) Muehlenhoff: [C: +2] Add puppetserver2003 to site.pp [puppet] - https://gerrit.wikimedia.org/r/999695 (https://phabricator.wikimedia.org/T356991) (owner: Muehlenhoff)
[12:48:15] !log stran@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[12:49:07] SRE, SRE-swift-storage, Data-Persistence, Thumbor, and 2 others: Changing default image thumbnail size on English Wikipedia - https://phabricator.wikimedia.org/T355914 (Ladsgroup) >>! In T355914#9527640, @Base wrote: > As a random rant, perhaps it is my myopia speaking, but I would really love to...
[12:49:09] !log stran@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[12:49:21] SRE, SRE-swift-storage, Data-Persistence, Thumbor, and 2 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914 (Ladsgroup)
[12:49:52] !log stran@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply
[12:50:43] !log stran@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply
[12:53:25] (SystemdUnitFailed) firing: ferm.service Failed on mw2357:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:54:46] (CR) Muehlenhoff: [C: +2] Remove site.pp entry for old Puppetboard hosts [puppet] - https://gerrit.wikimedia.org/r/999696 (https://phabricator.wikimedia.org/T347286) (owner: Muehlenhoff)
[12:56:42] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:57:45] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase1042.mgmt.eqiad.wmnet with reboot policy FORCED
[12:57:50] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:57:56] (CR) MVernon: "Hi," [cookbooks] - https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) (owner: Jbond)
[12:58:27] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase1039.mgmt.eqiad.wmnet with reboot policy FORCED
[12:58:52] PROBLEM - Check whether ferm is active by checking the default input chain on mw2357 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:59:30] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase1037.mgmt.eqiad.wmnet with reboot policy FORCED
[12:59:50] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase1038.mgmt.eqiad.wmnet with reboot policy FORCED
[13:00:02] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase1034.mgmt.eqiad.wmnet with reboot policy FORCED
[13:00:16] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase1041.mgmt.eqiad.wmnet with reboot policy FORCED
[13:00:24] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase1040.mgmt.eqiad.wmnet with reboot policy FORCED
[13:00:47] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase1035.mgmt.eqiad.wmnet with reboot policy FORCED
[13:01:36] SRE, ops-eqiad, DC-Ops, RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (Jclark-ctr)
[13:02:10] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[13:03:12] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:03:25] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-api-int_hourly.service Failed on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:03:30] !log jclark@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase1034']
[13:03:40] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['restbase1034']
[13:05:14] (PS1) Muehlenhoff: puppetdb: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/999715
[13:05:42] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new entries for puppetserver2003 - cmooney@cumin1002"
[13:06:31] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new entries for puppetserver2003 - cmooney@cumin1002"
[13:06:31] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:07:29] !log jclark@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase1034']
[13:07:43] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['restbase1034']
[13:07:53] !log jclark@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase1039']
[13:07:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T352010)', diff saved to https://phabricator.wikimedia.org/P56584 and previous config saved to /var/cache/conftool/dbconfig/20240209-130755-ladsgroup.json
[13:07:58] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['restbase1039']
[13:08:00] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[13:11:35] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host restbase1034.eqiad.wmnet with OS bullseye
[13:11:41] SRE, ops-eqiad, DC-Ops, RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host restbase1034.eqiad.wmnet with OS bullseye
[13:14:25] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host restbase1035.eqiad.wmnet with OS bullseye
[13:14:28] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host restbase1037.eqiad.wmnet with OS bullseye
[13:14:29] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host restbase1038.eqiad.wmnet with OS bullseye
[13:14:33] SRE, ops-eqiad, DC-Ops, RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host restbase1035.eqiad.wmnet with OS bullseye
[13:14:35] SRE, ops-eqiad, DC-Ops, RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host restbase1037.eqiad.wmnet with OS bullseye
[13:14:37] SRE, ops-eqiad, DC-Ops, RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host restbase1038.eqiad.wmnet with OS bullseye
[13:15:09] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host restbase1039.eqiad.wmnet with OS bullseye
[13:15:15] SRE, ops-eqiad, DC-Ops, RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host restbase1039.eqiad.wmnet with OS bullseye
[13:16:53] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host restbase1040.eqiad.wmnet with OS bullseye
[13:16:55] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host restbase1041.eqiad.wmnet with OS bullseye
[13:16:58] SRE, ops-eqiad, DC-Ops, RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host restbase1040.eqiad.wmnet with OS bullseye
[13:17:01] SRE, ops-eqiad, DC-Ops, RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host restbase1041.eqiad.wmnet with OS bullseye
[13:17:10] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host restbase1042.eqiad.wmnet with OS bullseye
[13:17:16] SRE, ops-eqiad, DC-Ops, RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host restbase1042.eqiad.wmnet with OS bullseye
[13:18:25] (SystemdUnitFailed) firing: (3) httpbb_kubernetes_mw-api-int_hourly.service Failed on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:23:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P56585 and previous config saved to /var/cache/conftool/dbconfig/20240209-132302-ladsgroup.json
[13:25:31] !log jmm@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2003.mgmt.codfw.wmnet with reboot policy FORCED
[13:26:50] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1034.eqiad.wmnet with reason: host reimage
[13:29:36] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1038.eqiad.wmnet with reason: host reimage
[13:29:41] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1037.eqiad.wmnet with reason: host reimage
[13:30:32] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1039.eqiad.wmnet with reason: host reimage
[13:31:01] !log enabling BGP peering to NL-IX (new IXP connection) route servers from cr2-esams T322630
[13:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:44] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1034.eqiad.wmnet with reason: host reimage
[13:32:04] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1042.eqiad.wmnet with reason: host reimage
[13:32:08] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1041.eqiad.wmnet with reason: host reimage
[13:32:15] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1040.eqiad.wmnet with reason: host reimage
[13:34:29] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1039.eqiad.wmnet with reason: host reimage
[13:36:42] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1038.eqiad.wmnet with reason: host reimage
[13:37:27] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Comm Error: Backplane 0 on cloudelastic1008 - https://phabricator.wikimedia.org/T356919 (Gehel) p:Triage→High
[13:37:50] PROBLEM - Check whether ferm is active by checking the default input chain on mw2424 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:38:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P56586 and previous config saved to /var/cache/conftool/dbconfig/20240209-133809-ladsgroup.json
[13:38:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host puppetserver2003.mgmt.codfw.wmnet with reboot policy FORCED
[13:39:05] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1042.eqiad.wmnet with reason: host reimage
[13:41:48] (CR) Alexandros Kosiaris: [C: +1] mobileapps:update modules [deployment-charts] - https://gerrit.wikimedia.org/r/999376 (owner: Giuseppe Lavagetto)
[13:42:06] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1037.eqiad.wmnet with reason: host reimage
[13:43:16] (CR) Alexandros Kosiaris: [C: +2] echoserver: update to base.meta:2.0 [deployment-charts] - https://gerrit.wikimedia.org/r/998853 (owner: Giuseppe Lavagetto)
[13:43:24] (PS1) Filippo Giunchedi: sre: have SystemdUnitFailed retry match Icinga's [alerts] - https://gerrit.wikimedia.org/r/999773 (https://phabricator.wikimedia.org/T357028)
[13:44:08] (Merged) jenkins-bot: echoserver: update to base.meta:2.0 [deployment-charts] - https://gerrit.wikimedia.org/r/998853 (owner: Giuseppe Lavagetto)
[13:44:29] (CR) Alexandros Kosiaris: [C: +1] flink-app: update modules to recent versions [deployment-charts] - https://gerrit.wikimedia.org/r/998986 (owner: Giuseppe Lavagetto)
[13:44:45] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1041.eqiad.wmnet with reason: host reimage
[13:45:20] (CR) Alexandros Kosiaris: [C: +1] python-webapp: update module versions [deployment-charts] - https://gerrit.wikimedia.org/r/998988 (owner: Giuseppe Lavagetto)
[13:45:46] (CR) Alexandros Kosiaris: [C: +1] spark-history: fix package.json, update modules [deployment-charts] - https://gerrit.wikimedia.org/r/998989 (owner: Giuseppe Lavagetto)
[13:46:08] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[13:46:38] (CR) Alexandros Kosiaris: [C: +1] ipoid: upgrade to new modules versions [deployment-charts] - https://gerrit.wikimedia.org/r/998987 (owner: Giuseppe Lavagetto)
[13:47:09] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply
[13:47:30] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1040.eqiad.wmnet with reason: host reimage
[13:47:38] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply
[13:48:20] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply
[13:50:44] (PS1) Alexandros Kosiaris: mediawiki: Bump sextant module versions [deployment-charts] - https://gerrit.wikimedia.org/r/999780
[13:51:17] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Puppet (Puppet 7.0): Repurpose puppetmaster2003 as puppetserver2003 - https://phabricator.wikimedia.org/T356991 (Jhancock.wm)
[13:51:19] SRE, ops-codfw: Relabel puppetmaster2003 - https://phabricator.wikimedia.org/T357096 (Jhancock.wm) Open→Resolved a:Jhancock.wm server has been relabeled
[13:53:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T352010)', diff saved to https://phabricator.wikimedia.org/P56587 and previous config saved to /var/cache/conftool/dbconfig/20240209-135315-ladsgroup.json
[13:53:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1210.eqiad.wmnet with reason: Maintenance
[13:53:20] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[13:53:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1210.eqiad.wmnet with reason: Maintenance
[13:53:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1210 (T352010)', diff saved to https://phabricator.wikimedia.org/P56588 and previous config saved to /var/cache/conftool/dbconfig/20240209-135337-ladsgroup.json
[13:55:49] (CR) Jgiannelos: Turn on Parsoid read views by default on labs (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/999060 (https://phabricator.wikimedia.org/T357054) (owner: C. Scott Ananian)
[13:59:15] (CR) Vgutierrez: [C: -1] Add module for ncmonitor (1 comment) [puppet] - https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) (owner: BCornwall)
[14:00:24] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply
[14:00:51] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply
[14:01:48] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE, AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:01:53] SRE, ops-codfw, DBA: Degraded RAID on db2194 - https://phabricator.wikimedia.org/T357015 (Jhancock.wm) @ABran-WMF I need to do some troubleshooting measures before Dell will replace the disk. is it safe for me to power down the server? It shouldn't be long and I can get the request out today.
[14:02:41] (PS1) Slyngshede: Monitoring of PKI infrastructure certs. [alerts] - https://gerrit.wikimedia.org/r/999802 (https://phabricator.wikimedia.org/T350694)
[14:02:47] SRE, ops-codfw, DBA: Degraded RAID on db2194 - https://phabricator.wikimedia.org/T357015 (ABran-WMF) @Jhancock.wm go for it!
[14:03:10] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421 (10Jelto) I think we might have a problem with the newer `nodejs` version and Debian Bullseye. The release notes of etherpad-lite 1.9.5 state: > This version deprecates... [14:03:25] (SystemdUnitFailed) firing: (3) httpbb_kubernetes_mw-api-int_hourly.service Failed on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:03:40] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:04:43] (03CR) 10Brouberol: [C: 03+2] Add superset/superset-next.svc.eqiad.wmnet records [dns] - 10https://gerrit.wikimedia.org/r/995174 (https://phabricator.wikimedia.org/T356481) (owner: 10Brouberol) [14:05:24] (03PS3) 10Brouberol: Add superset/superset-next.svc.eqiad.wmnet records [dns] - 10https://gerrit.wikimedia.org/r/995174 (https://phabricator.wikimedia.org/T356481) [14:05:44] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ElineWMDE - https://phabricator.wikimedia.org/T357097 (10Jelto) [14:08:56] (03PS1) 10Esanders: MobileFrontend: Set fallback editor to 'visual' on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999813 [14:10:28] (03PS1) 10Andrea Denisse: grafana: Prevent race condition by excluding 'wal' directory in Loki sync [puppet] - 10https://gerrit.wikimedia.org/r/999820 (https://phabricator.wikimedia.org/T357026) [14:10:58] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421 (10MoritzMuehlenhoff) >>! 
In T316421#9528954, @Jelto wrote: > As far as I can tell Bullseye only has `nodejs` version `12.22`. In our apt repo we also have `14.20` and `... [14:11:50] (03CR) 10Alexandros Kosiaris: [C: 03+1] sre: have SystemdUnitFailed retry match Icinga's [alerts] - 10https://gerrit.wikimedia.org/r/999773 (https://phabricator.wikimedia.org/T357028) (owner: 10Filippo Giunchedi) [14:13:13] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1354/co" [puppet] - 10https://gerrit.wikimedia.org/r/999820 (https://phabricator.wikimedia.org/T357026) (owner: 10Andrea Denisse) [14:15:28] (03CR) 10Andrea Denisse: [V: 03+1] "Tested in Pontoon and PCC." [puppet] - 10https://gerrit.wikimedia.org/r/999820 (https://phabricator.wikimedia.org/T357026) (owner: 10Andrea Denisse) [14:16:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:16:51] (03CR) 10Clément Goubert: [C: 03+1] sre: have SystemdUnitFailed retry match Icinga's [alerts] - 10https://gerrit.wikimedia.org/r/999773 (https://phabricator.wikimedia.org/T357028) (owner: 10Filippo Giunchedi) [14:17:20] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: have SystemdUnitFailed retry match Icinga's [alerts] - 10https://gerrit.wikimedia.org/r/999773 (https://phabricator.wikimedia.org/T357028) (owner: 10Filippo Giunchedi) [14:19:30] (03CR) 10Filippo Giunchedi: "LGTM, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/999820 (https://phabricator.wikimedia.org/T357026) (owner: 10Andrea Denisse) [14:21:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:22:46] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:23:25] (SystemdUnitFailed) resolved: (2) ferm.service Failed on mw2357:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:23:55] (03PS2) 10Andrea Denisse: grafana: Prevent race condition by excluding 'wal' directory in Loki sync [puppet] - 10https://gerrit.wikimedia.org/r/999820 (https://phabricator.wikimedia.org/T357026) [14:24:43] (03CR) 10Andrea Denisse: grafana: Prevent race condition by excluding 'wal' directory in Loki sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/999820 (https://phabricator.wikimedia.org/T357026) (owner: 10Andrea Denisse) [14:25:50] (03CR) 10Filippo Giunchedi: [C: 03+1] "This is missing a receiver, breaking puppet on alerting hosts" [puppet] - 10https://gerrit.wikimedia.org/r/989900 (https://phabricator.wikimedia.org/T342578) (owner: 10Bking) [14:26:01] btullis: ^ [14:26:10] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1355/co" [puppet] - 10https://gerrit.wikimedia.org/r/999820 (https://phabricator.wikimedia.org/T357026) (owner: 10Andrea Denisse) [14:26:45] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:26:51] I'll send a followup patch [14:28:34] 10SRE, 
10Infrastructure-Foundations: Integrate Bookworm 12.5 point update - https://phabricator.wikimedia.org/T357133 (10MoritzMuehlenhoff) [14:28:44] (03PS1) 10Filippo Giunchedi: alertmanager: add data-engineering-mail missing receiver [puppet] - 10https://gerrit.wikimedia.org/r/999858 (https://phabricator.wikimedia.org/T342578) [14:28:57] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.5 point update - https://phabricator.wikimedia.org/T357133 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:29:22] RECOVERY - Check whether ferm is active by checking the default input chain on mw2357 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:30:23] (03CR) 10Btullis: [C: 03+1] "Thanks for catching this." [puppet] - 10https://gerrit.wikimedia.org/r/999858 (https://phabricator.wikimedia.org/T342578) (owner: 10Filippo Giunchedi) [14:31:40] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: add data-engineering-mail missing receiver [puppet] - 10https://gerrit.wikimedia.org/r/999858 (https://phabricator.wikimedia.org/T342578) (owner: 10Filippo Giunchedi) [14:31:45] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:32:55] btullis: np, if you (or anyone) know of a simple way to get CI to expand the erb template I'd love to get validation for alertmanager.yml before we actually merge [14:33:12] 10SRE, 10ops-codfw: Degraded RAID on db2194 - https://phabricator.wikimedia.org/T357135 (10ops-monitoring-bot) [14:33:45] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! 
nice" [puppet] - 10https://gerrit.wikimedia.org/r/999820 (https://phabricator.wikimedia.org/T357026) (owner: 10Andrea Denisse) [14:34:41] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission ms-be20[44-50] - https://phabricator.wikimedia.org/T356878 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:34:41] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase1035.eqiad.wmnet with OS bullseye [14:34:46] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421 (10akosiaris) >>! In T316421#9528954, @Jelto wrote: > I think we might have a problem with the newer `nodejs` version and Debian Bullseye. The release notes of etherpad-... [14:34:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host restbase1035.eqiad.wmnet with OS bullseye executed with errors: - restbase1035 (**FA... 
[14:35:57] (03CR) 10Andrea Denisse: [V: 03+1 C: 03+2] grafana: Prevent race condition by excluding 'wal' directory in Loki sync [puppet] - 10https://gerrit.wikimedia.org/r/999820 (https://phabricator.wikimedia.org/T357026) (owner: 10Andrea Denisse) [14:38:08] RECOVERY - Check whether ferm is active by checking the default input chain on mw2424 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:39:34] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:41:43] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2194 - https://phabricator.wikimedia.org/T357015 (10Marostegui) [14:42:07] 10SRE, 10ops-codfw: Degraded RAID on db2194 - https://phabricator.wikimedia.org/T357135 (10Marostegui) [14:47:01] (03CR) 10MVernon: "On a very cursory glance at redfish.py it looks like the JSON here violates a number of assumptions in poll_task - rather than an array of" [cookbooks] - 10https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [14:49:41] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2194 - https://phabricator.wikimedia.org/T357015 (10Jhancock.wm) as expected the drive did not come back with their recommended troubleshooting. Created a dispatch. SR184935290. Will notify when the disk is replaced. 
[14:50:34] (03CR) 10Brouberol: [C: 03+2] service: register superset and superset-next under ingress [puppet] - 10https://gerrit.wikimedia.org/r/997857 (https://phabricator.wikimedia.org/T356483) (owner: 10Brouberol) [14:58:48] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:04:34] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 64 probes of 799 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:05:47] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cloudelastic1005*,cloudelastic1006*,cloudelastic1007*,cloudelastic1008* for IP migration - bking@cumin2002 - T355617 [15:05:51] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cloudelastic1005*,cloudelastic1006*,cloudelastic1007*,cloudelastic1008* for IP migration - bking@cumin2002 - T355617 [15:06:02] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 [15:08:13] (03PS1) 10Alexandros Kosiaris: service mesh: Listen on IPv6 too (copy patch) [deployment-charts] - 10https://gerrit.wikimedia.org/r/999866 [15:08:18] (03PS1) 10Alexandros Kosiaris: service mesh: Listen unconditionally on IPv6/IPv4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/999867 (https://phabricator.wikimedia.org/T255568) [15:09:08] (03CR) 10CI reject: [V: 04-1] service mesh: Listen unconditionally on IPv6/IPv4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/999867 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [15:09:42] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 9 
probes of 799 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:10:19] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10Jhancock.wm) I can't seem to access the idrac remotely. Is it okay if I power down the server at this time? [15:13:54] (03PS1) 10Alexandros Kosiaris: termbox: Bump module dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/999882 (https://phabricator.wikimedia.org/T255568) [15:15:05] (03CR) 10CI reject: [V: 04-1] termbox: Bump module dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/999882 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [15:19:47] (03PS2) 10Alexandros Kosiaris: service mesh: Listen unconditionally on IPv6/IPv4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/999867 (https://phabricator.wikimedia.org/T255568) [15:19:49] (03PS2) 10Alexandros Kosiaris: termbox: Bump module dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/999882 (https://phabricator.wikimedia.org/T255568) [15:21:19] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10hnowlan) >>! In T355333#9529271, @Jhancock.wm wrote: > I can't seem to access the idrac remotely. Is it okay if I power down the server at this time? I had some weirdness wh... 
[15:21:25] (03CR) 10CI reject: [V: 04-1] termbox: Bump module dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/999882 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [15:21:32] (03CR) 10CI reject: [V: 04-1] service mesh: Listen unconditionally on IPv6/IPv4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/999867 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [15:26:00] (03PS3) 10Alexandros Kosiaris: service mesh: Listen unconditionally on IPv6/IPv4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/999867 (https://phabricator.wikimedia.org/T255568) [15:26:02] (03PS3) 10Alexandros Kosiaris: termbox: Bump module dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/999882 (https://phabricator.wikimedia.org/T255568) [15:29:06] (03PS11) 10Effie Mouzeli: mcrouter: add vanila chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/979107 (https://phabricator.wikimedia.org/T346690) [15:29:51] (03CR) 10CI reject: [V: 04-1] mcrouter: add vanila chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/979107 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [15:34:02] (03PS12) 10Effie Mouzeli: mcrouter: add vanila chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/979107 (https://phabricator.wikimedia.org/T346690) [15:34:42] (03CR) 10CI reject: [V: 04-1] mcrouter: add vanila chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/979107 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [15:37:06] (03PS1) 10Hashar: Bump javascript from es2018 to es2020 [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/999902 [15:38:54] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144 (10MoritzMuehlenhoff) [15:38:59] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144 
(10MoritzMuehlenhoff) p:05Triage→03Medium [15:43:34] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10Jhancock.wm) I reseated the NIC and it connected. when I rebooted it went down again and didn't come up. swapped it out and rebooted it. stayed up this time. should have repl... [15:44:50] 10SRE, 10Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw row A and B. - https://phabricator.wikimedia.org/T354872 (10MatthewVernon) Swift uses IP(v4) address (and then device name) as the identifier for entries in its rings. Additionally, when adding nodes to the ring, we use IP add... [15:45:04] 10SRE, 10SRE-swift-storage, 10Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw row A and B. - https://phabricator.wikimedia.org/T354872 (10MatthewVernon) [15:51:48] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:53:28] (03CR) 10Jforrester: [C: 03+1] Bump javascript from es2018 to es2020 [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/999902 (owner: 10Hashar) [15:58:24] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:58:54] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host restbase1035.eqiad.wmnet with OS bullseye [15:58:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host restbase1035.eqiad.wmnet with OS bullseye [15:59:03] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase1035.eqiad.wmnet with OS bullseye [15:59:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase: 
Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host restbase1035.eqiad.wmnet with OS bullseye executed with errors: - restbase1035 (**FA... [15:59:38] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney) [15:59:50] (03PS1) 10Cathal Mooney: Add Hurricane Electric IPv6 transit over NL-IX [homer/public] - 10https://gerrit.wikimedia.org/r/999923 (https://phabricator.wikimedia.org/T322630) [15:59:57] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host restbase1035.mgmt.eqiad.wmnet with reboot policy FORCED [16:00:10] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A3 from asw-a3-codfw to lsw1-a3-codfw - https://phabricator.wikimedia.org/T355862 (10cmooney) 05Open→03Resolved a:03cmooney Thanks everyone for the help on getting this done! 
[16:01:59] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Arthur Taylor - https://phabricator.wikimedia.org/T357147 (10ArthurTaylor) [16:02:10] (03CR) 10Cathal Mooney: [C: 03+2] Add Hurricane Electric IPv6 transit over NL-IX [homer/public] - 10https://gerrit.wikimedia.org/r/999923 (https://phabricator.wikimedia.org/T322630) (owner: 10Cathal Mooney) [16:03:28] (03Merged) 10jenkins-bot: Add Hurricane Electric IPv6 transit over NL-IX [homer/public] - 10https://gerrit.wikimedia.org/r/999923 (https://phabricator.wikimedia.org/T322630) (owner: 10Cathal Mooney) [16:06:13] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Arthur Taylor - https://phabricator.wikimedia.org/T357147 (10karapayneWMDE) Approved by me, the EM for the wikidata team [16:13:35] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:16:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:17:44] Hey all is anyone able to do an emergency deploy? [16:17:54] because of the swift train rollout we didn't catch this bug on Wednesday: https://phabricator.wikimedia.org/T356928 and it's problematic from an accessibility point of view and has triggered at least 4 village pump discussions on different projects. 
[16:17:57] The patch is quite trivial: https://gerrit.wikimedia.org/r/c/998974 [16:18:39] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:18:43] * Lucas_WMDE is probably around for another hour and so and can run scap if the emergency deploy is approved [16:18:52] (s/and/or/ oops) [16:20:30] backporting that seems reasonable to me [16:21:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:21:17] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/999533 (https://phabricator.wikimedia.org/T357093) (owner: 10Muehlenhoff) [16:21:50] thcipriani and brennen: any objections to Jdlrobson’s emergency deploy? (CSS-only) [16:23:40] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:23:45] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/999715 (owner: 10Muehlenhoff) [16:26:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T352010)', diff saved to https://phabricator.wikimedia.org/P56590 and previous config saved to /var/cache/conftool/dbconfig/20240209-162643-ladsgroup.json [16:26:58] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [16:27:45] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:27:59] (03CR) 10Hashar: [C: 03+1] color-link-visited was not defined [skins/MinervaNeue] (wmf/1.42.0-wmf.17) - 
10https://gerrit.wikimedia.org/r/998974 (https://phabricator.wikimedia.org/T356928) (owner: 10Jdlrobson) [16:29:07] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:30:05] Lucas_WMDE: thcipriani is on a plane but he has approved in Slack [16:30:18] (doesn't have IRC access) [16:30:35] not sure if that's sufficient [16:31:10] (i sent you a screenscrab on your slack) [16:32:08] * Lucas_WMDE looks [16:32:36] alright, let’s do it then [16:32:45] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:33:16] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [skins/MinervaNeue] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998974 (https://phabricator.wikimedia.org/T356928) (owner: 10Jdlrobson) [16:33:22] * Lucas_WMDE deploying [16:34:28] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:34:47] thanks Lucas_WMDE much appreciated :) [16:37:07] Jdlrobson, Lucas_WMDE: just caught up with scrollback - let me know if i can be useful. 
[16:37:25] brennen: good morning :) hopefully this will be very uneventful [16:39:58] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:41:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P56591 and previous config saved to /var/cache/conftool/dbconfig/20240209-164150-ladsgroup.json [16:43:41] (03PS1) 10Hashar: Gerrit 3.8 no more set redundant real_author [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/999928 (https://phabricator.wikimedia.org/T354886) [16:44:16] brennen: I have +1ed the patch in the name of releng :) [16:44:35] cause setting a less variable probably does not require any more paperwork than a cr+1 :] [16:45:00] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/999088 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [16:48:13] hashar: yeah, agreed. 
:) [16:48:28] (03PS2) 10Bking: cloudelastic: Begin private IP migration for cloudelastic1007 [puppet] - 10https://gerrit.wikimedia.org/r/999088 (https://phabricator.wikimedia.org/T355617) [16:49:36] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/999088 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [16:55:41] (03Merged) 10jenkins-bot: color-link-visited was not defined [skins/MinervaNeue] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998974 (https://phabricator.wikimedia.org/T356928) (owner: 10Jdlrobson) [16:56:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P56592 and previous config saved to /var/cache/conftool/dbconfig/20240209-165657-ladsgroup.json [16:56:59] “The following are unexpected commits pulled from origin for /srv/mediawiki-staging” [16:57:00] * Lucas_WMDE looks [16:57:27] I guess https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/998624 was +2ed without being pulled to deployment.eqiad.wmnet [16:57:46] * Lucas_WMDE continues with deployment [16:57:50] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:998974|color-link-visited was not defined (T356928)]] [16:57:54] T356928: Regression: Visited links on mobile appearing as black - https://phabricator.wikimedia.org/T356928 [16:59:20] !log lucaswerkmeister-wmde@deploy2002 jdlrobson and lucaswerkmeister-wmde: Backport for [[gerrit:998974|color-link-visited was not defined (T356928)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:03:22] (03CR) 10BCornwall: Add module for ncmonitor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) (owner: 10BCornwall) [17:03:26] Jdlrobson: want to test the change on mwdebug? 
[17:03:30] (sorry, got distracted for a few minutes) [17:04:25] Lucas_WMDE: yep on it [17:04:27] fix seems to work for me, at least (though it needed a Ctrl+F5) [17:04:27] ok [17:04:38] yep that works [17:04:45] please sync Lucas_WMDE [17:04:47] !log lucaswerkmeister-wmde@deploy2002 jdlrobson and lucaswerkmeister-wmde: Continuing with sync [17:04:53] ok, thanks for testing! [17:08:04] (03CR) 10Vgutierrez: [C: 04-1] Add module for ncmonitor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) (owner: 10BCornwall) [17:11:04] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:998974|color-link-visited was not defined (T356928)]] (duration: 13m 13s) [17:11:22] T356928: Regression: Visited links on mobile appearing as black - https://phabricator.wikimedia.org/T356928 [17:11:26] * Lucas_WMDE done [17:12:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T352010)', diff saved to https://phabricator.wikimedia.org/P56593 and previous config saved to /var/cache/conftool/dbconfig/20240209-171203-ladsgroup.json [17:12:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [17:12:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [17:12:21] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [17:12:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1213:3315 (T352010)', diff saved to https://phabricator.wikimedia.org/P56594 and previous config saved to /var/cache/conftool/dbconfig/20240209-171225-ladsgroup.json [17:16:57] (03PS1) 10EoghanGaffney: [phabricator] Ignore 'some files vanished' error for phab repos rsync [puppet] - 10https://gerrit.wikimedia.org/r/999945 [17:18:04] (03CR) 10Raymond Ndibe: "Thanks for pointing this out Taavi. 
The 1GB `var/lib/nginx` folder and the 512m `client_max_body_size` we have right now also seems proble" [puppet] - 10https://gerrit.wikimedia.org/r/998659 (https://phabricator.wikimedia.org/T351178) (owner: 10Raymond Ndibe) [17:18:28] !log rolling restart of pods on k8s aux eqiad T356661 [17:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:44] T356661: Cross fleet runc upgrades - https://phabricator.wikimedia.org/T356661 [17:18:50] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1356/co" [puppet] - 10https://gerrit.wikimedia.org/r/999945 (owner: 10EoghanGaffney) [17:18:54] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 1 VM %request for etherpad - https://phabricator.wikimedia.org/T357159 (10Dzahn) [17:18:55] thanks Lucas_WMDE ! Looks like it's working in production now! [17:19:01] \o/ [17:19:58] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney) [17:20:40] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 1 VM %request for etherpad - https://phabricator.wikimedia.org/T357159 (10Dzahn) [17:20:49] (PuppetFailure) firing: Puppet has failed on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:21:09] 10SRE, 10Infrastructure-Foundations, 10collaboration-services, 10vm-requests: Site: 1 VM %request for etherpad - https://phabricator.wikimedia.org/T357159 (10Dzahn) 05Open→03In progress a:03Dzahn [17:21:18] 10SRE, 10Infrastructure-Foundations, 10collaboration-services, 10vm-requests: Site: 1 VM %request for etherpad - https://phabricator.wikimedia.org/T357159 (10Dzahn) [17:21:20] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services: 
Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421 (10Dzahn) [17:24:17] 10SRE, 10Infrastructure-Foundations, 10collaboration-services, 10vm-requests: Site: 1 VM %request for etherpad - https://phabricator.wikimedia.org/T357159 (10Dzahn) [17:26:55] 10SRE, 10Infrastructure-Foundations, 10collaboration-services, 10vm-requests: Site: 1 VM %request for etherpad - https://phabricator.wikimedia.org/T357159 (10Dzahn) [17:27:25] (03PS1) 10Dzahn: site: add etherpad1004 with insetup-role [puppet] - 10https://gerrit.wikimedia.org/r/999957 (https://phabricator.wikimedia.org/T357159) [17:28:45] (03CR) 10BCornwall: Add module for ncmonitor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) (owner: 10BCornwall) [17:29:18] (03CR) 10Dzahn: [C: 03+2] site: add etherpad1004 with insetup-role [puppet] - 10https://gerrit.wikimedia.org/r/999957 (https://phabricator.wikimedia.org/T357159) (owner: 10Dzahn) [17:30:19] 10SRE, 10Infrastructure-Foundations, 10collaboration-services, 10vm-requests, 10Patch-For-Review: Site: 1 VM %request for etherpad - https://phabricator.wikimedia.org/T357159 (10Dzahn) ` dzahn@cumin1002:~$ sudo cookbook sre.ganeti.makevm --vcpus 1 --memory 2 --disk 15 --cluster eqiad -t T357159 --group B... [17:30:20] !log dzahn@cumin1002 START - Cookbook sre.ganeti.makevm for new host etherpad1004.eqiad.wmnet [17:30:21] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [17:30:26] (03CR) 10Majavah: [C: 04-1] "I would prefer to disable buffering unless there is a reason not to. The proxies have 4G RAM so we have some headroom increasing the `/var" [puppet] - 10https://gerrit.wikimedia.org/r/998659 (https://phabricator.wikimedia.org/T351178) (owner: 10Raymond Ndibe) [17:33:46] (03CR) 10David Caro: "I agree, better skip needed extra resources on the proxy." 
[puppet] - 10https://gerrit.wikimedia.org/r/998659 (https://phabricator.wikimedia.org/T351178) (owner: 10Raymond Ndibe) [17:34:19] (03CR) 10David Caro: "In general I mean, not only for harbor pushes." [puppet] - 10https://gerrit.wikimedia.org/r/998659 (https://phabricator.wikimedia.org/T351178) (owner: 10Raymond Ndibe) [17:35:24] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM etherpad1004.eqiad.wmnet - dzahn@cumin1002" [17:39:50] !log merging netbox/hiera data changes that add restbase hosts and show up when I run unrelated cookbook creating a new VM - T354893 [17:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:03] T354893: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 [17:40:16] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM etherpad1004.eqiad.wmnet - dzahn@cumin1002" [17:40:16] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:40:16] !log dzahn@cumin1002 START - Cookbook sre.dns.wipe-cache etherpad1004.eqiad.wmnet on all recursors [17:40:19] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) etherpad1004.eqiad.wmnet on all recursors [17:40:46] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM etherpad1004.eqiad.wmnet - dzahn@cumin1002" [17:41:37] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM etherpad1004.eqiad.wmnet - dzahn@cumin1002" [17:43:06] !log dzahn@cumin1002 START - Cookbook sre.hosts.reimage for host etherpad1004.eqiad.wmnet with OS bookworm [17:43:11] 10SRE, 
10Infrastructure-Foundations, 10collaboration-services, 10vm-requests: Site: 1 VM %request for etherpad - https://phabricator.wikimedia.org/T357159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host etherpad1004.eqiad.wmnet with OS bookworm [17:43:15] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host etherpad1004.eqiad.wmnet with OS bookworm [17:43:15] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host etherpad1004.eqiad.wmnet [17:43:20] 10SRE, 10Infrastructure-Foundations, 10collaboration-services, 10vm-requests: Site: 1 VM %request for etherpad - https://phabricator.wikimedia.org/T357159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host etherpad1004.eqiad.wmnet with OS bookworm executed with... [17:53:24] (03PS1) 10Ebernhardson: cirrus: Re-enable cloudelastic writes for non-testwikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999962 (https://phabricator.wikimedia.org/T352335) [17:54:58] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1008.eqiad.wmnet with OS bullseye [17:55:14] (03PS2) 10Dzahn: [phabricator] Ignore 'some files vanished' error for phab repos rsync [puppet] - 10https://gerrit.wikimedia.org/r/999945 (https://phabricator.wikimedia.org/T357158) (owner: 10EoghanGaffney) [17:55:57] (03CR) 10Dzahn: [C: 03+2] [phabricator] Ignore 'some files vanished' error for phab repos rsync [puppet] - 10https://gerrit.wikimedia.org/r/999945 (https://phabricator.wikimedia.org/T357158) (owner: 10EoghanGaffney) [17:55:59] (03CR) 10Dzahn: [V: 03+2 C: 03+2] [phabricator] Ignore 'some files vanished' error for phab repos rsync [puppet] - 10https://gerrit.wikimedia.org/r/999945 (https://phabricator.wikimedia.org/T357158) (owner: 10EoghanGaffney) [18:01:10] (03PS1) 10Majavah: puppet: do not log diff for private keys [puppet] - 10https://gerrit.wikimedia.org/r/999964 
[18:01:43] (03CR) 10CDanis: [C: 03+1] puppet: do not log diff for private keys [puppet] - 10https://gerrit.wikimedia.org/r/999964 (owner: 10Majavah) [18:02:10] (03CR) 10JHathaway: [C: 03+1] puppet: do not log diff for private keys [puppet] - 10https://gerrit.wikimedia.org/r/999964 (owner: 10Majavah) [18:03:31] (03CR) 10Majavah: [C: 03+2] puppet: do not log diff for private keys [puppet] - 10https://gerrit.wikimedia.org/r/999964 (owner: 10Majavah) [18:05:48] (PuppetFailure) resolved: Puppet has failed on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:11:09] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1008.eqiad.wmnet with reason: host reimage [18:11:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Comm Error: Backplane 0 on cloudelastic1008 - https://phabricator.wikimedia.org/T356919 (10VRiley-WMF) a:05bking→03VRiley-WMF [18:14:06] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1008.eqiad.wmnet with reason: host reimage [18:15:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Comm Error: Backplane 0 on cloudelastic1008 - https://phabricator.wikimedia.org/T356919 (10VRiley-WMF) Worked with @bking on this. Verified it was okay to power down. Reseated the cable for the backplane and gave it a very stern... 
[18:15:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Comm Error: Backplane 0 on cloudelastic1008 - https://phabricator.wikimedia.org/T356919 (10VRiley-WMF) 05Open→03Resolved [18:16:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:19:52] !log dzahn@cumin1002 START - Cookbook sre.hosts.reimage for host etherpad1004.eqiad.wmnet with OS bookworm [18:21:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:22:53] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Arthur Taylor - https://phabricator.wikimedia.org/T357147 (10Dzahn) [18:23:47] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:24:05] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Arthur Taylor - https://phabricator.wikimedia.org/T357147 (10Dzahn) user already has NDA and shell access, it's only about adding to the extra group. 
so all boxes checked besides the group approval [18:25:35] 10SRE, 10ops-eqiad, 10DBA, 10decommission-hardware: decommission db1124.eqiad.wmnet - https://phabricator.wikimedia.org/T334388 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF [18:26:19] (03PS3) 10CDanis: [aux-k8s-eqiad] add kube-state-metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/978129 (https://phabricator.wikimedia.org/T264625) [18:26:52] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Arthur Taylor - https://phabricator.wikimedia.org/T357147 (10Dzahn) @odimitrijevic group approval is requested for this addition of a Wikidata/WMDE user (with existing NDA) to analytics-privatedata-users. ` Reason for access:... [18:26:55] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Arthur Taylor - https://phabricator.wikimedia.org/T357147 (10Dzahn) 05Open→03In progress [18:27:20] (03PS4) 10CDanis: [aux-k8s-eqiad] add kube-state-metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/978129 (https://phabricator.wikimedia.org/T264625) [18:27:58] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Arthur Taylor - https://phabricator.wikimedia.org/T357147 (10Dzahn) p:05Triage→03Medium [18:32:08] (03CR) 10CDanis: [C: 03+2] [aux-k8s-eqiad] add kube-state-metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/978129 (https://phabricator.wikimedia.org/T264625) (owner: 10CDanis) [18:32:21] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on etherpad1004.eqiad.wmnet with reason: host reimage [18:32:41] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T355170 (10Dzahn) Hi @Arrbee, this ticket still needs some clarification from you what is needed. 
Thank you [18:32:46] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin2002" [18:34:49] (03Merged) 10jenkins-bot: [aux-k8s-eqiad] add kube-state-metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/978129 (https://phabricator.wikimedia.org/T264625) (owner: 10CDanis) [18:35:13] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on etherpad1004.eqiad.wmnet with reason: host reimage [18:35:41] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [18:35:42] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [18:35:55] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [18:35:58] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [18:36:07] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin2002" [18:36:09] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1008.eqiad.wmnet with OS bullseye [18:37:24] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [18:37:26] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [18:38:32] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [18:39:32] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[18:49:07] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host etherpad1004.eqiad.wmnet with OS bookworm [18:49:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T352010)', diff saved to https://phabricator.wikimedia.org/P56597 and previous config saved to /var/cache/conftool/dbconfig/20240209-184910-ladsgroup.json [18:49:30] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [18:52:25] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421 (10Dzahn) [18:52:31] 10SRE, 10Infrastructure-Foundations, 10collaboration-services, 10vm-requests: Site: 1 VM %request for etherpad - https://phabricator.wikimedia.org/T357159 (10Dzahn) 05In progress→03Resolved p:05Triage→03Medium reimage failed because the puppetmaster had an issue at this time. reimaged again after... [18:53:38] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421 (10Dzahn) I created new VM etherpad1004 with bookworm. It currently has the "insetup" role applied and can be used. 
(T357159) [18:57:15] (03PS1) 10Dzahn: site: add etherpad role to etherpad1004 [puppet] - 10https://gerrit.wikimedia.org/r/999973 (https://phabricator.wikimedia.org/T316421) [19:01:17] (03PS13) 10Effie Mouzeli: mcrouter: add vanila chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/979107 (https://phabricator.wikimedia.org/T346690) [19:04:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P56598 and previous config saved to /var/cache/conftool/dbconfig/20240209-190416-ladsgroup.json [19:09:17] (03PS35) 10Effie Mouzeli: mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [19:13:12] (03PS14) 10Effie Mouzeli: mcrouter: add vanila chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/979107 (https://phabricator.wikimedia.org/T346690) [19:13:49] 10SRE, 10Traffic: Lower geodns TTLs from 600 (10min) to 300 (5min) - https://phabricator.wikimedia.org/T140365 (10ssingh) Thanks for the feedback folks on the task and on IRC. We plan to merge this patch next week (week of February 12) since there have been no concerns raised so far. If there are any concerns... 
[19:19:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P56599 and previous config saved to /var/cache/conftool/dbconfig/20240209-191923-ladsgroup.json [19:26:10] (03PS15) 10Effie Mouzeli: mcrouter: add vanila chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/979107 (https://phabricator.wikimedia.org/T346690) [19:26:12] (03PS36) 10Effie Mouzeli: mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [19:26:31] (03CR) 10CI reject: [V: 04-1] mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [19:27:48] (03PS37) 10Effie Mouzeli: mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [19:29:30] (03PS4) 10Ryan Kemper: cloudelastic: Complete cloudelastic1008's migration [puppet] - 10https://gerrit.wikimedia.org/r/998498 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [19:30:02] (03CR) 10Bking: [C: 03+2] cloudelastic: Complete cloudelastic1008's migration [puppet] - 10https://gerrit.wikimedia.org/r/998498 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [19:32:42] (03PS1) 10Ryan Kemper: wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/999987 (https://phabricator.wikimedia.org/T347624) [19:34:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T352010)', diff saved to https://phabricator.wikimedia.org/P56600 and previous config saved to /var/cache/conftool/dbconfig/20240209-193430-ladsgroup.json [19:34:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [19:34:40] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 
[19:34:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [19:34:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T352010)', diff saved to https://phabricator.wikimedia.org/P56601 and previous config saved to /var/cache/conftool/dbconfig/20240209-193452-ladsgroup.json [19:43:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315 (T352010)', diff saved to https://phabricator.wikimedia.org/P56602 and previous config saved to /var/cache/conftool/dbconfig/20240209-194310-ladsgroup.json [19:43:26] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [19:44:05] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:44:19] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 121, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:45:21] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services, 10Patch-For-Review: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421 (10Dzahn) Created etherpad-bookworm.devtools in wmcs, applied prod role there. Besides the obvious, missing etherpad-lite package, I noticed: `... 
[19:48:45] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:48:59] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:53:39] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 121, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:54:33] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:58:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315', diff saved to https://phabricator.wikimedia.org/P56603 and previous config saved to /var/cache/conftool/dbconfig/20240209-195817-ladsgroup.json [19:58:21] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:06:05] (03PS38) 10Effie Mouzeli: mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [20:13:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315', diff saved to https://phabricator.wikimedia.org/P56604 and previous config saved to /var/cache/conftool/dbconfig/20240209-201324-ladsgroup.json [20:16:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:17:25] (SystemdUnitFailed) firing: (2) elasticsearch-disable-readahead.service Failed on cloudelastic1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:21:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:22:25] (SystemdUnitFailed) resolved: (2) elasticsearch-disable-readahead.service Failed on cloudelastic1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:22:48] (03PS1) 10Bking: elasticsearch: avoid systemd timeouts when large clusters start up [puppet] - 10https://gerrit.wikimedia.org/r/1000018 (https://phabricator.wikimedia.org/T355617) [20:23:48] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:23:50] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1000018 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [20:28:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315 (T352010)', diff saved to https://phabricator.wikimedia.org/P56605 and previous config saved to /var/cache/conftool/dbconfig/20240209-202830-ladsgroup.json [20:28:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 
db1216.eqiad.wmnet with reason: Maintenance [20:28:36] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [20:28:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [20:28:53] (03CR) 10Ebernhardson: [C: 03+1] elasticsearch: avoid systemd timeouts when large clusters start up [puppet] - 10https://gerrit.wikimedia.org/r/1000018 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [20:29:11] (03PS1) 10Dzahn: miscweb: bump bugzilla to version 2024-02-09-201707 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1000025 (https://phabricator.wikimedia.org/T317436) [20:29:40] (03CR) 10Bking: [C: 03+2] elasticsearch: avoid systemd timeouts when large clusters start up [puppet] - 10https://gerrit.wikimedia.org/r/1000018 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [20:29:58] (03CR) 10Dzahn: "https://gitlab.wikimedia.org/repos/sre/miscweb/bugzilla/-/jobs/205945" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1000025 (https://phabricator.wikimedia.org/T317436) (owner: 10Dzahn) [20:33:19] (03PS1) 10Effie Mouzeli: cache.mcrouter: minor fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1000032 [20:38:06] (03PS2) 10Effie Mouzeli: mediawiki: Bump sextant module versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/999780 (owner: 10Alexandros Kosiaris) [20:45:07] (03PS3) 10Effie Mouzeli: mediawiki: Bump sextant module versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/999780 (owner: 10Alexandros Kosiaris) [20:46:07] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: apply new systemd settings - bking@cumin2002 - T355617 [20:46:12] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 [20:49:14] PROBLEM - ElasticSearch health check for 
shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 209 threshold =0.15 breach: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 241, active_shards: 241, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 209, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, num [20:49:14] n_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 53.55555555555556 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:49:23] (03PS1) 10Effie Mouzeli: mediawiki: rename cache.mcrouter.deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1000039 [20:49:54] ^^ elastic alert is expected [20:50:22] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 241, active_shards: 450, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max [20:50:22] _in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:54:15] (03PS2) 10Effie Mouzeli: cache.mcrouter: minor fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1000032 [20:54:48] (03PS3) 10Effie Mouzeli: cache.mcrouter: minor fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1000032 [20:55:36] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: apply new systemd settings - bking@cumin2002 - T355617 [20:55:42] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 [21:06:53] !log bking@cumin2002 START - Cookbook 
sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new systemd settings - bking@cumin2002 - T355617 [21:07:09] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 [21:09:42] !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new systemd settings - bking@cumin2002 - T355617 [21:19:30] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:19:58] (03CR) 10Cwhite: "Nothing about this patch immediately feels problematic, but there may be history here that I'm unaware of. I'd defer to Keith or Filippo." [puppet] - 10https://gerrit.wikimedia.org/r/997555 (owner: 10JHathaway) [21:20:38] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 125 probes of 727 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:21:48] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:24:38] (03CR) 10JHathaway: "Thanks for taking a look @cwhite, fortunately they mostly stole erb's syntax, so the new cognitive load is pretty small!" 
[puppet] - 10https://gerrit.wikimedia.org/r/997555 (owner: 10JHathaway) [21:25:36] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 43 probes of 727 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:36:16] (03CR) 10Bartosz Dziewoński: [C: 03+1] MobileFrontend: Set fallback editor to 'visual' on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999813 (owner: 10Esanders) [21:36:48] !log bking@deploy2002 install 'python3-plac' pkg T348685 [21:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:53] T348685: Track and clean up object storage used by rdf-streaming-updater - https://phabricator.wikimedia.org/T348685 [21:38:59] !log bking@deploy2002 install 'python3-boto3' pkg T348685 [21:39:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:37] (03PS1) 10Andrew Bogott: Designate nova_fixed_multi: run through Black [puppet] - 10https://gerrit.wikimedia.org/r/1000073 [21:51:39] (03PS1) 10Andrew Bogott: Designate nova_fixed_multi: add some debug lines [puppet] - 10https://gerrit.wikimedia.org/r/1000074 (https://phabricator.wikimedia.org/T356516) [21:53:04] (03CR) 10CI reject: [V: 04-1] Designate nova_fixed_multi: add some debug lines [puppet] - 10https://gerrit.wikimedia.org/r/1000074 (https://phabricator.wikimedia.org/T356516) (owner: 10Andrew Bogott) [21:57:08] (03CR) 10Andrew Bogott: [C: 03+2] Designate nova_fixed_multi: run through Black [puppet] - 10https://gerrit.wikimedia.org/r/1000073 (owner: 10Andrew Bogott) [21:59:34] (03PS2) 10Andrew Bogott: Designate nova_fixed_multi: add some debug lines [puppet] - 10https://gerrit.wikimedia.org/r/1000074 (https://phabricator.wikimedia.org/T356516) [22:06:47] (03CR) 10Andrew Bogott: [C: 03+2] Designate nova_fixed_multi: add some debug lines [puppet] - 
10https://gerrit.wikimedia.org/r/1000074 (https://phabricator.wikimedia.org/T356516) (owner: 10Andrew Bogott) [22:16:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:16:25] (03Abandoned) 10Ryan Kemper: wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/999987 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [22:18:50] (03PS24) 10Ryan Kemper: wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) [22:21:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:23:16] (03CR) 10Ryan Kemper: [C: 03+2] "Forgot to push comment earlier; no action needed here" [puppet] - 10https://gerrit.wikimedia.org/r/991427 (https://phabricator.wikimedia.org/T350464) (owner: 10Ryan Kemper) [22:24:43] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:26:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:31:17] (03PS23) 10BCornwall: Add module for ncmonitor [puppet] - 10https://gerrit.wikimedia.org/r/991438 
(https://phabricator.wikimedia.org/T355190) [22:32:45] (03CR) 10BCornwall: Add module for ncmonitor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) (owner: 10BCornwall) [22:40:29] (03CR) 10Ottomata: WIP - add webrequest.frontend stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983905 (https://phabricator.wikimedia.org/T314956) (owner: 10Ottomata) [22:51:29] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10TimedMediaHandler, 10media-backups: Consider increasing $wgTranscodeBackgroundSizeLimit to 5GB - https://phabricator.wikimedia.org/T357184 (10Bawolff) [23:04:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1230.eqiad.wmnet with reason: Maintenance [23:04:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1230.eqiad.wmnet with reason: Maintenance [23:04:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1230 (T352010)', diff saved to https://phabricator.wikimedia.org/P56606 and previous config saved to /var/cache/conftool/dbconfig/20240209-230425-ladsgroup.json [23:04:36] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [23:45:13] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10TimedMediaHandler, 10media-backups: Consider increasing $wgTranscodeBackgroundSizeLimit to 5GB - https://phabricator.wikimedia.org/T357184 (10brion) Here's a 4K video that fits in the previous upload limit but has an estimated bitrate resulti...