[00:01:06] (PS1) Andrew Bogott: octavia.conf: set heartbeat_key [puppet] - https://gerrit.wikimedia.org/r/1146780
[00:01:57] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:02:23] (CR) Andrew Bogott: [C:+2] octavia.conf: set heartbeat_key [puppet] - https://gerrit.wikimedia.org/r/1146780 (owner: Andrew Bogott)
[00:06:41] FIRING: [15x] ConfdResourceFailed: confd resource _var_netmapper_public_clouds.json.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:09:39] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1146781
[00:09:39] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1146781 (owner: TrainBranchBot)
[00:11:41] FIRING: [111x] ConfdResourceFailed: confd resource _var_netmapper_public_clouds.json.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:15:14] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cassandra-dev2002.codfw.wmnet with reason: host reimage
[00:17:44] (PS1) Andrew Bogott: Octavia.conf: make heartbeat_key a secret [puppet] - https://gerrit.wikimedia.org/r/1146782 (https://phabricator.wikimedia.org/T393783)
[00:18:49] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cassandra-dev2002.codfw.wmnet with reason: host reimage
[00:18:50] (CR) CI reject: [V:-1] Octavia.conf: make heartbeat_key a secret [puppet] - https://gerrit.wikimedia.org/r/1146782 (https://phabricator.wikimedia.org/T393783) (owner: Andrew Bogott)
[00:19:14] (PS1) Andrew Bogott: Add secret octavia heartbeat keys [labs/private] - https://gerrit.wikimedia.org/r/1146784
[00:20:33] (PS2) Andrew Bogott: Octavia.conf: make heartbeat_key a secret [puppet] - https://gerrit.wikimedia.org/r/1146782 (https://phabricator.wikimedia.org/T393783)
[00:21:30] (PS2) Andrew Bogott: Add secret octavia heartbeat keys [labs/private] - https://gerrit.wikimedia.org/r/1146784
[00:21:45] (CR) Andrew Bogott: [V:+2 C:+2] Add secret octavia heartbeat keys [labs/private] - https://gerrit.wikimedia.org/r/1146784 (owner: Andrew Bogott)
[00:22:01] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1146782 (https://phabricator.wikimedia.org/T393783) (owner: Andrew Bogott)
[00:27:02] (PS1) Andrew Bogott: Correct spelling of 'heartbeat' [labs/private] - https://gerrit.wikimedia.org/r/1146785
[00:28:21] (CR) Andrew Bogott: [V:+2 C:+2] Correct spelling of 'heartbeat' [labs/private] - https://gerrit.wikimedia.org/r/1146785 (owner: Andrew Bogott)
[00:28:59] (CR) Andrew Bogott: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1146782 (https://phabricator.wikimedia.org/T393783) (owner: Andrew Bogott)
[00:29:10] (PS3) Andrew Bogott: Octavia.conf: make heartbeat_key a secret. [puppet] - https://gerrit.wikimedia.org/r/1146782 (https://phabricator.wikimedia.org/T393783)
[00:29:11] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1146782 (https://phabricator.wikimedia.org/T393783) (owner: Andrew Bogott)
[00:30:08] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1146781 (owner: TrainBranchBot)
[00:31:57] FIRING: HelmReleaseBadStatus: Helm release miscweb/design-landing-page on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=miscweb - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[00:32:17] (PS4) Andrew Bogott: Octavia.conf: make heartbeat_key a secret [puppet] - https://gerrit.wikimedia.org/r/1146782 (https://phabricator.wikimedia.org/T393783)
[00:32:25] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1146782 (https://phabricator.wikimedia.org/T393783) (owner: Andrew Bogott)
[00:33:22] (CR) CI reject: [V:-1] Octavia.conf: make heartbeat_key a secret [puppet] - https://gerrit.wikimedia.org/r/1146782 (https://phabricator.wikimedia.org/T393783) (owner: Andrew Bogott)
[00:34:41] (PS5) Andrew Bogott: Octavia.conf: make heartbeat_key a secret [puppet] - https://gerrit.wikimedia.org/r/1146782 (https://phabricator.wikimedia.org/T393783)
[00:34:44] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1146782 (https://phabricator.wikimedia.org/T393783) (owner: Andrew Bogott)
[00:37:40] (CR) Andrew Bogott: [C:+2] Octavia.conf: make heartbeat_key a secret [puppet] - https://gerrit.wikimedia.org/r/1146782 (https://phabricator.wikimedia.org/T393783) (owner: Andrew Bogott)
[00:41:40] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[00:46:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:46:57] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:48:49] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5031
[00:49:03] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5031.*
[00:52:08] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[01:03:06] (PS1) Scott French: P:mw:maint:update_special_pages: remove absented non-sharded job [puppet] - https://gerrit.wikimedia.org/r/1146787 (https://phabricator.wikimedia.org/T388534)
[01:03:06] (CR) Scott French: "Thanks in advance for the review! I've already looked at these a bit when trying to understand their interaction with update-flaggedrev-st" [puppet] - https://gerrit.wikimedia.org/r/1146787 (https://phabricator.wikimedia.org/T388534) (owner: Scott French)
[01:03:11] (PS1) Scott French: P:mw:maint:update_special_pages: updateSpecialPages in s6 to mw-cron [puppet] - https://gerrit.wikimedia.org/r/1146788 (https://phabricator.wikimedia.org/T388534)
[01:04:33] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp3074.*
[01:06:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:06:33] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp3066.*
[01:16:22] !log Restarting tomcat10 on idp1004
[01:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:18:42] RESOLVED: JobUnavailable: Reduced availability for job jmx_idp in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:22:15] RESOLVED: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:27:26] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cassandra-dev2002.codfw.wmnet with OS bullseye
[01:27:37] SRE-OnFire, Cassandra, MediaWiki-Platform-Team (Radar), Security, Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10828547 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1002 for host cass...
[01:32:50] FIRING: PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:46:36] FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[01:56:36] RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[02:14:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[02:23:27] !log ryankemper@cumin2002 START - Cookbook sre.hosts.rename from elastic1076 to cirrussearch1076
[02:23:28] !log ryankemper@cumin2002 START - Cookbook sre.hosts.rename from elastic1077 to cirrussearch1077
[02:23:51] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox
[02:29:24] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox
[02:29:25] ryankemper@cumin2002 rename (PID 4068732) is awaiting input
[02:30:29] !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1076 to cirrussearch1076 - ryankemper@cumin2002"
[02:30:49] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1076 to cirrussearch1076 - ryankemper@cumin2002"
[02:30:50] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[02:30:50] !log ryankemper@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1076 on all recursors
[02:30:53] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1076 on all recursors
[02:30:54] !log ryankemper@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1076
[02:31:58] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[02:31:59] !log ryankemper@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1077 on all recursors
[02:32:02] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1077 on all recursors
[02:32:03] !log ryankemper@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1077
[02:32:11] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1076
[02:32:51] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1076 to cirrussearch1076
[02:34:01] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1077
[02:34:42] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1077 to cirrussearch1077
[02:37:02] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1076.eqiad.wmnet with OS bullseye
[02:37:06] !log ryankemper@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1076
[02:37:07] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1076
[02:43:57] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1077.eqiad.wmnet with OS bullseye
[02:44:02] !log ryankemper@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1077
[02:44:02] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1077
[02:46:40] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[02:51:32] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1076.eqiad.wmnet with reason: host reimage
[02:55:01] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1076.eqiad.wmnet with reason: host reimage
[02:58:11] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1077.eqiad.wmnet with reason: host reimage
[03:01:36] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1077.eqiad.wmnet with reason: host reimage
[03:20:51] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1076.eqiad.wmnet with OS bullseye
[03:27:11] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1077.eqiad.wmnet with OS bullseye
[03:49:00] !log ryankemper@cumin2002 START - Cookbook sre.hosts.rename from elastic1078 to cirrussearch1078
[03:49:25] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox
[03:55:01] ryankemper@cumin2002 rename (PID 4110877) is awaiting input
[03:58:51] !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1078 to cirrussearch1078 - ryankemper@cumin2002"
[04:01:57] ryankemper@cumin2002 rename (PID 4110877) is awaiting input
[04:01:57] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:02:23] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1078 to cirrussearch1078 - ryankemper@cumin2002"
[04:02:24] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[04:02:24] !log ryankemper@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1078 on all recursors
[04:02:27] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1078 on all recursors
[04:02:28] !log ryankemper@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1078
[04:05:35] ryankemper@cumin2002 rename (PID 4110877) is awaiting input
[04:10:18] !log ryankemper@cumin2002 START - Cookbook sre.hosts.rename from elastic1079 to cirrussearch1079
[04:10:31] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox
[04:10:34] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1078
[04:11:14] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1078 to cirrussearch1078
[04:11:41] FIRING: [111x] ConfdResourceFailed: confd resource _var_netmapper_public_clouds.json.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[04:16:06] ryankemper@cumin2002 rename (PID 4121019) is awaiting input
[04:26:59] !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1079 to cirrussearch1079 - ryankemper@cumin2002"
[04:29:37] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1079 to cirrussearch1079 - ryankemper@cumin2002"
[04:29:38] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[04:29:38] !log ryankemper@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1079 on all recursors
[04:29:41] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1079 on all recursors
[04:29:42] !log ryankemper@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1079
[04:31:05] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1079
[04:31:44] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1079 to cirrussearch1079
[04:31:57] FIRING: HelmReleaseBadStatus: Helm release miscweb/design-landing-page on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=miscweb - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[04:35:16] ryankemper@cumin2002 reimage (PID 4131236) is awaiting input
[04:42:09] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1078.eqiad.wmnet with OS bullseye
[04:42:12] !log ryankemper@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1078
[04:42:13] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1078
[04:46:57] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:49:49] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1079.eqiad.wmnet with OS bullseye
[04:49:53] !log ryankemper@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1079
[04:49:53] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1079
[04:52:08] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[04:55:08] (PS2) Clément Goubert: zarcillo: Fix ingress and egress [deployment-charts] - https://gerrit.wikimedia.org/r/1146602
[04:56:26] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1078.eqiad.wmnet with reason: host reimage
[04:59:00] (CR) Clément Goubert: [C:+2] "I'll be careful reactivating it, but I need to do the tests. Merging" [deployment-charts] - https://gerrit.wikimedia.org/r/1146627 (https://phabricator.wikimedia.org/T394019) (owner: Clément Goubert)
[05:00:26] (Merged) jenkins-bot: mw-cron: Suspend growthexperiments-listtaskcounts [deployment-charts] - https://gerrit.wikimedia.org/r/1146627 (https://phabricator.wikimedia.org/T394019) (owner: Clément Goubert)
[05:01:16] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[05:01:26] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[05:01:29] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1078.eqiad.wmnet with reason: host reimage
[05:04:14] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1079.eqiad.wmnet with reason: host reimage
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:07:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1046 es2044 T391921', diff saved to https://phabricator.wikimedia.org/P76228 and previous config saved to /var/cache/conftool/dbconfig/20250516-050707-marostegui.json
[05:07:11] T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921
[05:07:45] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet,es1046.eqiad.wmnet with reason: Maintenance
[05:07:53] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1079.eqiad.wmnet with reason: host reimage
[05:08:22] (PS1) Marostegui: es1046: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1146830 (https://phabricator.wikimedia.org/T391921)
[05:09:36] (CR) Marostegui: [C:+2] es1046: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1146830 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[05:11:44] (PS1) Marostegui: es2044: Migrate MariaDB to 10.11 [puppet] - https://gerrit.wikimedia.org/r/1146832 (https://phabricator.wikimedia.org/T391921)
[05:13:09] (CR) Marostegui: [C:+2] es2044: Migrate MariaDB to 10.11 [puppet] - https://gerrit.wikimedia.org/r/1146832 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[05:14:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76229 and previous config saved to /var/cache/conftool/dbconfig/20250516-051442-root.json
[05:17:39] (CR) Muehlenhoff: [C:+1] "The key has been validated via an out-of-band channel" [puppet] - https://gerrit.wikimedia.org/r/1146725 (owner: Greg Grossmeier)
[05:17:41] (CR) Muehlenhoff: [C:+2] admin: update gjg's production ssh key [puppet] - https://gerrit.wikimedia.org/r/1146725 (owner: Greg Grossmeier)
[05:25:29] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1078.eqiad.wmnet with OS bullseye
[05:26:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76230 and previous config saved to /var/cache/conftool/dbconfig/20250516-052625-root.json
[05:27:37] (PS1) Muehlenhoff: Remove access for oljad [puppet] - https://gerrit.wikimedia.org/r/1146835
[05:27:52] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1079.eqiad.wmnet with OS bullseye
[05:28:03] !log ryankemper@cumin2002 START - Cookbook sre.hosts.rename from elastic1085 to cirrussearch1085
[05:28:26] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox
[05:29:29] !log ryankemper@cumin2002 START - Cookbook sre.hosts.rename from elastic1086 to cirrussearch1086
[05:29:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76231 and previous config saved to /var/cache/conftool/dbconfig/20250516-052947-root.json
[05:32:36] (PS1) Marostegui: installserver: Remove db1258 [puppet] - https://gerrit.wikimedia.org/r/1146837 (https://phabricator.wikimedia.org/T393989)
[05:32:50] FIRING: PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:33:35] !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1085 to cirrussearch1085 - ryankemper@cumin2002"
[05:33:41] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox
[05:35:25] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1085 to cirrussearch1085 - ryankemper@cumin2002"
[05:35:25] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[05:35:25] !log ryankemper@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1085 on all recursors
[05:35:29] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1085 on all recursors
[05:35:30] !log ryankemper@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1085
[05:35:32] (CR) Marostegui: [C:+2] installserver: Remove db1258 [puppet] - https://gerrit.wikimedia.org/r/1146837 (https://phabricator.wikimedia.org/T393989) (owner: Marostegui)
[05:37:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:38:36] ryankemper@cumin2002 rename (PID 4160589) is awaiting input
[05:39:14] ryankemper@cumin2002 rename (PID 4161047) is awaiting input
[05:41:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76232 and previous config saved to /var/cache/conftool/dbconfig/20250516-054131-root.json
[05:44:25] (CR) Clément Goubert: [C:+1] P:mw:maint:update_special_pages: remove absented non-sharded job [puppet] - https://gerrit.wikimedia.org/r/1146787 (https://phabricator.wikimedia.org/T388534) (owner: Scott French)
[05:44:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76233 and previous config saved to /var/cache/conftool/dbconfig/20250516-054452-root.json
[05:44:55] (CR) Clément Goubert: [C:+1] P:mw:maint:update_special_pages: updateSpecialPages in s6 to mw-cron [puppet] - https://gerrit.wikimedia.org/r/1146788 (https://phabricator.wikimedia.org/T388534) (owner: Scott French)
[05:50:11] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1085
[05:50:51] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1085 to cirrussearch1085
[05:51:38] !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1086 to cirrussearch1086 - ryankemper@cumin2002"
[05:51:44] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1086 to cirrussearch1086 - ryankemper@cumin2002"
[05:51:44] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[05:51:45] !log ryankemper@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1086 on all recursors
[05:51:48] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1086 on all recursors
[05:51:49] !log ryankemper@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1086
[05:53:17] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1086
[05:53:56] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1086 to cirrussearch1086
[05:56:34] (CR) Muehlenhoff: [C:+1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/1145927 (owner: Slyngshede)
[05:56:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76234 and previous config saved to /var/cache/conftool/dbconfig/20250516-055637-root.json
[05:59:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76235 and previous config saved to /var/cache/conftool/dbconfig/20250516-055958-root.json
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250516T0600)
[06:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:02:07] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply
[06:02:45] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply
[06:06:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es1045 and es2045 to es5 masters T391921', diff saved to https://phabricator.wikimedia.org/P76236 and previous config saved to /var/cache/conftool/dbconfig/20250516-060652-marostegui.json
[06:06:59] T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921
[06:11:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76237 and previous config saved to /var/cache/conftool/dbconfig/20250516-061142-root.json
[06:14:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[06:15:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76238 and previous config saved to /var/cache/conftool/dbconfig/20250516-061503-root.json
[06:16:23] (CR) Slyngshede: [C:+2] SSHKey: Reimplement key suspension in Vue [software/bitu] - https://gerrit.wikimedia.org/r/1145927 (owner: Slyngshede)
[06:16:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2046 es1044 T391921', diff saved to https://phabricator.wikimedia.org/P76239 and previous config saved to /var/cache/conftool/dbconfig/20250516-061649-marostegui.json
[06:16:53] T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921
[06:17:11] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1146835 (owner: Muehlenhoff)
[06:17:24] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2046.codfw.wmnet,es1044.eqiad.wmnet with reason: Maintenance
[06:17:25] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:17:35] (CR) Slyngshede: [V:+2 C:+2] Login success: Avoid truncating attribute lists [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1146579 (owner: Slyngshede)
[06:17:57] (PS1) Marostegui: es1044: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1146853 (https://phabricator.wikimedia.org/T391921)
[06:18:33] !log installing Java 21 security updates on idp-test
[06:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:55] (Merged) jenkins-bot: SSHKey: Reimplement key suspension in Vue [software/bitu] - https://gerrit.wikimedia.org/r/1145927 (owner: Slyngshede)
[06:19:07] (CR) Marostegui: [C:+2] es1044: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1146853 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[06:20:56] (PS1) Marostegui: es2046: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1146860 (https://phabricator.wikimedia.org/T391921)
[06:21:37] (PS1) Slyngshede: Release: 7.1.6+wmf12u2 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1146862
[06:22:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76240 and previous config saved to /var/cache/conftool/dbconfig/20250516-062213-root.json
[06:22:18] (CR) Marostegui: [C:+2] es2046: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1146860 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[06:22:25] RESOLVED: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:24:25] FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:26:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76241 and previous config saved to /var/cache/conftool/dbconfig/20250516-062648-root.json
[06:27:50] !log uploaded openjdk-21 21.0.7+6-1~deb12u1 to component/jdk21 for bookworm (latest Java 21 security release)
[06:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76242 and previous config saved to /var/cache/conftool/dbconfig/20250516-062851-root.json
[06:30:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76243 and previous config saved to /var/cache/conftool/dbconfig/20250516-063009-root.json
[06:32:59] (PS2) Slyngshede: Release: 7.1.6+wmf12u2 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1146862
[06:35:10] (PS1) Marostegui: wmnet: Update es5-master [dns] - https://gerrit.wikimedia.org/r/1146866 (https://phabricator.wikimedia.org/T391921)
[06:35:20] (CR) Marostegui: "This is a NOOP" [dns] - https://gerrit.wikimedia.org/r/1146866 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[06:36:28] (CR) Marostegui: [C:+2] wmnet: Update es5-master [dns] - https://gerrit.wikimedia.org/r/1146866 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[06:36:31] !log marostegui@dns1006 START
- running authdns-update [06:36:54] (03CR) 10Muehlenhoff: [C:03+1] "LGTM, one typo inline" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1146862 (owner: 10Slyngshede) [06:37:09] !log marostegui@dns1006 END - running authdns-update [06:37:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76244 and previous config saved to /var/cache/conftool/dbconfig/20250516-063719-root.json [06:41:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76245 and previous config saved to /var/cache/conftool/dbconfig/20250516-064153-root.json [06:41:57] (03PS1) 10Marostegui: sections.yaml: Add pc8 [puppet] - 10https://gerrit.wikimedia.org/r/1146871 (https://phabricator.wikimedia.org/T394260) [06:42:12] (03CR) 10Marostegui: "The other conftool change was already merged at https://gerrit.wikimedia.org/r/c/operations/puppet/+/1146175" [puppet] - 10https://gerrit.wikimedia.org/r/1146871 (https://phabricator.wikimedia.org/T394260) (owner: 10Marostegui) [06:42:50] (03CR) 10Marostegui: "And: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1146176" [puppet] - 10https://gerrit.wikimedia.org/r/1146871 (https://phabricator.wikimedia.org/T394260) (owner: 10Marostegui) [06:43:56] (03PS3) 10Slyngshede: Release: 7.1.6+wmf12u2 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1146862 [06:43:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76246 and previous config saved to /var/cache/conftool/dbconfig/20250516-064356-root.json [06:44:31] (03CR) 10Slyngshede: [V:03+2 C:03+2] Release: 7.1.6+wmf12u2 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1146862 (owner: 10Slyngshede) [06:44:41] (03CR) 10Slyngshede: [V:03+2 C:03+2] Release: 7.1.6+wmf12u2 (031 comment) 
[software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1146862 (owner: 10Slyngshede) [06:46:40] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [06:52:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1044 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P76247 and previous config saved to /var/cache/conftool/dbconfig/20250516-065224-root.json [06:57:41] (03PS1) 10Clément Goubert: alertmanager: Add notifications-echo task creation route [puppet] - 10https://gerrit.wikimedia.org/r/1146874 (https://phabricator.wikimedia.org/T394471) [06:59:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2046 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P76248 and previous config saved to /var/cache/conftool/dbconfig/20250516-065901-root.json [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250516T0700) [07:02:15] (03PS1) 10Clément Goubert: mw::maintenance: migrate echo_mail_batch to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1146875 (https://phabricator.wikimedia.org/T394471) [07:07:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76249 and previous config saved to /var/cache/conftool/dbconfig/20250516-070730-root.json [07:14:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76250 and previous config saved to /var/cache/conftool/dbconfig/20250516-071406-root.json [07:20:34] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1177.eqiad.wmnet with OS bullseye [07:20:43] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10828756 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host an-worker... 
[07:21:36] FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [07:22:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1044 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P76251 and previous config saved to /var/cache/conftool/dbconfig/20250516-072235-root.json [07:26:28] (03CR) 10Muehlenhoff: [C:03+2] Remove access for oljad [puppet] - 10https://gerrit.wikimedia.org/r/1146835 (owner: 10Muehlenhoff) [07:29:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2046 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P76253 and previous config saved to /var/cache/conftool/dbconfig/20250516-072911-root.json [07:29:25] FIRING: [2x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:31:41] RESOLVED: [111x] ConfdResourceFailed: confd resource _var_netmapper_public_clouds.json.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [07:37:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76255 and previous config saved to /var/cache/conftool/dbconfig/20250516-073741-root.json [07:41:36] RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - 
https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [07:44:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76256 and previous config saved to /var/cache/conftool/dbconfig/20250516-074417-root.json [07:44:25] FIRING: [2x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:49:25] FIRING: [2x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:50:38] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on build2002.codfw.wmnet with reason: busy JDK build [07:52:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76257 and previous config saved to /var/cache/conftool/dbconfig/20250516-075246-root.json [07:55:47] (03PS1) 10Kevin Bazira: Add vLLM image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1146891 (https://phabricator.wikimedia.org/T385173) [07:59:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76258 and previous config saved to /var/cache/conftool/dbconfig/20250516-075923-root.json [08:01:30] (03CR) 10Stevemunene: [C:03+1] airflow: prevent resource name collisions when multiple releases are 
installed in the same namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145200 (https://phabricator.wikimedia.org/T393999) (owner: 10Brouberol) [08:01:57] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:02:36] FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [08:07:36] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging ODimitrijevic out of all services on: 945 hosts [08:07:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76259 and previous config saved to /var/cache/conftool/dbconfig/20250516-080752-root.json [08:08:53] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging ODimitrijevic out of all services on: 1426 hosts [08:14:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76260 and previous config saved to /var/cache/conftool/dbconfig/20250516-081428-root.json [08:18:45] (03PS1) 10Fabfur: external_cloud_vendors: discard 6to4 addresses [puppet] - 10https://gerrit.wikimedia.org/r/1146934 (https://phabricator.wikimedia.org/T394474) [08:21:01] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146934 (https://phabricator.wikimedia.org/T394474) (owner: 10Fabfur) [08:21:26] (03PS1) 10Aqu: airflow-analytics-test: 
Bump parallelism [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146935 (https://phabricator.wikimedia.org/T369845) [08:22:54] (03CR) 10Federico Ceratto: [C:03+1] "LGTM as discussed on IRC" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146602 (owner: 10Clément Goubert) [08:23:28] (03CR) 10Majavah: [V:03+1] "Tested on toolsbeta." [puppet] - 10https://gerrit.wikimedia.org/r/1146661 (https://phabricator.wikimedia.org/T394283) (owner: 10Majavah) [08:24:33] (03CR) 10Federico Ceratto: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146601 (owner: 10Clément Goubert) [08:25:07] (03CR) 10Clément Goubert: [C:03+2] python-webapp: Include base.networkpolicy.egress.mariadb [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146601 (owner: 10Clément Goubert) [08:25:18] (03CR) 10Clément Goubert: [C:03+2] zarcillo: Fix ingress and egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146602 (owner: 10Clément Goubert) [08:25:34] (03CR) 10Brouberol: [C:03+1] airflow-analytics-test: Bump parallelism [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146935 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [08:25:36] (03CR) 10Brouberol: [C:03+2] airflow-analytics-test: Bump parallelism [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146935 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [08:26:02] !log uploaded httpbb 0.0.5-1+deb12u1 to apt.wikimedia.org T393711 T389380 [08:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:07] T393711: httpbb bookworm support - https://phabricator.wikimedia.org/T393711 [08:26:07] T389380: Upgrade Cumin hosts to Bookworm - https://phabricator.wikimedia.org/T389380 [08:26:36] (03Merged) 10jenkins-bot: zarcillo: Fix ingress and egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146602 (owner: 10Clément Goubert) [08:27:00] !log cgoubert@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for 
release 'main' . [08:27:16] (03Merged) 10jenkins-bot: airflow-analytics-test: Bump parallelism [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146935 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [08:28:03] !log cgoubert@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [08:28:29] (03CR) 10FNegri: [C:03+1] "Very good idea. The pattern looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/1146661 (https://phabricator.wikimedia.org/T394283) (owner: 10Majavah) [08:29:13] (03CR) 10Majavah: [V:03+1 C:03+2] ssh: Do not shell out for root SSH keys [puppet] - 10https://gerrit.wikimedia.org/r/1146661 (https://phabricator.wikimedia.org/T394283) (owner: 10Majavah) [08:30:51] (03CR) 10Hashar: [C:03+2] "CheckUser failed to clone due to Gerrit returning a `502` (T394472)" [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1146777 (owner: 10TrainBranchBot) [08:31:57] FIRING: HelmReleaseBadStatus: Helm release miscweb/design-landing-page on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=miscweb - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [08:36:57] RESOLVED: HelmReleaseBadStatus: Helm release miscweb/design-landing-page on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=miscweb - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [08:38:46] !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@0b9e2aa]: Deploying artifacts for analytics_test manually [08:39:19] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@0b9e2aa]: Deploying artifacts for analytics_test manually 
(duration: 00m 51s) [08:39:20] (03CR) 10Vgutierrez: [C:03+1] external_cloud_vendors: discard 6to4 addresses [puppet] - 10https://gerrit.wikimedia.org/r/1146934 (https://phabricator.wikimedia.org/T394474) (owner: 10Fabfur) [08:41:41] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1146777 (owner: 10TrainBranchBot) [08:46:57] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:52:08] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [08:52:50] RESOLVED: PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:53:54] (03CR) 10Fabfur: [C:03+2] external_cloud_vendors: discard 6to4 addresses [puppet] - 10https://gerrit.wikimedia.org/r/1146934 (https://phabricator.wikimedia.org/T394474) (owner: 10Fabfur) [08:56:30] 07Puppet, 10Beta-Cluster-Infrastructure, 10CirrusSearch, 06Data-Platform-SRE, 06Discovery-Search: Puppet failing on deployment-cirrussearch{12,13,14}.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T393924#10828996 (10Gehel) p:05Triage→03Medium [08:58:20] !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@c2d660e]: Deploying artifacts for analytics_test manually [09:01:27] 07Puppet, 10Beta-Cluster-Infrastructure, 10CirrusSearch, 06Discovery-Search, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Puppet failing on deployment-cirrussearch{12,13,14}.deployment-prep.eqiad1.wikimedia.cloud - 
https://phabricator.wikimedia.org/T393924#10829022 (10Gehel) [09:02:36] RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [09:08:15] 06SRE, 10SRE-swift-storage: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352#10829054 (10MatthewVernon) [09:15:12] (03PS1) 10Clément Goubert: mw-cron: Unsuspend growthexperiments-listtaskcounts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146940 (https://phabricator.wikimedia.org/T394018) [09:16:36] FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [09:19:00] (03PS2) 10Muehlenhoff: Remove krb1001 from list of KDCs [puppet] - 10https://gerrit.wikimedia.org/r/1145884 (https://phabricator.wikimedia.org/T390863) [09:19:21] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145884 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff) [09:19:23] (03PS1) 10Cathal Mooney: IPv6 Sanitize-in: adjust 6to4 filters to use 'orlonger' [homer/public] - 10https://gerrit.wikimedia.org/r/1146941 (https://phabricator.wikimedia.org/T394474) [09:19:58] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@c2d660e]: Deploying artifacts for analytics_test manually (duration: 21m 38s) [09:22:27] (03CR) 10Cathal Mooney: [C:03+2] IPv6 Sanitize-in: adjust 6to4 
filters to use 'orlonger' [homer/public] - 10https://gerrit.wikimedia.org/r/1146941 (https://phabricator.wikimedia.org/T394474) (owner: 10Cathal Mooney) [09:22:58] (03Merged) 10jenkins-bot: IPv6 Sanitize-in: adjust 6to4 filters to use 'orlonger' [homer/public] - 10https://gerrit.wikimedia.org/r/1146941 (https://phabricator.wikimedia.org/T394474) (owner: 10Cathal Mooney) [09:24:12] !log btullis@deploy1003 Started deploy [airflow-dags/analytics_test@c2d660e]: Test [09:25:25] (03CR) 10Clément Goubert: [C:03+2] mw-cron: Unsuspend growthexperiments-listtaskcounts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146940 (https://phabricator.wikimedia.org/T394018) (owner: 10Clément Goubert) [09:26:36] RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [09:26:50] (03Merged) 10jenkins-bot: mw-cron: Unsuspend growthexperiments-listtaskcounts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146940 (https://phabricator.wikimedia.org/T394018) (owner: 10Clément Goubert) [09:27:09] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [09:27:22] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [09:35:31] (03PS1) 10Jcrespo: mariadb: Test 10.11 backups on db1239 (standby backup source) [puppet] - 10https://gerrit.wikimedia.org/r/1146942 (https://phabricator.wikimedia.org/T394371) [09:35:51] (03CR) 10Jcrespo: "FYI" [puppet] - 10https://gerrit.wikimedia.org/r/1146942 (https://phabricator.wikimedia.org/T394371) (owner: 10Jcrespo) [09:36:19] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146942 
(https://phabricator.wikimedia.org/T394371) (owner: 10Jcrespo) [09:36:36] FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [09:37:20] ^ working on fixing this atm [09:37:42] (03PS2) 10Jcrespo: mariadb: Test 10.11 backups on db1239 (standby backup source) [puppet] - 10https://gerrit.wikimedia.org/r/1146942 (https://phabricator.wikimedia.org/T394371) [09:38:31] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146942 (https://phabricator.wikimedia.org/T394371) (owner: 10Jcrespo) [09:41:58] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595#10829180 (10BWojtowicz-WMF) Thank you @BCornwal... [09:43:23] (03PS3) 10Jcrespo: mariadb: Test 10.11 backups on db1239 (standby backup source) [puppet] - 10https://gerrit.wikimedia.org/r/1146942 (https://phabricator.wikimedia.org/T394371) [09:44:17] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146942 (https://phabricator.wikimedia.org/T394371) (owner: 10Jcrespo) [09:44:25] FIRING: [2x] SystemdUnitFailed: docker.service on ml-lab1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:45:56] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10829184 (10MatthewVernon) Thanks for this! 
I was able to get in via `install_console`, and have a look. None of the hdds were available - I had to boot into BIOS and convert the... [09:46:36] RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [09:49:03] !log btullis@deploy1003 Finished deploy [airflow-dags/analytics_test@c2d660e]: Test (duration: 24m 55s) [09:49:25] RESOLVED: [2x] SystemdUnitFailed: docker.service on ml-lab1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:51:42] !log joal@deploy1003 Started deploy [airflow-dags/analytics_test@4351188]: Fix slf4j artifact sync [09:51:54] !log joal@deploy1003 Finished deploy [airflow-dags/analytics_test@4351188]: Fix slf4j artifact sync (duration: 00m 12s) [09:52:06] (03PS1) 10Kevin Bazira: Add vLLM image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1146891 (https://phabricator.wikimedia.org/T385173) [09:59:06] FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [10:01:45] (03CR) 10Elukey: "Left some comments to summarize what we discussed over IRC, I'll do another pass after more tests." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1146891 (https://phabricator.wikimedia.org/T385173) (owner: 10Kevin Bazira) [10:06:10] (03CR) 10Volans: [C:04-1] "Looks good, thanks! Just minor changes due to what icinga-status returns and it's ready." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848) (owner: 10Elukey) [10:07:47] (03PS3) 10Elukey: icinga: skip downtimed services in wait_for_optimal if needed [software/spicerack] - 10https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848) [10:08:15] (03CR) 10Elukey: icinga: skip downtimed services in wait_for_optimal if needed (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848) (owner: 10Elukey) [10:10:10] (03PS4) 10Elukey: icinga: skip downtimed services in wait_for_optimal if needed [software/spicerack] - 10https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848) [10:10:34] (03CR) 10Elukey: icinga: skip downtimed services in wait_for_optimal if needed (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848) (owner: 10Elukey) [10:12:29] (03CR) 10Elukey: icinga: skip downtimed services in wait_for_optimal if needed (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848) (owner: 10Elukey) [10:14:04] (03CR) 10Elukey: "Still WIP at this point, going to work on it later on!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848) (owner: 10Elukey) [10:14:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [10:16:04] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1239.eqiad.wmnet,ms-backup1002.eqiad.wmnet with reason: Upgrade and test [10:16:50] (03CR) 10Jcrespo: [C:03+2] "Assuming ok on IRC as a virtual +1 from manuel." [puppet] - 10https://gerrit.wikimedia.org/r/1146942 (https://phabricator.wikimedia.org/T394371) (owner: 10Jcrespo) [10:18:10] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10829274 (10MoritzMuehlenhoff) [10:21:05] (03CR) 10Clément Goubert: [C:04-1] "Blocked until we can guarantee full runs through `concurrencyPolicy: Forbid` + `startingDeadlineSeconds`" [puppet] - 10https://gerrit.wikimedia.org/r/1143529 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [10:22:00] !log upgrading db1239 MariaDB server T394487 [10:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:03] T394487: Migrate backup sources to MariaDB 10.11 - https://phabricator.wikimedia.org/T394487 [10:23:54] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489 (10MoritzMuehlenhoff) 03NEW [10:24:02] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10829303 (10MoritzMuehlenhoff) p:05Triage→03Medium [10:26:13] !log joal@deploy1003 Started deploy [airflow-dags/main@4351188]: Deploying main instead of analytics subfolder [10:28:04] !log joal@deploy1003 Finished deploy [airflow-dags/main@4351188]: Deploying main instead 
of analytics subfolder (duration: 01m 51s) [10:31:46] (03PS12) 10Stevemunene: airflow: cleanup deployment charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359) [10:38:58] (03CR) 10Ladsgroup: [C:03+1] sections.yaml: Add pc8 [puppet] - 10https://gerrit.wikimedia.org/r/1146871 (https://phabricator.wikimedia.org/T394260) (owner: 10Marostegui) [10:40:17] (03CR) 10Brouberol: [C:03+1] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359) (owner: 10Stevemunene) [10:41:22] (03CR) 10Btullis: [C:03+1] airflow: cleanup deployment charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359) (owner: 10Stevemunene) [10:41:43] (03CR) 10Ladsgroup: [C:03+1] "LGTM. Also https://integration.wikimedia.org/ci/job/operations-mw-config-php81-composer-diffConfig/636/console says that no config for any" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143638 (https://phabricator.wikimedia.org/T391103) (owner: 10Jsn.sherman) [10:43:46] !log joal@deploy1003 Started deploy [airflow-dags/analytics@4351188]: Deploying analytics with artifact-cache warming using main folder [10:44:36] !log joal@deploy1003 Finished deploy [airflow-dags/analytics@4351188]: Deploying analytics with artifact-cache warming using main folder (duration: 00m 49s) [10:45:15] (03PS1) 10Fabfur: external_cloud_vendors: fix missing None case [puppet] - 10https://gerrit.wikimedia.org/r/1146951 (https://phabricator.wikimedia.org/T394474) [10:46:40] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [10:47:49] (03CR) 10Giuseppe Lavagetto: [C:03+1] "LGTM but amend the commit 
message to save time for posterity." [puppet] - 10https://gerrit.wikimedia.org/r/1146951 (https://phabricator.wikimedia.org/T394474) (owner: 10Fabfur) [10:48:39] (03CR) 10Vgutierrez: [C:03+1] "LGTM (ditto for commit messsage)" [puppet] - 10https://gerrit.wikimedia.org/r/1146951 (https://phabricator.wikimedia.org/T394474) (owner: 10Fabfur) [10:50:08] (03CR) 10Dr0ptp4kt: [C:03+1] cache::haproxy: Drop incoming X-Experiment-Enrollments header [puppet] - 10https://gerrit.wikimedia.org/r/1143608 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [10:50:09] (03PS2) 10Fabfur: external_cloud_vendors: re-add None case from list of discarded nets [puppet] - 10https://gerrit.wikimedia.org/r/1146951 (https://phabricator.wikimedia.org/T394474) [10:50:49] (03CR) 10Vgutierrez: [C:03+2] cache::haproxy: Drop incoming X-Experiment-Enrollments header [puppet] - 10https://gerrit.wikimedia.org/r/1143608 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [10:54:06] RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [10:55:15] (03PS3) 10Fabfur: external_cloud_vendors: re-add None case from list of discarded nets [puppet] - 10https://gerrit.wikimedia.org/r/1146951 (https://phabricator.wikimedia.org/T394474) [10:56:00] (03CR) 10Fabfur: external_cloud_vendors: re-add None case from list of discarded nets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1146951 (https://phabricator.wikimedia.org/T394474) (owner: 10Fabfur) [10:56:19] (03PS13) 10Stevemunene: airflow: cleanup deployment charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359) 
[10:56:35] (03CR) 10Fabfur: [C:03+2] external_cloud_vendors: re-add None case from list of discarded nets [puppet] - 10https://gerrit.wikimedia.org/r/1146951 (https://phabricator.wikimedia.org/T394474) (owner: 10Fabfur) [10:59:01] (03PS1) 10Arturo Borrero Gonzalez: notify_maintainers: ignore toolsbeta-tofu [puppet] - 10https://gerrit.wikimedia.org/r/1146952 (https://phabricator.wikimedia.org/T394453) [11:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250516T0700) [11:00:04] jelto, arnoldokoth, and mutante: That opportune time for a GitLab version upgrades deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250516T1100). [11:04:11] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10829442 (10FCeratto-WMF) 05Open→03In progress p:05Triage→03High a:05VRiley-WMF→03FCeratto-WMF [11:08:13] (03PS1) 10Ladsgroup: tables-catalog: Add existencelinks table [puppet] - 10https://gerrit.wikimedia.org/r/1146954 (https://phabricator.wikimedia.org/T14019) [11:10:36] FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [11:19:20] (03PS3) 10Hnowlan: sre:api-gateway: only alert for core API services [alerts] - 10https://gerrit.wikimedia.org/r/1146668 [11:19:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Remove db1214 from x3, remove db1257 from s8 (T351820)', diff saved to https://phabricator.wikimedia.org/P76261 and previous config saved to /var/cache/conftool/dbconfig/20250516-111952-ladsgroup.json [11:19:56] 
T351820: Move Wikidata term store to separate database cluster - https://phabricator.wikimedia.org/T351820 [11:20:41] (03PS1) 10MVernon: autoinstall: setup for new apus nodes with boss card [puppet] - 10https://gerrit.wikimedia.org/r/1146957 (https://phabricator.wikimedia.org/T392844) [11:20:56] (03CR) 10CI reject: [V:04-1] sre:api-gateway: only alert for core API services [alerts] - 10https://gerrit.wikimedia.org/r/1146668 (owner: 10Hnowlan) [11:21:27] (03PS2) 10MVernon: autoinstall: setup for new apus nodes with boss card [puppet] - 10https://gerrit.wikimedia.org/r/1146957 (https://phabricator.wikimedia.org/T392844) [11:23:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Remove db2242 from x3, remove db2154 from s8 (T351820)', diff saved to https://phabricator.wikimedia.org/P76262 and previous config saved to /var/cache/conftool/dbconfig/20250516-112345-ladsgroup.json [11:25:36] RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [11:29:34] (03PS4) 10Hnowlan: sre:api-gateway: only alert for core API services [alerts] - 10https://gerrit.wikimedia.org/r/1146668 [11:30:26] (03PS2) 10Kevin Bazira: Add vLLM image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1146891 (https://phabricator.wikimedia.org/T385173) [11:31:17] (03CR) 10Kevin Bazira: Add vLLM image (039 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1146891 (https://phabricator.wikimedia.org/T385173) (owner: 10Kevin Bazira) [11:38:36] FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - 
https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [11:42:00] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db1188.eqiad.wmnet onto db1246.eqiad.wmnet [11:42:01] (03PS1) 10Muehlenhoff: sshd: Remove dead template argument [puppet] - 10https://gerrit.wikimedia.org/r/1146968 (https://phabricator.wikimedia.org/T393762) [11:42:03] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db1188 - Depool db1188.eqiad.wmnet to then clone it to db1246.eqiad.wmnet - fceratto@cumin1002 [11:42:09] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10829547 (10ops-monitoring-bot) Started cloning db1188.eqiad.wmnet to db1246.eqiad.wmnet - fceratto@cumin1002 [11:42:20] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1188 - Depool db1188.eqiad.wmnet to then clone it to db1246.eqiad.wmnet - fceratto@cumin1002 [11:42:26] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10829550 (10ops-monitoring-bot) Completed depool of db1188 - Depool db1188.eqiad.wmnet to then clone it to db1246.eqiad.wmnet - fceratto@cumin1002 - fceratto@cumin1002 [11:45:04] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Dell SSD Critical Firmware Update - https://phabricator.wikimedia.org/T394348#10829555 (10BTullis) Just wondering @RobH, why do an-coord100[3-4] say //(decommissioned)// ? 
[11:46:22] (03PS1) 10Máté Szabó: Update IPInfo access levels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146969 (https://phabricator.wikimedia.org/T375086) [11:46:58] fceratto@cumin1002 clone (PID 2505642) is awaiting input [11:48:33] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146968 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [11:49:14] 10ops-eqiad, 06DC-Ops: SSD firmware update for an-mariadb100[1-2] - https://phabricator.wikimedia.org/T394498 (10BTullis) 03NEW [11:49:25] FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:49:42] (03CR) 10Kamila Součková: [C:03+1] sre:api-gateway: only alert for core API services [alerts] - 10https://gerrit.wikimedia.org/r/1146668 (owner: 10Hnowlan) [11:52:41] 10ops-eqiad, 06DC-Ops: SSD firmware update for an-coord100[3-4] - https://phabricator.wikimedia.org/T394499 (10BTullis) 03NEW [11:53:29] 10ops-eqiad, 06DC-Ops: SSD firmware update for an-mariadb100[1-2] - https://phabricator.wikimedia.org/T394498#10829612 (10BTullis) a:03RobH [11:54:17] (03CR) 10Kamila Součková: [C:03+2] mw::maintenance: migrate growthexperiments-updateIsActiveFlagForMentees [puppet] - 10https://gerrit.wikimedia.org/r/1146566 (https://phabricator.wikimedia.org/T385782) (owner: 10Kamila Součková) [11:54:25] FIRING: [2x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:55:03] (03CR) 10Hnowlan: [C:03+2] sre:api-gateway: only alert for core API services [alerts] - 10https://gerrit.wikimedia.org/r/1146668 (owner: 10Hnowlan) [11:56:17] (03Merged) 10jenkins-bot: 
sre:api-gateway: only alert for core API services [alerts] - 10https://gerrit.wikimedia.org/r/1146668 (owner: 10Hnowlan) [11:57:22] (03PS1) 10Fabfur: haproxy: use maxmind lua bindings to lookup client ISP [puppet] - 10https://gerrit.wikimedia.org/r/1146970 (https://phabricator.wikimedia.org/T392219) [11:57:27] (03CR) 10Btullis: [C:03+1] airflow: prevent resource name collisions when multiple releases are installed in the same namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145200 (https://phabricator.wikimedia.org/T393999) (owner: 10Brouberol) [11:58:26] (03CR) 10Btullis: [C:03+1] airflow: use the devenv.db.name in the PG URI instead of /app (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146670 (https://phabricator.wikimedia.org/T393999) (owner: 10Brouberol) [11:58:37] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10829620 (10FCeratto-WMF) The host is not pinging, responding to ssh nor sending host-level metrics: https://grafana-rw.wikimedia.org/d/000000377/host-overview?orgId=1&from=2025-05-07T10%3A00%3A09.186Z... 
[11:59:21] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db1188.eqiad.wmnet onto db1246.eqiad.wmnet [12:00:13] (03CR) 10Btullis: [C:03+1] airflow: rely on krenew instead of 'airflow kerberos' to renew the kerberos ticket [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146671 (https://phabricator.wikimedia.org/T393999) (owner: 10Brouberol) [12:01:13] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146970 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [12:01:15] (03CR) 10Btullis: [C:03+1] airflow: define an airflow-dev values file, containing the devenv default values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146672 (https://phabricator.wikimedia.org/T393999) (owner: 10Brouberol) [12:01:16] !log kamila@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [12:01:31] (03CR) 10Btullis: [C:03+1] airflow: don't define OAUTH-related configs in devenvs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146673 (https://phabricator.wikimedia.org/T393999) (owner: 10Brouberol) [12:01:47] !log kamila@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [12:01:57] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:02:01] (03CR) 10Btullis: [C:03+1] airflow: include an ENVOY_SERVICE_NAME env var pointing to the envoy service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146693 (https://phabricator.wikimedia.org/T393999) (owner: 10Brouberol) [12:02:05] (03CR) 10Jcrespo: [C:03+1] ""I think this will not break production"" [puppet] - 10https://gerrit.wikimedia.org/r/1146957 (https://phabricator.wikimedia.org/T392844) (owner: 10MVernon) [12:02:09] (03CR) 10Btullis: [C:03+1] airflow: Bump chart version 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1146674 (https://phabricator.wikimedia.org/T393999) (owner: 10Brouberol) [12:28:16] !log mvernon@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host apus-be1004.eqiad.wmnet with OS bookworm [12:28:27] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10829732 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1002 for host apus-be1004.eqiad.wmnet with OS bookworm executed... [12:32:16] !log mvernon@cumin1002 START - Cookbook sre.hosts.reimage for host apus-be1004.eqiad.wmnet with OS bookworm [12:32:24] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10829748 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1002 for host apus-be1004.eqiad.wmnet with OS bookworm [12:33:52] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1188 gradually with 4 steps - Pooling back in [12:35:15] (03CR) 10Kosta Harlan: Update IPInfo access levels (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146969 (https://phabricator.wikimedia.org/T375086) (owner: 10Máté Szabó) [12:35:38] !log fceratto@cumin1002 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) db1188 gradually with 4 steps - Pooling back in [12:36:28] (03CR) 10Kosta Harlan: Update IPInfo access levels (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146969 (https://phabricator.wikimedia.org/T375086) (owner: 10Máté Szabó) [12:40:30] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10829759 (10Marostegui) The host crashed yesterday with the same error as always: ` ------------------------------------------------------------------------------- Record: 2 Date/Time: 05/15/202... 
[12:40:43] (03PS1) 10Hashar: wm-zuul-status: do not popup when navigating changes [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1146976 (https://phabricator.wikimedia.org/T394485) [12:41:52] (03CR) 10Kosta Harlan: Update IPInfo access levels (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146969 (https://phabricator.wikimedia.org/T375086) (owner: 10Máté Szabó) [12:42:23] !log aqu@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [12:43:18] !log aqu@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [12:43:58] (03CR) 10Marostegui: [C:03+2] sections.yaml: Add pc8 [puppet] - 10https://gerrit.wikimedia.org/r/1146871 (https://phabricator.wikimedia.org/T394260) (owner: 10Marostegui) [12:45:18] (03PS1) 10MVernon: boss_leavelvm: specify boot device [puppet] - 10https://gerrit.wikimedia.org/r/1146977 (https://phabricator.wikimedia.org/T392844) [12:46:47] !log mvernon@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host apus-be1004.eqiad.wmnet with OS bookworm [12:46:58] (03CR) 10Jcrespo: [C:03+1] boss_leavelvm: specify boot device [puppet] - 10https://gerrit.wikimedia.org/r/1146977 (https://phabricator.wikimedia.org/T392844) (owner: 10MVernon) [12:47:00] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10829772 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1002 for host apus-be1004.eqiad.wmnet with OS bookworm executed... 
[12:47:02] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:47:19] (03CR) 10MVernon: [C:03+2] boss_leavelvm: specify boot device [puppet] - 10https://gerrit.wikimedia.org/r/1146977 (https://phabricator.wikimedia.org/T392844) (owner: 10MVernon) [12:48:23] (03CR) 10Vgutierrez: [C:04-2] "do not merge till https://phabricator.wikimedia.org/T394437 is done" [puppet] - 10https://gerrit.wikimedia.org/r/1143483 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [12:49:47] (03CR) 10Stevemunene: [C:03+2] airflow: cleanup deployment charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359) (owner: 10Stevemunene) [12:50:55] (03PS1) 10Muehlenhoff: Deprecate AQS-related groups [puppet] - 10https://gerrit.wikimedia.org/r/1146978 [12:51:48] (03Merged) 10jenkins-bot: airflow: cleanup deployment charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359) (owner: 10Stevemunene) [12:52:08] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:52:13] !log mvernon@cumin1002 START - Cookbook sre.hosts.reimage for host apus-be1004.eqiad.wmnet with OS bookworm [12:52:28] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10829778 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1002 for host apus-be1004.eqiad.wmnet with OS bookworm [12:53:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.11% idle - 
https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:57:04] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146978 (owner: 10Muehlenhoff) [12:57:25] FIRING: SystemdUnitFailed: git_pull_charts.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:58:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.11% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:59:40] (03CR) 10Máté Szabó: Update IPInfo access levels (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146969 (https://phabricator.wikimedia.org/T375086) (owner: 10Máté Szabó) [13:00:28] !log joal@deploy1003 Started deploy [airflow-dags/analytics@4351188]: Fix gobblin artifacts [13:00:36] !log joal@deploy1003 Finished deploy [airflow-dags/analytics@4351188]: Fix gobblin artifacts (duration: 00m 07s) [13:01:54] !log joal@deploy1003 Started deploy [airflow-dags/analytics_test@4ebb376]: Fix gobblin artifacts [13:02:11] !log joal@deploy1003 Finished deploy [airflow-dags/analytics_test@4ebb376]: Fix gobblin artifacts (duration: 00m 16s) [13:02:30] !log joal@deploy1003 Started deploy [airflow-dags/analytics@4ebb376]: Fix gobblin artifacts (after pulling code...) 
[13:03:32] !log joal@deploy1003 Finished deploy [airflow-dags/analytics@4ebb376]: Fix gobblin artifacts (after pulling code...) (duration: 01m 01s) [13:05:10] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1188 gradually with 4 steps - Pooling back in [13:05:21] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10829841 (10ops-monitoring-bot) Start pool of db1188 gradually with 4 steps - Pooling back in - fceratto@cumin1002 [13:06:04] (03PS1) 10ZhaoFJx: Add zh, en, and meta to zh_arbcom import sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146985 (https://phabricator.wikimedia.org/T394505) [13:08:22] (03PS2) 10Muehlenhoff: Deprecate AQS-related groups [puppet] - 10https://gerrit.wikimedia.org/r/1146978 [13:08:52] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146978 (owner: 10Muehlenhoff) [13:10:18] (03CR) 10Kosta Harlan: Update IPInfo access levels (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146969 (https://phabricator.wikimedia.org/T375086) (owner: 10Máté Szabó) [13:13:04] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1177.eqiad.wmnet with OS bullseye [13:13:14] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10829850 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1... 
[13:13:26] (03PS1) 10MVernon: site.pp: cephadm::storage role for apus be nodes [puppet] - 10https://gerrit.wikimedia.org/r/1146986 (https://phabricator.wikimedia.org/T392844) [13:14:40] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1146986 (https://phabricator.wikimedia.org/T392844) (owner: 10MVernon) [13:14:55] (03CR) 10MVernon: [C:03+2] site.pp: cephadm::storage role for apus be nodes [puppet] - 10https://gerrit.wikimedia.org/r/1146986 (https://phabricator.wikimedia.org/T392844) (owner: 10MVernon) [13:17:49] !log mvernon@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on apus-be1004.eqiad.wmnet with reason: host reimage [13:19:08] (03CR) 10Slyngshede: [C:03+1] "Nice." [puppet] - 10https://gerrit.wikimedia.org/r/1146978 (owner: 10Muehlenhoff) [13:19:17] (03Abandoned) 10Máté Szabó: Unify IPInfo access levels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081370 (https://phabricator.wikimedia.org/T375086) (owner: 10Máté Szabó) [13:19:52] (03CR) 10Hashar: [C:03+2] "Verified with the browser debugger." 
[software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1146976 (https://phabricator.wikimedia.org/T394485) (owner: 10Hashar) [13:20:25] (03Merged) 10jenkins-bot: wm-zuul-status: do not popup when navigating changes [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1146976 (https://phabricator.wikimedia.org/T394485) (owner: 10Hashar) [13:21:43] !log hashar@deploy1003 Started deploy [gerrit/gerrit@fcb893c]: wm-zuul-status: do not popup when navigating changes - T394485 [13:21:45] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on apus-be1004.eqiad.wmnet with reason: host reimage [13:21:47] T394485: Incorrect "CI has completed checks" popup appears when navigating from a change with tests in progress to one with no tests in progress - https://phabricator.wikimedia.org/T394485 [13:21:55] !log hashar@deploy1003 Finished deploy [gerrit/gerrit@fcb893c]: wm-zuul-status: do not popup when navigating changes - T394485 (duration: 00m 12s) [13:22:31] (03PS1) 10Muehlenhoff: sshd: Remove ineffective configuration "Protocol" directive [puppet] - 10https://gerrit.wikimedia.org/r/1146989 (https://phabricator.wikimedia.org/T393762) [13:28:23] jclark@cumin1002 reimage (PID 2599410) is awaiting input [13:30:12] (03PS1) 10Effie Mouzeli: WIP: create allow-hostpath-mediawiki policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146992 [13:31:25] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146989 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [13:31:44] (03PS9) 10Brouberol: airflow: prevent resource name collisions when multiple releases are installed in the same namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145200 (https://phabricator.wikimedia.org/T393999) [13:31:44] (03PS2) 10Brouberol: airflow: use the devenv.db.name in the PG URI instead of /app [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1146670 (https://phabricator.wikimedia.org/T393999) [13:31:44] (03PS2) 10Brouberol: airflow: rely on krenew instead of 'airflow kerberos' to renew the kerberos ticket [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146671 (https://phabricator.wikimedia.org/T393999) [13:31:44] (03PS2) 10Brouberol: airflow: define an airflow-dev values file, containing the devenv default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146672 (https://phabricator.wikimedia.org/T393999) [13:31:45] (03PS2) 10Brouberol: airflow: don't define OAUTH-related configs in devenvs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146673 (https://phabricator.wikimedia.org/T393999) [13:31:46] (03PS2) 10Brouberol: airflow: include an ENVOY_SERVICE_NAME env var pointing to the envoy service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146693 (https://phabricator.wikimedia.org/T393999) [13:31:51] (03PS3) 10Brouberol: airflow: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146674 (https://phabricator.wikimedia.org/T393999) [13:32:43] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10829944 (10Papaul) @FCeratto-WMF @Marostegui yes I will talk with @wiki_willy on getting a replacement. 
Thank you [13:34:19] (03PS10) 10Brouberol: airflow: prevent resource name collisions when multiple releases are installed in the same namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145200 (https://phabricator.wikimedia.org/T393999) [13:34:19] (03PS3) 10Brouberol: airflow: use the devenv.db.name in the PG URI instead of /app [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146670 (https://phabricator.wikimedia.org/T393999) [13:34:19] (03PS3) 10Brouberol: airflow: rely on krenew instead of 'airflow kerberos' to renew the kerberos ticket [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146671 (https://phabricator.wikimedia.org/T393999) [13:34:20] (03PS3) 10Brouberol: airflow: define an airflow-dev values file, containing the devenv default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146672 (https://phabricator.wikimedia.org/T393999) [13:34:21] (03PS3) 10Brouberol: airflow: don't define OAUTH-related configs in devenvs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146673 (https://phabricator.wikimedia.org/T393999) [13:34:22] (03PS3) 10Brouberol: airflow: include an ENVOY_SERVICE_NAME env var pointing to the envoy service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146693 (https://phabricator.wikimedia.org/T393999) [13:34:26] (03PS1) 10Brouberol: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146993 (https://phabricator.wikimedia.org/T393999) [13:34:34] (03CR) 10CI reject: [V:04-1] airflow: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146674 (https://phabricator.wikimedia.org/T393999) (owner: 10Brouberol) [13:34:47] (03PS4) 10Brouberol: airflow: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146674 (https://phabricator.wikimedia.org/T393999) [13:35:30] (03Abandoned) 10Brouberol: airflow: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146674 (https://phabricator.wikimedia.org/T393999) 
(owner: 10Brouberol) [13:35:36] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1177.eqiad.wmnet with OS bullseye [13:35:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10829972 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1177.... [13:36:50] (03CR) 10Brouberol: [C:03+2] airflow: prevent resource name collisions when multiple releases are installed in the same namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145200 (https://phabricator.wikimedia.org/T393999) (owner: 10Brouberol) [13:37:16] (03CR) 10Brouberol: [C:03+2] airflow: use the devenv.db.name in the PG URI instead of /app [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146670 (https://phabricator.wikimedia.org/T393999) (owner: 10Brouberol) [13:37:20] (03PS2) 10Klausman: aptrepo: Import AMD ROCm 6.3 packages [puppet] - 10https://gerrit.wikimedia.org/r/1146991 (https://phabricator.wikimedia.org/T385173) [13:37:20] (03CR) 10Brouberol: [C:03+2] airflow: rely on krenew instead of 'airflow kerberos' to renew the kerberos ticket [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146671 (https://phabricator.wikimedia.org/T393999) (owner: 10Brouberol) [13:37:23] (03CR) 10Brouberol: [C:03+2] airflow: define an airflow-dev values file, containing the devenv default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146672 (https://phabricator.wikimedia.org/T393999) (owner: 10Brouberol) [13:37:28] (03CR) 10Brouberol: [C:03+2] airflow: don't define OAUTH-related configs in devenvs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146673 (https://phabricator.wikimedia.org/T393999) (owner: 10Brouberol) [13:37:31] (03CR) 10Brouberol: [C:03+2] airflow: include an ENVOY_SERVICE_NAME env var pointing to the envoy service 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1146693 (https://phabricator.wikimedia.org/T393999) (owner: 10Brouberol) [13:37:34] (03CR) 10Brouberol: [C:03+2] Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146993 (https://phabricator.wikimedia.org/T393999) (owner: 10Brouberol) [13:38:46] (03Merged) 10jenkins-bot: airflow: prevent resource name collisions when multiple releases are installed in the same namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145200 (https://phabricator.wikimedia.org/T393999) (owner: 10Brouberol) [13:39:17] (03Merged) 10jenkins-bot: airflow: use the devenv.db.name in the PG URI instead of /app [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146670 (https://phabricator.wikimedia.org/T393999) (owner: 10Brouberol) [13:39:24] (03PS5) 10Ilias Sarantopoulos: ores-extension: enable ores extention for rrla without the UI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144526 (https://phabricator.wikimedia.org/T382171) [13:39:31] (03PS6) 10Ilias Sarantopoulos: ores-extension: enable ores extention for rrla without the UI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144526 (https://phabricator.wikimedia.org/T382171) [13:39:32] (03Merged) 10jenkins-bot: airflow: rely on krenew instead of 'airflow kerberos' to renew the kerberos ticket [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146671 (https://phabricator.wikimedia.org/T393999) (owner: 10Brouberol) [13:39:34] (03Merged) 10jenkins-bot: airflow: define an airflow-dev values file, containing the devenv default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146672 (https://phabricator.wikimedia.org/T393999) (owner: 10Brouberol) [13:39:35] (03Merged) 10jenkins-bot: airflow: don't define OAUTH-related configs in devenvs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146673 (https://phabricator.wikimedia.org/T393999) (owner: 10Brouberol) [13:39:36] (03Merged) 10jenkins-bot: airflow: include an 
ENVOY_SERVICE_NAME env var pointing to the envoy service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146693 (https://phabricator.wikimedia.org/T393999) (owner: 10Brouberol) [13:39:55] (03Merged) 10jenkins-bot: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146993 (https://phabricator.wikimedia.org/T393999) (owner: 10Brouberol) [13:40:30] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1177.eqiad.wmnet with OS bullseye [13:40:41] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10829997 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1... [13:40:42] (03CR) 10Ilias Sarantopoulos: "We'd first need to create the tables before we backport this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144526 (https://phabricator.wikimedia.org/T382171) (owner: 10Ilias Sarantopoulos) [13:45:09] (03CR) 10JHathaway: [C:03+1] sshd: Remove dead template argument [puppet] - 10https://gerrit.wikimedia.org/r/1146968 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [13:47:21] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db1238.eqiad.wmnet onto db1247.eqiad.wmnet [13:47:24] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db1238 - Depool db1238.eqiad.wmnet to then clone it to db1247.eqiad.wmnet - fceratto@cumin1002 [13:47:52] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1238 - Depool db1238.eqiad.wmnet to then clone it to db1247.eqiad.wmnet - fceratto@cumin1002 [13:47:57] (03CR) 10Elukey: [C:03+1] aptrepo: Import AMD ROCm 6.3 packages [puppet] - 10https://gerrit.wikimedia.org/r/1146991 (https://phabricator.wikimedia.org/T385173) (owner: 10Klausman) [13:48:20] (03CR) 10Klausman: [C:03+2] aptrepo: Import AMD ROCm 6.3 packages [puppet] - 
10https://gerrit.wikimedia.org/r/1146991 (https://phabricator.wikimedia.org/T385173) (owner: 10Klausman) [13:49:09] (03CR) 10Federico Ceratto: hiera: Add zarcillo k8s service on traffic server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135387 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [13:50:08] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db1238.eqiad.wmnet onto db1247.eqiad.wmnet [13:50:35] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1188 gradually with 4 steps - Pooling back in [13:50:41] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10830016 (10ops-monitoring-bot) Completed pool of db1188 gradually with 4 steps - Pooling back in - fceratto@cumin1002 [13:51:10] (03CR) 10Eevans: [C:03+1] Deprecate AQS-related groups [puppet] - 10https://gerrit.wikimedia.org/r/1146978 (owner: 10Muehlenhoff) [13:52:52] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db1238.eqiad.wmnet onto db1247.eqiad.wmnet [13:54:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Remove db2166 and db1177 from x3 (T351820)', diff saved to https://phabricator.wikimedia.org/P76270 and previous config saved to /var/cache/conftool/dbconfig/20250516-135438-ladsgroup.json [13:54:42] T351820: Move Wikidata term store to separate database cluster - https://phabricator.wikimedia.org/T351820 [13:55:41] (03Abandoned) 10Vgutierrez: hiera: Split ATS cache on X-Experiment-Enrollments [puppet] - 10https://gerrit.wikimedia.org/r/1143603 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [13:57:30] jclark@cumin1002 reimage (PID 2628940) is awaiting input [14:06:41] (03PS1) 10Majavah: systemd: Do not try to validate overrides [puppet] - 10https://gerrit.wikimedia.org/r/1146998 [14:08:41] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1177.eqiad.wmnet with OS bullseye 
[14:08:50] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10830090 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1177.... [14:09:01] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1177.eqiad.wmnet with OS bullseye [14:09:07] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10830091 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1... [14:10:05] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5578/console" [puppet] - 10https://gerrit.wikimedia.org/r/1146998 (owner: 10Majavah) [14:11:25] !log root@cumin1002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-backup1002.eqiad.wmnet: Renew puppet certificate - root@cumin1002 [14:11:36] (03PS1) 10JHathaway: Revert "systemd: validate units" [puppet] - 10https://gerrit.wikimedia.org/r/1147001 [14:14:23] (03CR) 10MVernon: [C:03+1] "Thanks for fixing this, and apologies for the hassle!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1147001 (owner: 10JHathaway) [14:14:33] (03CR) 10JHathaway: [C:03+2] Revert "systemd: validate units" [puppet] - 10https://gerrit.wikimedia.org/r/1147001 (owner: 10JHathaway) [14:14:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [14:22:12] !log mvernon@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" [14:22:35] !log mvernon@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" [14:22:36] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host apus-be1004.eqiad.wmnet with OS bookworm [14:22:41] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10830123 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1002 for host apus-be1004.eqiad.wmnet with OS bookworm completed: - apus-be1004 (**PA... [14:23:53] 06SRE, 06Infrastructure-Foundations, 06serviceops: Clean up the Docker Registry catalog and Swift storage from old images - https://phabricator.wikimedia.org/T375645#10830133 (10elukey) >>! In T375645#10194826, @elukey wrote: > It failed with: > > ` > failed to garbage collect: failed to mark: swift: sw... [14:24:22] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10830134 (10MatthewVernon) @Jclark-ctr system imaged OK now. I don't know if you have more you want to do before closing this ticket out? 
[14:25:00] 06SRE, 10SRE-swift-storage, 10Ceph: Q4 object storage hardware tasks - https://phabricator.wikimedia.org/T391354#10830135 (10MatthewVernon) [14:25:10] jclark@cumin1002 reimage (PID 2658597) is awaiting input [14:29:47] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be2004 - https://phabricator.wikimedia.org/T392845#10830167 (10MatthewVernon) @Jhancock.wm we had some fun with the eqiad equivalent system, but did get it properly installed (and the preseed done such that this system should als... [14:31:55] (03CR) 10Hnowlan: [C:03+1] mw::maintenance: migrate echo_mail_batch to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1146875 (https://phabricator.wikimedia.org/T394471) (owner: 10Clément Goubert) [14:32:12] (03CR) 10Hnowlan: [C:03+1] alertmanager: Add notifications-echo task creation route [puppet] - 10https://gerrit.wikimedia.org/r/1146874 (https://phabricator.wikimedia.org/T394471) (owner: 10Clément Goubert) [14:32:14] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10830184 (10Jclark-ctr) @MatthewVernon since i had to disable TLS Unsure if thats something @Volans should look at This is the first Boss Card That i am aware of. Did you... [14:35:26] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10830188 (10MatthewVernon) @Jclark-ctr the only BIOS/iDRAC changes I made were to set all the hdds to be non-RAID. Everything else was fixing puppet/preseed configs. 
[14:35:51] (03CR) 10JHathaway: [C:03+1] sshd: Remove ineffective configuration "Protocol" directive [puppet] - 10https://gerrit.wikimedia.org/r/1146989 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [14:35:59] (03PS1) 10Cwhite: mx: stop relaying postfix logs [puppet] - 10https://gerrit.wikimedia.org/r/1147009 (https://phabricator.wikimedia.org/T394514) [14:37:38] (03CR) 10Elukey: run_ci_locally.sh: use bind mounts for local runs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1142675 (owner: 10JHathaway) [14:38:59] (03CR) 10Elukey: [C:03+1] "I have to admit that I don't have the full picture of Rakefile, but the compromise that it proposed in this CR seems worth to test more br" [puppet] - 10https://gerrit.wikimedia.org/r/1142675 (owner: 10JHathaway) [14:41:31] (03CR) 10Ladsgroup: [C:03+1] mx: stop relaying postfix logs [puppet] - 10https://gerrit.wikimedia.org/r/1147009 (https://phabricator.wikimedia.org/T394514) (owner: 10Cwhite) [14:42:17] (03CR) 10JHathaway: [C:03+1] mx: stop relaying postfix logs [puppet] - 10https://gerrit.wikimedia.org/r/1147009 (https://phabricator.wikimedia.org/T394514) (owner: 10Cwhite) [14:44:01] (03CR) 10Bunnypranav: [C:03+1] Add zh, en, and meta to zh_arbcom import sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146985 (https://phabricator.wikimedia.org/T394505) (owner: 10ZhaoFJx) [14:44:03] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply [14:44:51] (03CR) 10Cwhite: [C:03+2] mx: stop relaying postfix logs [puppet] - 10https://gerrit.wikimedia.org/r/1147009 (https://phabricator.wikimedia.org/T394514) (owner: 10Cwhite) [14:44:51] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply [14:45:57] (03PS3) 10Effie Mouzeli: cache.mcrouter: upgrade to 1.3.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141201 (https://phabricator.wikimedia.org/T393281) [14:45:57] 
(03PS7) 10Hnowlan: mw::periodic_job: add concurrency parameter to k8s jobs [puppet] - 10https://gerrit.wikimedia.org/r/1146010 (https://phabricator.wikimedia.org/T394423) [14:46:40] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [14:49:01] (03PS1) 10Cathal Mooney: New device additions for codfw expansion plus policy changes [homer/public] - 10https://gerrit.wikimedia.org/r/1147014 (https://phabricator.wikimedia.org/T394021) [14:49:30] (03CR) 10Scott French: [C:03+1] alertmanager: Add notifications-echo task creation route [puppet] - 10https://gerrit.wikimedia.org/r/1146874 (https://phabricator.wikimedia.org/T394471) (owner: 10Clément Goubert) [14:50:05] (03CR) 10Scott French: [C:03+1] mw::maintenance: migrate echo_mail_batch to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1146875 (https://phabricator.wikimedia.org/T394471) (owner: 10Clément Goubert) [14:53:33] (03CR) 10Hnowlan: [C:03+1] P:mw:maint:update_special_pages: remove absented non-sharded job [puppet] - 10https://gerrit.wikimedia.org/r/1146787 (https://phabricator.wikimedia.org/T388534) (owner: 10Scott French) [14:55:11] (03PS1) 10Cwhite: rsyslog: remove postfix entries from lookup table [puppet] - 10https://gerrit.wikimedia.org/r/1147015 (https://phabricator.wikimedia.org/T394514) [14:55:48] (03CR) 10Hnowlan: [C:03+1] mw::maintenance: migrate growthexperiments-refreshPraiseworthyMentees [puppet] - 10https://gerrit.wikimedia.org/r/1146569 (https://phabricator.wikimedia.org/T385782) (owner: 10Kamila Součková) [14:58:52] (03CR) 10Cwhite: [C:03+2] rsyslog: remove postfix entries from lookup table [puppet] - 10https://gerrit.wikimedia.org/r/1147015 (https://phabricator.wikimedia.org/T394514) (owner: 10Cwhite) [14:58:57] 
10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10830286 (10Volans) Do we have to modify anything in the provision cookbook for properly setting the iDRAC/BIOS for this setup? cc @elukey [15:00:50] (03PS23) 10Vgutierrez: varnish: Issue and handle WMF-Uniq cookie [puppet] - 10https://gerrit.wikimedia.org/r/1142551 (https://phabricator.wikimedia.org/T391411) [15:00:50] (03CR) 10Vgutierrez: "text tests are happy:" [puppet] - 10https://gerrit.wikimedia.org/r/1142551 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [15:04:31] 06SRE, 10LDAP-Access-Requests: Grant Access to https://idm.wikimedia.org/ for maxbinderWMF - https://phabricator.wikimedia.org/T394523 (10MBinder_WMF) 03NEW [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:43] (03CR) 10Vgutierrez: [C:04-1] hiera: Add zarcillo k8s service on traffic server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135387 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [15:10:41] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[15:10:48] (03PS1) 10Vgutierrez: Revert "trafficserver: Allow splitting the cache by HTTP header content" [puppet] - 10https://gerrit.wikimedia.org/r/1147016 (https://phabricator.wikimedia.org/T391411) [15:11:05] (03PS2) 10Vgutierrez: Revert "trafficserver: Allow splitting the cache by HTTP header content" [puppet] - 10https://gerrit.wikimedia.org/r/1147016 (https://phabricator.wikimedia.org/T391411) [15:11:15] (03PS3) 10Vgutierrez: Revert "trafficserver: Allow splitting the cache by HTTP header content" [puppet] - 10https://gerrit.wikimedia.org/r/1147016 (https://phabricator.wikimedia.org/T391411) [15:11:36] (03PS4) 10Vgutierrez: Revert "trafficserver: Allow splitting the cache by HTTP header content" [puppet] - 10https://gerrit.wikimedia.org/r/1147016 (https://phabricator.wikimedia.org/T391411) [15:11:50] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1147016 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [15:16:07] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [15:19:22] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1177.eqiad.wmnet with OS bullseye [15:19:27] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10830379 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1177.... 
[15:19:47] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1177.eqiad.wmnet with OS bullseye [15:19:56] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10830380 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1... [15:22:31] (03CR) 10Greg Grossmeier: "tested and worked, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1146725 (owner: 10Greg Grossmeier) [15:32:18] (03CR) 10Scott French: [C:03+1] "Thanks, Dan! If we can confirm this is good to go in one shot, then I can work with you on Monday to get this live." [puppet] - 10https://gerrit.wikimedia.org/r/1146091 (https://phabricator.wikimedia.org/T392526) (owner: 10Dduvall) [15:36:02] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Figure out plan for mailman IP situation - https://phabricator.wikimedia.org/T278495#10830445 (10ABran-WMF) [15:36:10] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#10830446 (10ABran-WMF) [15:36:51] (03PS1) 10Vgutierrez: trafficserver: Add X-Experiment-Enrollments to Vary header [puppet] - 10https://gerrit.wikimedia.org/r/1147022 (https://phabricator.wikimedia.org/T391411) [15:37:56] 06SRE, 10LDAP-Access-Requests: Grant Access to https://idm.wikimedia.org/ for maxbinderWMF - https://phabricator.wikimedia.org/T394523#10830455 (10taavi) Hmm, are you intentionally using [[ https://ldap.toolforge.org/user/maxbinderwmf | a new developer account ]] for this instead of the [[ https://ldap.toolfor... 
[15:38:02] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1177.eqiad.wmnet with reason: host reimage [15:38:13] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10830457 (10Jclark-ctr) Settings i changed set bios UEFI NVMe Driver setting to "All Drives" instead of "Dell Qualified Only" Set port to Http boot and disable TLS Then set c... [15:38:44] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [15:39:38] (03CR) 10CI reject: [V:04-1] trafficserver: Add X-Experiment-Enrollments to Vary header [puppet] - 10https://gerrit.wikimedia.org/r/1147022 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [15:41:37] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1177.eqiad.wmnet with reason: host reimage [15:41:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:42:52] (03CR) 10Federico Ceratto: "Added timeout, see comment" [puppet] - 10https://gerrit.wikimedia.org/r/1135387 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [15:44:18] (03Abandoned) 10Novem Linguae: Stabilization: convert deprecated Xml methods to Html [extensions/FlaggedRevs] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146634 (https://phabricator.wikimedia.org/T394403) (owner: 10Jforrester) [15:49:24] (03PS3) 10Ladsgroup: Add x1 to DBRecordCache for dumps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145243 (https://phabricator.wikimedia.org/T393513) [15:51:03] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: setup MPC10E-10C and SCBE3 - 
https://phabricator.wikimedia.org/T393552#10830526 (10cmooney) [15:52:42] (03CR) 10Ladsgroup: "ladsgroup@deploy1003:~$ dig -t srv _x1-analytics._tcp.eqiad.wmnet +short" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145243 (https://phabricator.wikimedia.org/T393513) (owner: 10Ladsgroup) [15:54:25] FIRING: [2x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:56:40] 06SRE, 06[Archived]Wikidata Dev Team, 10Prod-Kubernetes, 06Traffic, and 3 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10830532 (10Silvan_WMDE) \o/ [16:01:57] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:02:30] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr4-ulsfo.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [16:03:14] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1177.eqiad.wmnet with OS bullseye [16:03:18] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10830540 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1177.... 
[16:03:30] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:04:25] (03PS2) 10Vgutierrez: trafficserver: Add X-Experiment-Enrollments to Vary header [puppet] - 10https://gerrit.wikimedia.org/r/1147022 (https://phabricator.wikimedia.org/T391411) [16:09:30] 06SRE, 06Infrastructure-Foundations, 10netops: Homer: redefine IBGP clusters to support Unicast & EVPN - https://phabricator.wikimedia.org/T394530 (10cmooney) 03NEW [16:09:53] 06SRE, 06Infrastructure-Foundations, 10netops: Homer: redefine IBGP definitions to support both Unicast & EVPN clusters - https://phabricator.wikimedia.org/T394530#10830559 (10cmooney) [16:11:04] 06SRE, 10LDAP-Access-Requests: Grant Access to https://idm.wikimedia.org/ for maxbinderWMF - https://phabricator.wikimedia.org/T394523#10830564 (10MBinder_WMF) I am not! That other account was set up on my behalf, I think, as part of another small issue wherein I needed SSH to a specific thing but not much els... 
[16:11:20] (03PS1) 10Eevans: aqs: cleanup Cassandra roles & grants [puppet] - 10https://gerrit.wikimedia.org/r/1147026 (https://phabricator.wikimedia.org/T313877) [16:11:41] 06SRE, 06Infrastructure-Foundations, 10netops: Homer: redefine IBGP definitions to support both Unicast & EVPN clusters - https://phabricator.wikimedia.org/T394530#10830566 (10cmooney) p:05Triage→03Medium [16:13:30] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:17:30] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr4-ulsfo.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [16:20:46] (03PS1) 10Andrew Bogott: Horizon: maybe deploy octavia-dashboard in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1147027 [16:21:06] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10830583 (10Jclark-ctr) @Stevemunene imaged server and recreated VD for HDD [16:23:17] (03CR) 10Andrew Bogott: [C:03+2] Horizon: maybe deploy octavia-dashboard in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1147027 (owner: 10Andrew Bogott) [16:23:29] fceratto@cumin1002 clone (PID 2641279) is awaiting input [16:24:15] (03PS1) 10Btullis: Add a copy of the dump scripts that are in puppet [dumps] - 10https://gerrit.wikimedia.org/r/1147028 (https://phabricator.wikimedia.org/T394389) [16:25:08] (03PS2) 10Btullis: Add a copy of the dump scripts that are in puppet [dumps] - 10https://gerrit.wikimedia.org/r/1147028 (https://phabricator.wikimedia.org/T394389) [16:27:58] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Bump the version numbers for Java 
images based on latest security releases [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1146638 (owner: 10Muehlenhoff) [16:39:01] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:43:04] uhhh [16:43:08] swift unhappy in codfw [16:44:01] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:45:22] load spike on ms-fe2011 [16:45:35] recovered I guess [16:46:57] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:47:30] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr4-ulsfo.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [16:47:30] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:52:08] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [16:52:31] RESOLVED: Primary inbound port utilisation over 80% #page: Device 
cr4-ulsfo.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [16:52:31] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:57:25] FIRING: SystemdUnitFailed: git_pull_charts.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:01:35] 07Puppet: sync-puppet-ca timer broken - https://phabricator.wikimedia.org/T392628#10830684 (10jhathaway) 05Open→03Resolved [17:02:40] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10830688 (10Papaul) All the steps looks good to me thanks. 
[17:03:51] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1238 gradually with 4 steps - Pool db1238.eqiad.wmnet in after cloning [17:03:57] (03PS3) 10Btullis: Add a copy of the dump scripts that are in puppet [dumps] - 10https://gerrit.wikimedia.org/r/1147028 (https://phabricator.wikimedia.org/T394389) [17:06:17] (03PS3) 10Tchanders: Set $wgCentralAuthAutomaticGlobalGroups for global IP reveal group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127960 (https://phabricator.wikimedia.org/T376315) [17:08:29] (03CR) 10Tchanders: "We no longer need to wait to rename these groups, since the IP reveal right is assigned to them directly and they do not need to use this " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127960 (https://phabricator.wikimedia.org/T376315) (owner: 10Tchanders) [17:09:32] (03PS3) 10Aleksandar Mastilovic: Removing WM Enterprise downloader Puppet configuration [puppet] - 10https://gerrit.wikimedia.org/r/1142712 (https://phabricator.wikimedia.org/T390556) [17:10:34] (03CR) 10Aleksandar Mastilovic: Removing WM Enterprise downloader Puppet configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142712 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic) [17:24:01] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:26:57] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:33:09] 10ops-eqiad, 06SRE, 06DC-Ops: SSD firmware update for an-coord100[3-4] - 
https://phabricator.wikimedia.org/T394499#10830790 (10RobH) @btullis, Thank you! I was trying to get a volunteer to let me test the first firmware updates before I roll out documentation. I'll work on an-coord1004.eqiad.wmnet, as lo... [17:33:48] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Dell SSD Critical Firmware Update - https://phabricator.wikimedia.org/T394348#10830794 (10RobH) >>! In T394348#10829555, @BTullis wrote: > Just wondering @RobH, why do an-coord100[3-4] say //(decommissioned)// ? @wiki_willy: Please advise, I wasn't sure why the l... [17:39:23] 10ops-eqiad, 06SRE, 06DC-Ops: SSD firmware update for an-mariadb100[1-2] - https://phabricator.wikimedia.org/T394498#10830813 (10RobH) We haven't done a firmware upgrade on an SSD, so we need to do a test unit first from the list to fully document the downtime expectations. between an-coord1004.eqiad.wmnet... [17:41:00] (03PS1) 10Andrew Bogott: Update horizon version for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1147037 [17:41:13] (03CR) 10CI reject: [V:04-1] Update horizon version for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1147037 (owner: 10Andrew Bogott) [17:42:23] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Dell SSD Critical Firmware Update - https://phabricator.wikimedia.org/T394348#10830817 (10wiki_willy) [17:44:23] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Dell SSD Critical Firmware Update - https://phabricator.wikimedia.org/T394348#10830821 (10wiki_willy) Hi @BTullis - apologies for the mixup. For some reason, I had mixed up the dates with an-coord100[1,2], which are both offline. I've fixed the notes and removed... 
[17:44:33] (03PS2) 10Andrew Bogott: Update horizon version for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1147037 [17:46:16] (03CR) 10Andrew Bogott: [C:03+2] Update horizon version for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1147037 (owner: 10Andrew Bogott) [17:49:04] (03PS2) 10Scott French: hieradata: switch mw-debug pinkunicorn to PHP 8.1 (1 of 2) [puppet] - 10https://gerrit.wikimedia.org/r/1137498 (https://phabricator.wikimedia.org/T391057) [17:49:07] (03PS2) 10Scott French: mw-debug: switch mw-debug pinkunicorn to PHP 8.1 (2 of 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137499 (https://phabricator.wikimedia.org/T391057) [17:49:13] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10830833 (10wiki_willy) Hey @VRiley-WMF & @Jclark-ctr - I remember you two were working on tracking down and consolidating all the Dell Support tickets that we've opened for this server. Can you send... [17:49:16] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1238 gradually with 4 steps - Pool db1238.eqiad.wmnet in after cloning [17:49:18] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1238.eqiad.wmnet onto db1247.eqiad.wmnet [17:51:44] 10ops-eqiad, 06SRE, 06DC-Ops: SSD firmware update for an-coord100[3-4] - https://phabricator.wikimedia.org/T394499#10830856 (10BTullis) >>! In T394499#10830790, @RobH wrote: > @btullis, > > Thank you! I was trying to get a volunteer to let me test the first firmware updates before I roll out documentation.... [17:55:11] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10830871 (10VRiley-WMF) Hey @wiki_willy Here is the list I have 188297490 - April 5th 2024 197398410 - September 10th 2024 198075128 - September 23rd 2024 200579927 - November 7th 2024 206617456 - Mar... 
[17:57:24] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[17:57:52] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[17:58:11] ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10830877 (wiki_willy) Awesome, thanks @VRiley-WMF! Can you do me one more favor and summarize what was replaced next to each ticket for each Tech Support request?
[17:59:46] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-coord1004.eqiad.wmnet
[18:00:11] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts an-coord1004.eqiad.wmnet
[18:05:39] FIRING: CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-magru (195.200.68.153) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_magru&var-bgp_neighbor=cr2-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[18:09:00] (CR) Aleksandar Mastilovic: "Conflicts resolved." [puppet] - https://gerrit.wikimedia.org/r/1142712 (https://phabricator.wikimedia.org/T390556) (owner: Aleksandar Mastilovic)
[18:10:17] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1247.eqiad.wmnet with reason: To be set up in a few days
[18:10:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-magru (195.200.68.153) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_magru&var-bgp_neighbor=cr2-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[18:12:41] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-coord1004.eqiad.wmnet
[18:12:52] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts an-coord1004.eqiad.wmnet
[18:13:11] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-coord1004.eqiad.wmnet
[18:14:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[18:19:48] ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10831001 (VRiley-WMF) Yes, There was only 3 dispatches on this issue. It goes as follows 188297490 - April 5th 2024 - Logs showed up "clean" on their end and didn't send anyone out. 197398410 - Sept...
[18:24:01] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:25:06] !log robh@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts an-coord1004.eqiad.wmnet
[18:25:22] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-coord1004.eqiad.wmnet
[18:25:33] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts an-coord1004.eqiad.wmnet
[18:26:54] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-coord1004.eqiad.wmnet
[18:27:19] ok, so its not a checksum error just the script hates that particular filename i suppose
[18:27:31] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr4-ulsfo.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[18:27:31] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[18:29:01] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:32:27] (CR) BCornwall: [C:+1] Revert "trafficserver: Allow splitting the cache by HTTP header content" [puppet] - https://gerrit.wikimedia.org/r/1147016 (https://phabricator.wikimedia.org/T391411) (owner: Vgutierrez)
[18:33:28] (PS1) Greg Grossmeier: Merge branch 'master' into wmf_deploy [extensions/CentralNotice] (wmf/1.45.0-wmf.1) - https://gerrit.wikimedia.org/r/1147043
[18:34:01] !log robh@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts an-coord1004.eqiad.wmnet
[18:34:01] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:34:51] !log robh@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-coord1004.eqiad.wmnet
[18:34:54] SRE, Data-Engineering, Data-Engineering-Icebox, Traffic, and 3 others: Requests for /static get an invalid WMF-Last-Access cookie for wikipedia.org on non-Wikipedia requests - https://phabricator.wikimedia.org/T261803#10831043 (Krinkle)
[18:35:00] (CR) CI reject: [V:-1] Merge branch 'master' into wmf_deploy [extensions/CentralNotice] (wmf/1.45.0-wmf.1) - https://gerrit.wikimedia.org/r/1147043 (owner: Greg Grossmeier)
[18:36:57] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:42:31] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr4-ulsfo.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[18:42:31] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[18:43:03] (PS1) Dwisehaupt: spf wikimedia.org: add community-crm SPF record [dns] - https://gerrit.wikimedia.org/r/1147046 (https://phabricator.wikimedia.org/T383715)
[18:46:32] !log robh@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts an-coord1004.eqiad.wmnet
[18:46:40] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[18:49:32] (CR) BCornwall: [C:+1] spf wikimedia.org: add community-crm SPF record [dns] - https://gerrit.wikimedia.org/r/1147046 (https://phabricator.wikimedia.org/T383715) (owner: Dwisehaupt)
[19:00:50] Puppet: validate systemd units - https://phabricator.wikimedia.org/T392629#10831150 (jhathaway) This broke @MatthewVernon's bootstrapping and was reverted: ` mvernon@moss-be1001:~$ systemd-analyze verify --recursive-errors=no /etc/systemd/system/systemd-timesyncd.service.d/puppet-override.conf:systemd-times...
[19:04:04] !log robh@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-coord1004.eqiad.wmnet
[19:04:11] !log robh@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts an-coord1004.eqiad.wmnet
[19:04:12] SRE, LDAP-Access-Requests: Grant Access to https://idm.wikimedia.org/ for maxbinderWMF - https://phabricator.wikimedia.org/T394523#10831155 (BCornwall) My guess is that your old one would remain as a personal account and that MaxBinderWMF would be for official representation (given that the WMF suffix is...
[19:05:02] !log robh@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-coord1004.eqiad.wmnet
[19:05:08] !log robh@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts an-coord1004.eqiad.wmnet
[19:06:05] !log robh@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-coord1004.eqiad.wmnet
[19:06:10] !log robh@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts an-coord1004.eqiad.wmnet
[19:06:34] ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10831157 (wiki_willy) Perfect, thanks @VRiley-WMF! I just sent an email out to our Dell Account team and cc'd you and John on it.
[19:10:22] ops-codfw, ops-eqiad, SRE, DC-Ops: Dell SSD Critical Firmware Update - https://phabricator.wikimedia.org/T394348#10831173 (RobH)
[19:10:26] hey i've got a patch that's stuck in "ready to submit" state and it seems to have run the gate-and-submit jobs but ........ it hasn't merged. is it waiting on something else? https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Chart/+/1139132
[19:13:11] Puppet: validate systemd units - https://phabricator.wikimedia.org/T392629#10831176 (MoritzMuehlenhoff) A third option would be to have systemctl expand the combination of systemd unit and the overrides(s): "systemctl cat systemd-timesyncd" shows the effective combination of both, we could also verify on thi...
[19:15:16] (PS1) Andrew Bogott: Horizon: update version in codfw1dev again [puppet] - https://gerrit.wikimedia.org/r/1147050
[19:15:20] aha
[19:15:31] if i manually hit 'submit' hiding in the toolbar:
[19:15:32] An error occurred
[19:15:32] Could not perform action: Failed to submit 1 change due to the following problems:
[19:15:32] Change 1139132: Depends on commit that cannot be merged. Commit bdfef671bbbdd1ea10b74dfb88852362bdfa7d1b depends on commit cacbed5a182eb708ebeb14d2f26dfa5278a1d431, which is outdated patch set 3 of change 1139128. The latest patch set is 4.
[19:16:19] trying rebasing it
[19:17:22] (PS2) Andrew Bogott: Horizon: update version in codfw1dev again [puppet] - https://gerrit.wikimedia.org/r/1147050
[19:18:58] (CR) Andrew Bogott: [C:+2] Horizon: update version in codfw1dev again [puppet] - https://gerrit.wikimedia.org/r/1147050 (owner: Andrew Bogott)
[19:22:53] ops-eqiad, SRE, DC-Ops: SSD firmware update for an-coord100[3-4] - https://phabricator.wikimedia.org/T394499#10831197 (RobH) p:Triage→Medium
[19:24:15] ops-eqiad, SRE, DC-Ops: SSD firmware update for an-coord100[3-4] - https://phabricator.wikimedia.org/T394499#10831216 (RobH) Updated firmware on the idrac interface to 7 and then from DL70 to DL7C on both disks. The update cookbook failed, so did it manually and found out a reboot is required for th...
[19:24:43] ops-codfw, ops-eqiad, SRE, DC-Ops: Dell SSD Critical Firmware Update - https://phabricator.wikimedia.org/T394348#10831220 (RobH) The firmware directory on the cumin hosts has a STORAGE directory, but the cookbook has a failure documented on T394543 for repair before we can roll out the firmware u...
[19:36:30] Puppet: validate systemd units - https://phabricator.wikimedia.org/T392629#10831269 (jhathaway) >>! In T392629#10831176, @MoritzMuehlenhoff wrote: > A third option would be to have systemctl expand the combination of systemd unit and the overrides(s): "systemctl cat systemd-timesyncd" shows the effective com...
[19:50:09] (PS1) Andrew Bogott: Update codfw1dev Horizon version again [puppet] - https://gerrit.wikimedia.org/r/1147056
[19:51:19] (CR) Andrew Bogott: [C:+2] Update codfw1dev Horizon version again [puppet] - https://gerrit.wikimedia.org/r/1147056 (owner: Andrew Bogott)
[19:54:25] FIRING: [2x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:59:19] SRE, LDAP-Access-Requests: Grant Access to https://idm.wikimedia.org/ for maxbinderWMF - https://phabricator.wikimedia.org/T394523#10831292 (MBinder_WMF) hmm, well, I do use the other one for Phabricator batch edit silencing, though I don't know that that would always have to be done under an official ac...
[20:01:21] (CR) Xcollazo: Removing WM Enterprise downloader Puppet configuration (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1142712 (https://phabricator.wikimedia.org/T390556) (owner: Aleksandar Mastilovic)
[20:01:57] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:11:41] (CR) Novem Linguae: "See T394542 for discussion about the CI failure and how to fix." [extensions/CentralNotice] (wmf/1.45.0-wmf.1) - https://gerrit.wikimedia.org/r/1147043 (owner: Greg Grossmeier)
[20:14:25] (PS1) Bvibber: Render Data:.chart page reviews in user language [extensions/Chart] (wmf/1.45.0-wmf.1) - https://gerrit.wikimedia.org/r/1147059 (https://phabricator.wikimedia.org/T392725)
[20:16:12] (PS1) Reedy: Update incorrect PHP namespace in BundleSizeTest [extensions/CentralNotice] (wmf/1.45.0-wmf.1) - https://gerrit.wikimedia.org/r/1147060 (https://phabricator.wikimedia.org/T373017)
[20:16:34] (PS2) Greg Grossmeier: Merge branch 'master' into wmf_deploy [extensions/CentralNotice] (wmf/1.45.0-wmf.1) - https://gerrit.wikimedia.org/r/1147043
[20:18:20] (CR) CI reject: [V:-1] Update incorrect PHP namespace in BundleSizeTest [extensions/CentralNotice] (wmf/1.45.0-wmf.1) - https://gerrit.wikimedia.org/r/1147060 (https://phabricator.wikimedia.org/T373017) (owner: Reedy)
[20:18:46] (CR) CI reject: [V:-1] Merge branch 'master' into wmf_deploy [extensions/CentralNotice] (wmf/1.45.0-wmf.1) - https://gerrit.wikimedia.org/r/1147043 (owner: Greg Grossmeier)
[20:20:57] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:21:38] hmm
[20:21:50] !incidents
[20:21:50] 6137 (UNACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[20:21:50] 6135 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr1-codfw.wikimedia.org)
[20:21:51] 6134 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr4-ulsfo.wikimedia.org)
[20:21:51] 6133 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr1-codfw.wikimedia.org)
[20:21:51] 6132 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr4-ulsfo.wikimedia.org)
[20:21:51] 6130 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr4-ulsfo.wikimedia.org)
[20:21:51] 6131 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr1-codfw.wikimedia.org)
[20:21:52] 6128 (RESOLVED) GatewayBackendErrorsHigh sre (mobileapps_cluster rest-gateway eqiad)
[20:22:02] !ack 6137
[20:22:02] 6137 (ACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[20:23:27] Hey all - I’d like to try a quick, low-risk miscweb deployment now, unless there are any objections.
[20:25:57] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:26:36] !log titan100[12] systemctl restart thanos-query
[20:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:48] !log sbassett@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[20:41:02] (CR) Jdlrobson: [C:+1] Render Data:.chart page reviews in user language [extensions/Chart] (wmf/1.45.0-wmf.1) - https://gerrit.wikimedia.org/r/1147059 (https://phabricator.wikimedia.org/T392725) (owner: Bvibber)
[20:41:55] !log sbassett@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[20:42:31] !log sbassett@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[20:47:02] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:52:08] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[20:52:44] !log sbassett@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[20:52:46] SRE, LDAP-Access-Requests: Grant Access to https://idm.wikimedia.org/ for maxbinderWMF - https://phabricator.wikimedia.org/T394523#10831440 (BCornwall) I'm unfortunately not the right person to ask for that! I'd get in touch with your manager and see what the bigger wigs say.
[20:57:25] FIRING: SystemdUnitFailed: git_pull_charts.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:00:51] !log sbassett@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[21:13:23] (PS2) Alexandros Kosiaris: calico: Set veth_mtu to 1480 for staging-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1145982 (https://phabricator.wikimedia.org/T352956)
[21:13:27] (CR) Alexandros Kosiaris: calico: Set veth_mtu to 1480 for staging-codfw (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1145982 (https://phabricator.wikimedia.org/T352956) (owner: Alexandros Kosiaris)
[21:16:21] !log sbassett@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[21:16:33] !log sbassett@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[21:16:40] !log sbassett@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[21:16:46] !log sbassett@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[21:16:49] !log sbassett@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[21:22:25] SRE, LDAP-Access-Requests: Grant Access to https://idm.wikimedia.org/ for maxbinderWMF - https://phabricator.wikimedia.org/T394523#10831544 (MoritzMuehlenhoff) >>! In T394523#10831292, @MBinder_WMF wrote: > hmm, well, I do use the other one for Phabricator batch edit silencing, though I don't know that t...
[22:14:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[22:36:03] (CR) Dreamy Jazz: Update IPInfo access levels (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1146969 (https://phabricator.wikimedia.org/T375086) (owner: Máté Szabó)
[22:38:04] (CR) Dreamy Jazz: Update IPInfo access levels (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1146969 (https://phabricator.wikimedia.org/T375086) (owner: Máté Szabó)
[22:39:49] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install apus-be2004 - https://phabricator.wikimedia.org/T392845#10831666 (Jhancock.wm)
[22:40:21] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10831667 (Jhancock.wm)
[22:42:06] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10831668 (Jhancock.wm)
[22:42:27] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10831671 (Jhancock.wm)
[22:42:58] ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcontrol2010-dev - https://phabricator.wikimedia.org/T393102#10831672 (Jhancock.wm)
[22:43:18] ops-codfw, SRE, Data-Persistence, DC-Ops: Q4:rack/setup/install es204[78] - https://phabricator.wikimedia.org/T393106#10831673 (Jhancock.wm)
[22:43:39] ops-codfw, SRE, Data-Persistence, DC-Ops: Q4:rack/setup/install pc2018 - https://phabricator.wikimedia.org/T393110#10831674 (Jhancock.wm)
[22:43:59] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2004 (Config D 1P) - https://phabricator.wikimedia.org/T393986#10831687 (Jhancock.wm)
[22:46:40] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[22:58:24] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[23:01:50] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding thanos-be2006 to codfw - jhancock@cumin2002"
[23:01:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding thanos-be2006 to codfw - jhancock@cumin2002"
[23:01:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:06:34] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[23:09:38] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding es2047 to codfw - jhancock@cumin2002"
[23:09:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding es2047 to codfw - jhancock@cumin2002"
[23:09:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:12:22] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host apus-be2004
[23:12:27] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host thanos-be2006
[23:12:28] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host thanos-be2007
[23:12:29] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host thanos-be2008
[23:12:30] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host thanos-be2009
[23:12:31] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host sretest2003
[23:12:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host thanos-be2006
[23:12:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host apus-be2004
[23:12:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host thanos-be2007
[23:12:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host thanos-be2008
[23:12:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host thanos-be2009
[23:12:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2003
[23:13:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host apus-be2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:13:49] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:14:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:14:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:15:11] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:15:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:20:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:26:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:27:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host apus-be2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:27:29] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:27:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:38:54] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1147072
[23:38:55] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1147072 (owner: TrainBranchBot)
[23:40:16] jhancock@cumin2002 provision (PID 441115) is awaiting input
[23:40:20] jhancock@cumin2002 provision (PID 441581) is awaiting input
[23:40:26] jhancock@cumin2002 provision (PID 440960) is awaiting input
[23:43:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:43:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:43:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:50:00] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1147072 (owner: TrainBranchBot)
[23:52:06] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['apus-be2004']
[23:52:21] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['apus-be2004']
[23:54:25] FIRING: [2x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:56:34] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install apus-be2004 - https://phabricator.wikimedia.org/T392845#10831743 (Jhancock.wm)
[23:56:42] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['thanos-be2006']
[23:56:54] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['thanos-be2006']
[23:57:04] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest2003']
[23:57:11] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sretest2003']
[23:57:25] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest2003']
[23:57:33] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sretest2003']
[23:59:07] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10831744 (Jhancock.wm)
[23:59:24] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10831745 (Jhancock.wm)
[23:59:40] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10831746 (Jhancock.wm)