[00:03:37] FIRING: SystemdUnitFailed: wmf_auto_restart_gnmic.service on netflow1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:10:40] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 633.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:19:45] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:23:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:28:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:33:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:38:03] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1114146 [00:38:03] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest 
[core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1114146 (owner: TrainBranchBot) [00:39:45] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:43:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:48:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:53:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:53:32] (CR) CI reject: [V:-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1114146 (owner: TrainBranchBot) [00:54:46] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:03:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - 
https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:04:46] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:08:31] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1114147 [01:08:31] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1114147 (owner: TrainBranchBot) [01:09:45] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:13:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:24:46] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:27:14] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1114147 (owner: TrainBranchBot) [01:28:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes 
(http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:33:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:38:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:40:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1096-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [02:03:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:04:45] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:14:46] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes 
(http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:18:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:18:40] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:23:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:28:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:33:30] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:34:46] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:39:45] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:48:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:49:46] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:08:37] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:18:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - 
https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:23:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:38:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:39:46] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:46:36] (PS1) Ottomata: beta EventStreamConfig - set eventgate hoist_fields_from_http_headers [mediawiki-config] - https://gerrit.wikimedia.org/r/1114149 (https://phabricator.wikimedia.org/T382173) [03:51:31] FIRING: [5x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [03:53:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:58:31] FIRING: [2x] 
ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:03:38] FIRING: SystemdUnitFailed: wmf_auto_restart_gnmic.service on netflow1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:04:46] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:08:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:24:46] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:29:46] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:03:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - 
https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:04:45] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:40:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1096-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [05:48:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:53:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:56:57] SRE, DBA: db2182 depooled (Errmsg: Error Index for table recentchanges is corrupt) - https://phabricator.wikimedia.org/T384801#10495668 (Marostegui) a:Marostegui [06:08:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:09:46] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:13:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:22:46] (PS1) Marostegui: db2182: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1114258 (https://phabricator.wikimedia.org/T384801) [06:23:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:23:38] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1241.eqiad.wmnet with reason: Index rebuild + upgrade [06:24:42] (CR) Marostegui: [C:+2] db2182: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1114258 (https://phabricator.wikimedia.org/T384801) (owner: Marostegui) [06:24:43] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1162.eqiad.wmnet with reason: Maintenance [06:28:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[06:33:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:43:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:44:46] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:48:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:49:45] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:54:00] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db2204.codfw.wmnet with reason: Maintenance [07:08:37] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:18:31] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:19:46] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:20:50] (PS1) Slyngshede: Failover IDP before reboot [dns] - https://gerrit.wikimedia.org/r/1114266 [07:23:08] (CR) Slyngshede: [C:+2] Failover IDP before reboot [dns] - https://gerrit.wikimedia.org/r/1114266 (owner: Slyngshede) [07:23:19] !log slyngshede@dns1004 START - running authdns-update [07:25:09] !log slyngshede@dns1004 END - running authdns-update [07:28:02] jouncebot: nowandnext [07:28:02] For the next 0 hour(s) and 31 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250126T0800) [07:28:02] In 0 hour(s) and 31 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T0800) [07:28:30] hmm [07:29:46] RESOLVED: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:33:16] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:33:16] (PS1) Samtar: IS: Enable wgUseCodexSpecialBlock on prod test.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1114324 (https://phabricator.wikimedia.org/T377121) [07:33:16] (CR) Samtar: [C:-2] "Do not merge: Blocked on T377121" [mediawiki-config] - https://gerrit.wikimedia.org/r/1114324 (https://phabricator.wikimedia.org/T377121) (owner: Samtar) [07:35:48] !log installing tomcat10 security updates [07:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:24] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1159.eqiad.wmnet with reason: Maintenance [07:40:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1159 (T384592)', diff saved to https://phabricator.wikimedia.org/P72439 and previous config saved to /var/cache/conftool/dbconfig/20250127-074030-marostegui.json [07:40:35] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [07:42:32] SRE, DBA, Patch-For-Review: db2182 depooled (Errmsg: Error Index for table 
recentchanges is corrupt) - https://phabricator.wikimedia.org/T384801#10495691 (Marostegui) [07:42:53] SRE, DBA, Patch-For-Review: db2182 depooled (Errmsg: Error Index for table recentchanges is corrupt) - https://phabricator.wikimedia.org/T384801#10495693 (Marostegui) p:Triage→Medium The host was upgraded and the tables are now being rebuilt. [07:43:16] RESOLVED: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:44:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp2004.wikimedia.org [07:45:20] PROBLEM - Disk space on grafana2001 is CRITICAL: DISK CRITICAL - free space: / 465MiB (3% inode=36%): /tmp 465MiB (3% inode=36%): /var/tmp 465MiB (3% inode=36%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [07:46:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp2004.wikimedia.org [07:51:31] FIRING: [5x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [07:52:50] (PS1) Muehlenhoff: Failover back to idp2004 [dns] - https://gerrit.wikimedia.org/r/1114325 [07:57:24] (CR) Slyngshede: [C:+1] "Looks good." [dns] - https://gerrit.wikimedia.org/r/1114325 (owner: Muehlenhoff) [08:00:05] Amir1, Urbanecm, and awight: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T0800). 
[08:00:05] No Gerrit patches in the queue for this window AFAICS. [08:01:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T384592)', diff saved to https://phabricator.wikimedia.org/P72440 and previous config saved to /var/cache/conftool/dbconfig/20250127-080112-marostegui.json [08:01:18] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [08:02:19] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet [08:02:32] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10495702 (ops-monitoring-bot) Draining ganeti2020.codfw.wmnet of running VMs [08:03:37] FIRING: SystemdUnitFailed: wmf_auto_restart_gnmic.service on netflow1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:05:20] RECOVERY - Disk space on grafana2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [08:08:32] SRE, Deployments, Release-Engineering-Team: Reduce automatic messages on #wikimedia-operations - https://phabricator.wikimedia.org/T384804 (hashar) NEW [08:12:33] !log installing rsync regression updates on bullseye [08:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P72441 and previous config saved to /var/cache/conftool/dbconfig/20250127-081619-marostegui.json [08:21:29] (PS1) DCausse: airflow: enable show_trigger_form_if_no_params [deployment-charts] - 
https://gerrit.wikimedia.org/r/1114327 (https://phabricator.wikimedia.org/T384805) [08:23:27] (PS1) Marostegui: mariadb: Remove es1023 [puppet] - https://gerrit.wikimedia.org/r/1114328 (https://phabricator.wikimedia.org/T384679) [08:24:13] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts es1023.eqiad.wmnet [08:26:18] (CR) Marostegui: [C:+2] mariadb: Remove es1023 [puppet] - https://gerrit.wikimedia.org/r/1114328 (https://phabricator.wikimedia.org/T384679) (owner: Marostegui) [08:26:50] (CR) Volans: k8s.pool-depool-node: Add support to downtime/remove downtime (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1114000 (owner: JMeybohm) [08:30:03] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox [08:30:30] (CR) Gmodena: [C:+1] "LGTM." [deployment-charts] - https://gerrit.wikimedia.org/r/1114327 (https://phabricator.wikimedia.org/T384805) (owner: DCausse) [08:30:54] (PS1) Fabfur: hiera: add haproxykafka to esams [puppet] - https://gerrit.wikimedia.org/r/1114329 (https://phabricator.wikimedia.org/T378578) [08:31:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P72442 and previous config saved to /var/cache/conftool/dbconfig/20250127-083126-marostegui.json [08:32:07] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1114329 (https://phabricator.wikimedia.org/T378578) (owner: Fabfur) [08:34:40] (CR) DCausse: "I agree but seems like none of our dags are using those, I would suggest to add this config while we agree and migrate existing DAGs to th" [deployment-charts] - https://gerrit.wikimedia.org/r/1114327 (https://phabricator.wikimedia.org/T384805) (owner: DCausse) [08:37:10] (CR) Brouberol: "Nicely done! Do you want to test this on airflow-test-k8s before we merge, to make sure this does what we want?" 
[deployment-charts] - https://gerrit.wikimedia.org/r/1114327 (https://phabricator.wikimedia.org/T384805) (owner: DCausse) [08:37:50] (PS2) Fabfur: hiera: add haproxykafka to esams [puppet] - https://gerrit.wikimedia.org/r/1114329 (https://phabricator.wikimedia.org/T378578) [08:38:04] (CR) DCausse: "sure! if possible please let me know how to do this, thanks!" [deployment-charts] - https://gerrit.wikimedia.org/r/1114327 (https://phabricator.wikimedia.org/T384805) (owner: DCausse) [08:41:22] SRE, DBA, Patch-For-Review: db2182 depooled (Errmsg: Error Index for table recentchanges is corrupt) - https://phabricator.wikimedia.org/T384801#10495761 (Marostegui) [08:41:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1241 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72443 and previous config saved to /var/cache/conftool/dbconfig/20250127-084145-root.json [08:42:01] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1023.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [08:42:16] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1023.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [08:42:16] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:42:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es1023.eqiad.wmnet [08:42:34] !log installing gtk+3.0 security updates [08:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:53] ops-eqiad, DBA, DC-Ops, decommission-hardware: decommission es1023.eqiad.wmnet - https://phabricator.wikimedia.org/T384679#10495768 (Marostegui) a:Marostegui→None [08:43:09] 
10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1023.eqiad.wmnet - https://phabricator.wikimedia.org/T384679#10495773 (10Marostegui) This is ready for #dc-ops [08:43:46] (03PS1) 10Volans: netbox: use asctime in the logs [puppet] - 10https://gerrit.wikimedia.org/r/1114331 (https://phabricator.wikimedia.org/T379072) [08:44:05] (03PS2) 10Volans: netbox: use asctime in the logs [puppet] - 10https://gerrit.wikimedia.org/r/1114331 (https://phabricator.wikimedia.org/T379072) [08:44:43] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: sre.netbox.update-extras hits KeyError with logging - https://phabricator.wikimedia.org/T379072#10495782 (10Volans) I've sent the above patch that I think should fix the issue. [08:46:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T384592)', diff saved to https://phabricator.wikimedia.org/P72444 and previous config saved to /var/cache/conftool/dbconfig/20250127-084633-marostegui.json [08:46:38] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [08:46:49] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1161.eqiad.wmnet with reason: Maintenance [08:47:06] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:47:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1161 (T384592)', diff saved to https://phabricator.wikimedia.org/P72445 and previous config saved to /var/cache/conftool/dbconfig/20250127-084713-marostegui.json [08:48:02] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114329 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [08:48:05] (03CR) 10Elukey: [C:03+1] "Really nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1114331 (https://phabricator.wikimedia.org/T379072) (owner: 10Volans) [08:48:15] (03PS1) 10Marostegui: rebuild_tables.sh: Add linter [software] - 10https://gerrit.wikimedia.org/r/1114332 (https://phabricator.wikimedia.org/T384799) [08:48:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1223', diff saved to https://phabricator.wikimedia.org/P72446 and previous config saved to /var/cache/conftool/dbconfig/20250127-084857-marostegui.json [08:49:12] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1223.eqiad.wmnet [08:49:32] !log Upgrade db1223 T384807 [08:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:36] T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807 [08:50:41] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1223.eqiad.wmnet with reason: Index rebuild + upgrade [08:51:53] (03CR) 10Marostegui: "FYI guys!" [software] - 10https://gerrit.wikimedia.org/r/1114332 (https://phabricator.wikimedia.org/T384799) (owner: 10Marostegui) [08:51:55] (03CR) 10Marostegui: [C:03+2] rebuild_tables.sh: Add linter [software] - 10https://gerrit.wikimedia.org/r/1114332 (https://phabricator.wikimedia.org/T384799) (owner: 10Marostegui) [08:52:20] (03Merged) 10jenkins-bot: rebuild_tables.sh: Add linter [software] - 10https://gerrit.wikimedia.org/r/1114332 (https://phabricator.wikimedia.org/T384799) (owner: 10Marostegui) [08:54:45] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1223.eqiad.wmnet [08:55:01] (03CR) 10Vgutierrez: [C:04-1] "hieradata/hosts/cp3066.yaml is no longer needed" [puppet] - 10https://gerrit.wikimedia.org/r/1114329 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [08:55:52] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.11 point update - https://phabricator.wikimedia.org/T373795#10495816 (10MoritzMuehlenhoff) [08:56:01] !log 
root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1223.eqiad.wmnet with reason: Index rebuild [08:56:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1241 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72447 and previous config saved to /var/cache/conftool/dbconfig/20250127-085650-root.json [08:56:57] !log installing net-tools bugfix updates on bullseye [08:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:08] (03PS1) 10Marostegui: rebuild_tables.sh: Add downtime [software] - 10https://gerrit.wikimedia.org/r/1114333 (https://phabricator.wikimedia.org/T382842) [08:58:00] (03CR) 10Marostegui: "FYI" [software] - 10https://gerrit.wikimedia.org/r/1114333 (https://phabricator.wikimedia.org/T382842) (owner: 10Marostegui) [08:58:45] (03CR) 10Marostegui: [C:03+2] rebuild_tables.sh: Add downtime [software] - 10https://gerrit.wikimedia.org/r/1114333 (https://phabricator.wikimedia.org/T382842) (owner: 10Marostegui) [08:59:54] (03Merged) 10jenkins-bot: rebuild_tables.sh: Add downtime [software] - 10https://gerrit.wikimedia.org/r/1114333 (https://phabricator.wikimedia.org/T382842) (owner: 10Marostegui) [09:00:20] (03CR) 10Muehlenhoff: [C:03+2] Failover back to idp2004 [dns] - 10https://gerrit.wikimedia.org/r/1114325 (owner: 10Muehlenhoff) [09:00:25] !log jmm@dns1004 START - running authdns-update [09:01:20] (03PS3) 10Fabfur: hiera: add haproxykafka to esams [puppet] - 10https://gerrit.wikimedia.org/r/1114329 (https://phabricator.wikimedia.org/T378578) [09:02:15] !log jmm@dns1004 END - running authdns-update [09:07:49] RECOVERY - Host ms-fe1014 is UP: PING OK - Packet loss = 0%, RTA = 80.20 ms [09:08:04] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability, and 2 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10495846 (10JMeybohm) [09:08:34] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db1161 (T384592)', diff saved to https://phabricator.wikimedia.org/P72448 and previous config saved to /var/cache/conftool/dbconfig/20250127-090833-marostegui.json [09:08:38] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [09:11:52] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.11 point update - https://phabricator.wikimedia.org/T373795#10495849 (10MoritzMuehlenhoff) [09:11:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1241 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72449 and previous config saved to /var/cache/conftool/dbconfig/20250127-091155-root.json [09:14:13] PROBLEM - Host ms-fe1014 is DOWN: PING CRITICAL - Packet loss = 100% [09:14:47] 14SRE-Sprint-Week-Sustainability-March2023, 06Data-Persistence-Automations, 06DBA, 13Patch-For-Review, 10Sustainability (Incident Followup): Implement (or refactor) a script to move slaves when the master is not available - https://phabricator.wikimedia.org/T196366#10495866 (10FCeratto-WMF) [09:16:18] (03CR) 10Muehlenhoff: [C:03+2] Switch an-test-presto1001 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1111180 (owner: 10Muehlenhoff) [09:23:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P72450 and previous config saved to /var/cache/conftool/dbconfig/20250127-092340-marostegui.json [09:25:16] (03PS1) 10Filippo Giunchedi: thanos: send sigkill as needed to stateless components [puppet] - 10https://gerrit.wikimedia.org/r/1114336 (https://phabricator.wikimedia.org/T383570) [09:25:39] (03PS3) 10JMeybohm: k8s.pool-depool-node: Add support to downtime/remove downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1114000 [09:27:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1241 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72451 and 
previous config saved to /var/cache/conftool/dbconfig/20250127-092701-root.json [09:27:40] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Neslihan Turan - WMDE - https://phabricator.wikimedia.org/T384017#10495902 (10jcrespo) a:03jcrespo [09:27:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-test-presto1001.eqiad.wmnet [09:29:22] (03CR) 10JMeybohm: k8s.pool-depool-node: Add support to downtime/remove downtime (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1114000 (owner: 10JMeybohm) [09:32:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-presto1001.eqiad.wmnet [09:32:06] (03CR) 10CI reject: [V:04-1] k8s.pool-depool-node: Add support to downtime/remove downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1114000 (owner: 10JMeybohm) [09:32:14] (03PS1) 10Vgutierrez: hiera: Fix lvs::realserver::pools config for text and upload [puppet] - 10https://gerrit.wikimedia.org/r/1114337 [09:33:45] (03CR) 10Jelto: "to comments in-line" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113980 (owner: 10JMeybohm) [09:35:16] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114337 (owner: 10Vgutierrez) [09:38:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P72452 and previous config saved to /var/cache/conftool/dbconfig/20250127-093847-marostegui.json [09:40:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1096-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [09:40:54] (03CR) 10Filippo Giunchedi: [C:03+2] benthos: add nocookies and tls session metadata [puppet] - 
10https://gerrit.wikimedia.org/r/1112248 (https://phabricator.wikimedia.org/T383900) (owner: 10Filippo Giunchedi) [09:41:16] (03PS4) 10JMeybohm: CI: Fix helm errors hiding behind YAML parser [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113980 [09:41:30] (03CR) 10JMeybohm: CI: Fix helm errors hiding behind YAML parser (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113980 (owner: 10JMeybohm) [09:42:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1241 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72453 and previous config saved to /var/cache/conftool/dbconfig/20250127-094206-root.json [09:47:32] !log reimaging rpki1001 to bookworm [09:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host rpki1001.eqiad.wmnet with OS bookworm [09:49:21] (03CR) 10Jcrespo: [C:03+1] No longer import prometheus-mysqld-exporter from bullseye-backports [puppet] - 10https://gerrit.wikimedia.org/r/1111269 (https://phabricator.wikimedia.org/T383557) (owner: 10Muehlenhoff) [09:49:30] (03CR) 10Jelto: [C:03+1] "lgtm as far as I can tell" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113980 (owner: 10JMeybohm) [09:50:16] (03CR) 10Fabfur: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1114329 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [09:50:27] 06SRE, 13Patch-For-Review: Add x-analytics nocookie=1 and x-tls-sess to webrequest-sampled-live stream - https://phabricator.wikimedia.org/T383900#10496038 (10fgiunchedi) `tls_sess` and `nocookies` fields are now part of `webrequest_sampled` topic! 
[09:53:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T384592)', diff saved to https://phabricator.wikimedia.org/P72454 and previous config saved to /var/cache/conftool/dbconfig/20250127-095354-marostegui.json [09:53:59] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [09:54:10] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1185.eqiad.wmnet with reason: Maintenance [09:54:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1185 (T384592)', diff saved to https://phabricator.wikimedia.org/P72455 and previous config saved to /var/cache/conftool/dbconfig/20250127-095416-marostegui.json [09:55:42] FIRING: JobUnavailable: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:04:02] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on cr[1-2]-magru,cr[1-2]-magru IPv6 with reason: upgrading JunOS on magru core routers [10:04:26] (03CR) 10Jelto: [C:03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [10:05:12] (03CR) 10Jelto: [C:03+1] "forgot to actually hit +1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113979 (owner: 10JMeybohm) [10:07:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet [10:09:08] (03PS1) 10Jcrespo: admin: Add neslihanturan to the list of privileged LDAP-only users [puppet] - 10https://gerrit.wikimedia.org/r/1114341 (https://phabricator.wikimedia.org/T384017) [10:09:27] 06SRE, 10Deployments, 06Release-Engineering-Team, 13Patch-For-Review: Reduce automatic messages on 
#wikimedia-operations - https://phabricator.wikimedia.org/T384804#10496117 (10Volans) > I believe SRE are instead using their own private channel. It's `#wikimedia-sre` and it's a public channel (as mention... [10:09:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test2005.wikimedia.org [10:11:54] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1114341 (https://phabricator.wikimedia.org/T384017) (owner: 10Jcrespo) [10:11:59] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore[1007,1009].eqiad.wmnet with reason: Index rebuild + upgrade [10:13:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test2005.wikimedia.org [10:14:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T384592)', diff saved to https://phabricator.wikimedia.org/P72456 and previous config saved to /var/cache/conftool/dbconfig/20250127-101401-marostegui.json [10:14:06] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [10:16:46] (03CR) 10Jcrespo: [C:03+2] admin: Add neslihanturan to the list of privileged LDAP-only users [puppet] - 10https://gerrit.wikimedia.org/r/1114341 (https://phabricator.wikimedia.org/T384017) (owner: 10Jcrespo) [10:16:54] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore[1007,1009].eqiad.wmnet with reason: Index rebuild + upgrade [10:18:59] 06SRE, 10Deployments, 06Release-Engineering-Team, 13Patch-For-Review: Reduce automatic messages on #wikimedia-operations - https://phabricator.wikimedia.org/T384804#10496179 (10hashar) [10:20:06] !log installing updated JunOS image on cr2-magru T384774 [10:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:10] T384774: Jan 2025 - Magru core router connectivity blips - 
https://phabricator.wikimedia.org/T384774 [10:20:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test2004.wikimedia.org [10:23:15] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmde for Neslihan Turan - WMDE - https://phabricator.wikimedia.org/T384017#10496200 (10jcrespo) 05Open→03Resolved Your account, @Neslihan_Turan_WMDE, already appears as a member of the NDA and WMDE groups: https://ldap.toolforge.... [10:24:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test2004.wikimedia.org [10:25:42] RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:26:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1024 T384820', diff saved to https://phabricator.wikimedia.org/P72457 and previous config saved to /var/cache/conftool/dbconfig/20250127-102657-marostegui.json [10:27:02] T384820: decommission es1024.eqiad.wmnet - https://phabricator.wikimedia.org/T384820 [10:27:31] PROBLEM - ganeti-wconfd running on ganeti2020 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [10:27:43] (03PS1) 10Marostegui: es1024: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1114342 (https://phabricator.wikimedia.org/T384820) [10:28:03] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114329 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [10:28:29] (03CR) 10Marostegui: [C:03+2] es1024: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1114342 (https://phabricator.wikimedia.org/T384820) (owner: 10Marostegui) [10:29:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db1185', diff saved to https://phabricator.wikimedia.org/P72458 and previous config saved to /var/cache/conftool/dbconfig/20250127-102908-marostegui.json [10:29:59] (03CR) 10Vgutierrez: [C:03+1] hiera: add haproxykafka to esams [puppet] - 10https://gerrit.wikimedia.org/r/1114329 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [10:31:45] (03CR) 10Fabfur: "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1114329 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [10:31:47] (03CR) 10Fabfur: [C:03+2] hiera: add haproxykafka to esams [puppet] - 10https://gerrit.wikimedia.org/r/1114329 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [10:33:44] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host rpki1001.eqiad.wmnet with OS bookworm [10:34:05] !log installing haproxykafka on esams (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1114329) (T378578) [10:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:09] T378578: Rollout haproxykafka on all hosts - https://phabricator.wikimedia.org/T378578 [10:34:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host rpki1001.eqiad.wmnet with OS bookworm [10:35:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [10:35:47] (03CR) 10Klausman: "Thanks a ton for your work on this!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1114012 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [10:36:28] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host rpki1001.eqiad.wmnet with OS bookworm [10:36:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host rpki1001.eqiad.wmnet with OS bookworm [10:37:46] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1171.eqiad.wmnet with reason: reimage [10:38:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test1004.wikimedia.org [10:40:07] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [10:40:46] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:40:48] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:42:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test1004.wikimedia.org [10:42:10] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet [10:42:23] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10496316 (10ops-monitoring-bot) Draining ganeti2025.codfw.wmnet of running VMs [10:43:18] (03PS1) 10Muehlenhoff: Switch ganeti2025 to nftables [puppet] - 
10https://gerrit.wikimedia.org/r/1114346 [10:43:43] !log root@cumin1002 START - Cookbook sre.hosts.reimage for host db1171.eqiad.wmnet with OS bookworm [10:43:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet [10:43:58] !log testing pybal 1.15.15 in lvs4010 [10:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P72459 and previous config saved to /var/cache/conftool/dbconfig/20250127-104415-marostegui.json [10:47:38] !log rebooting cr2-magru to complete upgrade T384774 [10:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:42] T384774: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774 [10:50:22] PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv6: Idle - wmf_public_asn, AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:50:28] PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv6: Idle - wmf_public_asn, AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:50:40] (03CR) 10Fabfur: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1113478 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [10:50:48] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:50:54] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:54:48] ^^ this is due to cr2-magru rebooting, all ok [10:54:48] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:54:48] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:56:32] 06SRE, 06DBA, 13Patch-For-Review: db2182 depooled (Errmsg: Error Index for table recentchanges is corrupt) - https://phabricator.wikimedia.org/T384801#10496376 (10Marostegui) Tables rebuilt, host catching up. 
[10:56:43] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on asw1-b[3-4]-magru.mgmt with reason: upgrading JunOS on magru core routers [10:58:18] FTR, I probably won’t be able to do the UTC afternoon backport window today [10:58:23] RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:58:30] RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:58:50] RECOVERY - BFD status on cr2-eqdfw is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:58:54] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:59:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T384592)', diff saved to https://phabricator.wikimedia.org/P72460 and previous config saved to /var/cache/conftool/dbconfig/20250127-105922-marostegui.json [10:59:27] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [10:59:38] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1200.eqiad.wmnet with reason: Maintenance [10:59:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T384592)', diff saved to https://phabricator.wikimedia.org/P72461 and previous config saved to /var/cache/conftool/dbconfig/20250127-105944-marostegui.json [10:59:48] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2185.codfw.wmnet [10:59:59] (03PS1) 10Muehlenhoff: Fix preseed pattern for cloudcephosd hosts [puppet] - 10https://gerrit.wikimedia.org/r/1114349 [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T1100) [11:00:09] (03PS2) 10Muehlenhoff: Fix preseed pattern for cloudcephosd hosts [puppet] - 10https://gerrit.wikimedia.org/r/1114349 [11:04:13] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host rpki1001.eqiad.wmnet with OS bookworm [11:04:26] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2185.codfw.wmnet [11:08:38] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:09:07] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1114349 (owner: 10Muehlenhoff) [11:11:19] (03CR) 10Muehlenhoff: [C:03+2] Fix preseed pattern for cloudcephosd hosts [puppet] - 10https://gerrit.wikimedia.org/r/1114349 (owner: 10Muehlenhoff) [11:14:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet [11:14:57] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2025.codfw.wmnet [11:17:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host rpki1001.eqiad.wmnet with OS bookworm [11:19:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T384592)', diff saved to https://phabricator.wikimedia.org/P72462 and previous config saved to /var/cache/conftool/dbconfig/20250127-111924-marostegui.json [11:19:29] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [11:19:52] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster2003.codfw.wmnet to drbd [11:19:57] (03CR) 10JMeybohm: [C:03+1] drivers.py: add container_limits to the Docker driver (031 
comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1113477 (owner: 10Elukey) [11:20:15] !log installing updated JunOS image on cr1-magru T384774 [11:20:16] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10496490 (10ops-monitoring-bot) VM kubestagemaster2003.codfw.wmnet switching disk type to drbd [11:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:19] T384774: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774 [11:23:54] (03PS1) 10Gergő Tisza: Add machine-readable markings for SUL3 extension denylist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114351 [11:24:06] (03PS1) 10Vgutierrez: wmflib,pybal: Add scheduler_flag support [puppet] - 10https://gerrit.wikimedia.org/r/1114352 (https://phabricator.wikimedia.org/T373027) [11:24:47] (03CR) 10TChin: [C:03+2] mw-content-history-reconcile-enrich: Add HA storageDir and Ceph egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109448 (https://phabricator.wikimedia.org/T375176) (owner: 10TChin) [11:25:11] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
[11:26:10] (03Merged) 10jenkins-bot: mw-content-history-reconcile-enrich: Add HA storageDir and Ceph egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109448 (https://phabricator.wikimedia.org/T375176) (owner: 10TChin) [11:26:28] jouncebot: now [11:26:28] For the next 0 hour(s) and 33 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T1100) [11:27:41] (03CR) 10Effie Mouzeli: [C:03+1] mw-(api-ext|web): scale next to 10% of main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114005 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [11:27:53] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on rpki1001.eqiad.wmnet with reason: host reimage [11:29:38] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10496529 (10cmooney) [11:29:38] (03CR) 10Effie Mouzeli: [C:03+2] mw-(api-ext|web): scale next to 10% of main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114005 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [11:30:25] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10496532 (10cmooney) [11:30:45] (03CR) 10Alexandros Kosiaris: [C:03+1] mw-(api-ext|web): scale next to 10% of main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114005 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [11:31:18] (03Merged) 10jenkins-bot: mw-(api-ext|web): scale next to 10% of main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114005 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [11:32:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rpki1001.eqiad.wmnet with reason: host reimage [11:33:03] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] START 
helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [11:33:09] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [11:34:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P72463 and previous config saved to /var/cache/conftool/dbconfig/20250127-113431-marostegui.json [11:34:46] !log rebooting cr1-magru to complete upgrade T384774 [11:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:50] T384774: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774 [11:35:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster2003.codfw.wmnet to drbd [11:35:41] PROBLEM - Host kubestagemaster2003 is DOWN: PING CRITICAL - Packet loss = 100% [11:35:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet [11:36:18] ^ expected, temporarily changing disk image to reimage a ganeti node [11:36:25] RECOVERY - Host kubestagemaster2003 is UP: PING OK - Packet loss = 0%, RTA = 30.79 ms [11:36:26] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10496565 (10ops-monitoring-bot) Draining ganeti2025.codfw.wmnet of running VMs [11:36:28] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [11:36:57] FIRING: KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2003.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:37:00] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet [11:37:28] (03PS2) 10Vgutierrez: wmflib,pybal: Add scheduler_flag support [puppet] - 10https://gerrit.wikimedia.org/r/1114352 (https://phabricator.wikimedia.org/T373027) [11:37:45] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:37:45] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:37:47] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10496573 (10MoritzMuehlenhoff) [11:38:31] ^^ these are due to cr1-magru reboot [11:38:37] RESOLVED: SystemdUnitFailed: wmf_auto_restart_gnmic.service on netflow1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:38:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster2003.codfw.wmnet to plain [11:39:18] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10496584 (10ops-monitoring-bot) VM kubestagemaster2003.codfw.wmnet switching disk type to plain [11:39:18] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [11:39:19] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1215.eqiad.wmnet [11:39:31] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [11:39:42] !log Upgrade and reboot zarcillo/orchestrator database db1215 [11:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:32] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster2003.codfw.wmnet to plain [11:40:48] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet [11:41:07] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10496608 (10ops-monitoring-bot) Draining ganeti2025.codfw.wmnet of running VMs [11:41:43] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [11:41:57] RESOLVED: KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2003.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:41:59] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [11:44:35] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114352 (https://phabricator.wikimedia.org/T373027) (owner: 10Vgutierrez) [11:44:49] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:44:53] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [11:45:06] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1215.eqiad.wmnet [11:45:49] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:45:58] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [11:46:10] (03PS1) 10Kamila Součková: wikikube: rename parse10[18-24] to wikikube-worker* [puppet] - 10https://gerrit.wikimedia.org/r/1114354 (https://phabricator.wikimedia.org/T365571) [11:47:08] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host rpki1001.eqiad.wmnet with OS bookworm [11:49:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P72465 and previous config saved to /var/cache/conftool/dbconfig/20250127-114938-marostegui.json [11:50:15] (03CR) 10Effie Mouzeli: [C:03+2] mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) (owner: 10Effie Mouzeli) [11:50:15] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [11:51:31] FIRING: [5x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [11:52:14] (03Merged) 10jenkins-bot: mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) (owner: 10Effie Mouzeli) [11:52:22] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10496659 (10Papaul) [11:52:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 10%: Repooling T384807', diff saved to https://phabricator.wikimedia.org/P72466 and previous config saved to /var/cache/conftool/dbconfig/20250127-115239-root.json [11:52:44] T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807 [11:53:22] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10496664 (10Papaul) @Jhancock.wm you can move ganeti2020 anytime today. 
Once done just ping @MoritzMuehlenhoff .... [11:53:42] FIRING: JobUnavailable: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:55:35] (03PS1) 10Vgutierrez: service: Add scheduler_flag field to ServiceLVS [software/spicerack] - 10https://gerrit.wikimedia.org/r/1114356 (https://phabricator.wikimedia.org/T373027) [11:56:30] !log root@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db1171.eqiad.wmnet with OS bookworm [11:58:00] !log root@cumin1002 START - Cookbook sre.hosts.reimage for host db1171.eqiad.wmnet with OS bookworm [12:02:47] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti2020 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1113963 (owner: 10Muehlenhoff) [12:04:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T384592)', diff saved to https://phabricator.wikimedia.org/P72467 and previous config saved to /var/cache/conftool/dbconfig/20250127-120445-marostegui.json [12:04:50] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [12:05:01] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1210.eqiad.wmnet with reason: Maintenance [12:05:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1210 (T384592)', diff saved to https://phabricator.wikimedia.org/P72468 and previous config saved to /var/cache/conftool/dbconfig/20250127-120507-marostegui.json [12:06:39] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/1114015 (https://phabricator.wikimedia.org/T373020) (owner: 10Vgutierrez) [12:07:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 25%: Repooling T384807', diff saved to https://phabricator.wikimedia.org/P72469 and previous config 
saved to /var/cache/conftool/dbconfig/20250127-120744-root.json [12:07:49] T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807 [12:08:42] RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:08:58] !log installing git-lfs security updates [12:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:39] (03CR) 10Vgutierrez: "`swift` and `swift-https` services are the only services defined on `hieradata/common/service.yaml` from the LVS PoV. `swift-https` LVS se" [puppet] - 10https://gerrit.wikimedia.org/r/1114015 (https://phabricator.wikimedia.org/T373020) (owner: 10Vgutierrez) [12:12:41] !log root@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1171.eqiad.wmnet with reason: host reimage [12:15:27] !jouncebot now [12:15:27] a Python reminder bot for deployments. see https://wikitech.wikimedia.org/wiki/Tool:Jouncebot [12:15:32] !jouncebot next [12:15:32] a Python reminder bot for deployments. 
see https://wikitech.wikimedia.org/wiki/Tool:Jouncebot [12:16:39] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1171.eqiad.wmnet with reason: host reimage [12:17:22] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114365 [12:18:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2227 T384807', diff saved to https://phabricator.wikimedia.org/P72470 and previous config saved to /var/cache/conftool/dbconfig/20250127-121843-marostegui.json [12:18:48] T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807 [12:19:01] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2227.codfw.wmnet [12:21:45] (03CR) 10Volans: [C:03+1] "LGTM, this can be merged anytime as the new property has a default value" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1114356 (https://phabricator.wikimedia.org/T373027) (owner: 10Vgutierrez) [12:22:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 50%: Repooling T384807', diff saved to https://phabricator.wikimedia.org/P72471 and previous config saved to /var/cache/conftool/dbconfig/20250127-122249-root.json [12:23:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T384592)', diff saved to https://phabricator.wikimedia.org/P72472 and previous config saved to /var/cache/conftool/dbconfig/20250127-122320-marostegui.json [12:23:25] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [12:25:02] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2227.codfw.wmnet [12:27:04] (03CR) 10Muehlenhoff: [C:03+2] No longer import prometheus-mysqld-exporter from bullseye-backports [puppet] - 10https://gerrit.wikimedia.org/r/1111269 (https://phabricator.wikimedia.org/T383557) (owner: 10Muehlenhoff) [12:29:12] !log root@cumin1002 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 12:00:00 on db2227.codfw.wmnet with reason: Index rebuild [12:31:21] 10ops-magru, 06Infrastructure-Foundations, 10netops: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774#10496715 (10RobH) [12:37:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 75%: Repooling T384807', diff saved to https://phabricator.wikimedia.org/P72473 and previous config saved to /var/cache/conftool/dbconfig/20250127-123754-root.json [12:38:00] T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807 [12:38:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P72474 and previous config saved to /var/cache/conftool/dbconfig/20250127-123827-marostegui.json [12:39:15] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1171.eqiad.wmnet with OS bookworm [12:50:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1096-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:53:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 100%: Repooling T384807', diff saved to https://phabricator.wikimedia.org/P72476 and previous config saved to /var/cache/conftool/dbconfig/20250127-125301-root.json [12:53:05] (03PS1) 10Marostegui: Revert "db2182: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1114375 [12:53:06] T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807 [12:53:10] (03CR) 10Marostegui: [C:04-2] "Not yet" [puppet] - 10https://gerrit.wikimedia.org/r/1114375 (owner: 10Marostegui) [12:53:35] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P72477 and previous config saved to /var/cache/conftool/dbconfig/20250127-125334-marostegui.json [12:54:05] (03PS1) 10Btullis: Add the service_proxy to an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/1114376 (https://phabricator.wikimedia.org/T384329) [12:54:26] (03CR) 10CI reject: [V:04-1] Add the service_proxy to an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/1114376 (https://phabricator.wikimedia.org/T384329) (owner: 10Btullis) [12:55:27] (03PS2) 10Btullis: Add the service_proxy to an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/1114376 (https://phabricator.wikimedia.org/T384329) [12:56:55] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4862/co" [puppet] - 10https://gerrit.wikimedia.org/r/1114376 (https://phabricator.wikimedia.org/T384329) (owner: 10Btullis) [12:57:31] (03PS2) 10Anzx: srwiki: add incubator as importsource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114378 (https://phabricator.wikimedia.org/T384069) [12:57:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114378 (https://phabricator.wikimedia.org/T384069) (owner: 10Anzx) [12:58:12] 06SRE, 06Data-Engineering, 06Data-Engineering-Icebox, 10observability, and 2 others: Retire udp2log: onboard its producers and consumers to the logging pipeline - https://phabricator.wikimedia.org/T205856#10496789 (10fgiunchedi) Something that occurred to me while looking at {T366710}: with mw-to-k8s we ar... 
[13:00:30] PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 1200MiB (0% inode=94%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops
[13:01:33] (PS3) Anzx: enwiki: temporary lift of IP cap for 31 January and 1 February 2025 [mediawiki-config] - https://gerrit.wikimedia.org/r/1114372 (https://phabricator.wikimedia.org/T384680)
[13:01:55] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1114372 (https://phabricator.wikimedia.org/T384680) (owner: Anzx)
[13:07:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet
[13:08:24] (CR) Elukey: drivers.py: add container_limits to the Docker driver (1 comment) [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1113477 (owner: Elukey)
[13:08:32] (CR) Elukey: [C:+2] drivers.py: add container_limits to the Docker driver [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1113477 (owner: Elukey)
[13:08:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T384592)', diff saved to https://phabricator.wikimedia.org/P72478 and previous config saved to /var/cache/conftool/dbconfig/20250127-130841-marostegui.json
[13:08:46] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[13:08:46] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[13:10:36] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1013.eqiad.wmnet with OS bullseye
[13:10:57] (CR) Effie Mouzeli: [C:+1] dsh: empty scap proxy list [puppet] - https://gerrit.wikimedia.org/r/1112714 (https://phabricator.wikimedia.org/T384196) (owner: Hnowlan)
[13:11:00] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1013.eqiad.wmnet with OS bullseye
[13:12:26] (Merged) jenkins-bot: drivers.py: add container_limits to the Docker driver [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1113477 (owner: Elukey)
[13:13:21] ops-codfw, SRE, SRE-swift-storage, DC-Ops, Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10496815 (elukey)
[13:13:30] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2025.codfw.wmnet with reason: remove from cluster for reimage
[13:13:34] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10496817 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=4302b551-98b7-475e-9fb4-959f5c56a6cc) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(...
[13:14:35] (CR) Muehlenhoff: [C:+2] Switch ganeti2025 to nftables [puppet] - https://gerrit.wikimedia.org/r/1114346 (owner: Muehlenhoff)
[13:15:08] (CR) Marostegui: Revert "db2182: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1114375 (owner: Marostegui)
[13:15:09] (CR) Marostegui: [C:+2] Revert "db2182: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1114375 (owner: Marostegui)
[13:15:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2182 (re)pooling @ 10%: Repooling T384801', diff saved to https://phabricator.wikimedia.org/P72479 and previous config saved to /var/cache/conftool/dbconfig/20250127-131554-root.json
[13:15:59] T384801: db2182 depooled (Errmsg: Error Index for table recentchanges is corrupt) - https://phabricator.wikimedia.org/T384801
[13:16:11] SRE, DBA, Patch-For-Review: db2182 depooled (Errmsg: Error Index for table recentchanges is corrupt) - https://phabricator.wikimedia.org/T384801#10496825 (Marostegui) Open→Resolved Host being repooled automatically.
[13:18:50] !log fceratto@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2140.codfw.wmnet [13:23:44] !log fceratto@cumin1002 START - Cookbook sre.dns.netbox [13:23:54] (03CR) 10JMeybohm: [C:03+1] "🥳" [puppet] - 10https://gerrit.wikimedia.org/r/1114354 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková) [13:25:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2025.codfw.wmnet with OS bookworm [13:26:02] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10496842 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2025.codfw.wmnet with OS bookworm [13:26:44] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1013.eqiad.wmnet with reason: host reimage [13:27:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [13:27:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1157 T384807', diff saved to https://phabricator.wikimedia.org/P72480 and previous config saved to /var/cache/conftool/dbconfig/20250127-132710-marostegui.json [13:27:15] T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807 [13:27:32] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1157.eqiad.wmnet [13:27:33] !log fceratto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2140.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1002" [13:28:00] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on 
db1230.eqiad.wmnet with reason: Maintenance [13:28:00] !log fceratto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2140.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1002" [13:28:01] !log fceratto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:28:01] !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2140.codfw.wmnet [13:28:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1230 (T384592)', diff saved to https://phabricator.wikimedia.org/P72481 and previous config saved to /var/cache/conftool/dbconfig/20250127-132806-marostegui.json [13:28:14] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [13:28:41] (03CR) 10Federico Ceratto: [C:03+1] site.pp, db2140.yaml: remove db2140 [puppet] - 10https://gerrit.wikimedia.org/r/1113967 (https://phabricator.wikimedia.org/T384480) (owner: 10Federico Ceratto) [13:28:58] (03PS2) 10Federico Ceratto: site.pp, db2140.yaml: remove db2140 [puppet] - 10https://gerrit.wikimedia.org/r/1113967 (https://phabricator.wikimedia.org/T384480) [13:31:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2182 (re)pooling @ 25%: Repooling T384801', diff saved to https://phabricator.wikimedia.org/P72482 and previous config saved to /var/cache/conftool/dbconfig/20250127-133059-root.json [13:31:06] T384801: db2182 depooled (Errmsg: Error Index for table recentchanges is corrupt) - https://phabricator.wikimedia.org/T384801 [13:31:18] (03CR) 10Federico Ceratto: [C:03+2] site.pp, db2140.yaml: remove db2140 [puppet] - 10https://gerrit.wikimedia.org/r/1113967 (https://phabricator.wikimedia.org/T384480) (owner: 10Federico Ceratto) [13:32:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [13:32:25] !log installing runc security updates on bullseye [13:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:31] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1013.eqiad.wmnet with reason: host reimage [13:34:04] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1157.eqiad.wmnet [13:34:53] (03PS1) 10Urbanecm: [Growth] enwiki: Release Add Link to 10% of newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114379 (https://phabricator.wikimedia.org/T384551) [13:34:54] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1157.eqiad.wmnet with reason: Index rebuild [13:35:46] !log Removing db2140 from zarcillo T384480 [13:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:50] T384480: decommission db2140.codfw.wmnet - https://phabricator.wikimedia.org/T384480 [13:36:32] (03PS1) 10TChin: mw-content-history-reconcile-enrich: Use full kafka test fqdn [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114380 (https://phabricator.wikimedia.org/T375176) [13:38:24] 10ops-magru, 06DC-Ops: hw troubleshooting: Power supply failure (PSU) for cp7001.magru.wmnet and cp7006.magru.wmnet - https://phabricator.wikimedia.org/T381446#10496896 (10RobH) Summary of case updates since 22nd: * Dell opened the case and requested the TSR which I couldn't attach due to it being 22MB, so the... 
[13:39:03] (03CR) 10JMeybohm: [C:03+1] kserve: add support for PSS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114012 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [13:39:38] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission db2140.codfw.wmnet - https://phabricator.wikimedia.org/T384480#10496900 (10FCeratto-WMF) [13:40:47] (03CR) 10Btullis: "If it's a temporary workaround, could we not add it to the search instance alone?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114327 (https://phabricator.wikimedia.org/T384805) (owner: 10DCausse) [13:41:19] (03CR) 10Brouberol: [C:03+1] mw-content-history-reconcile-enrich: Use full kafka test fqdn [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114380 (https://phabricator.wikimedia.org/T375176) (owner: 10TChin) [13:41:38] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission db2140.codfw.wmnet - https://phabricator.wikimedia.org/T384480#10496903 (10FCeratto-WMF) The host is ready for the DC-Ops team to decommission. [13:43:21] (03CR) 10TChin: [C:03+2] mw-content-history-reconcile-enrich: Use full kafka test fqdn [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114380 (https://phabricator.wikimedia.org/T375176) (owner: 10TChin) [13:43:42] (03CR) 10JMeybohm: [C:03+1] "looks reasonable. The audit log should tell if you missed anything." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1114016 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [13:44:00] (03PS1) 10Filippo Giunchedi: query_service: clean up icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/1114381 (https://phabricator.wikimedia.org/T358029) [13:44:36] (03Merged) 10jenkins-bot: mw-content-history-reconcile-enrich: Use full kafka test fqdn [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114380 (https://phabricator.wikimedia.org/T375176) (owner: 10TChin) [13:46:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2182 (re)pooling @ 50%: Repooling T384801', diff saved to https://phabricator.wikimedia.org/P72483 and previous config saved to /var/cache/conftool/dbconfig/20250127-134605-root.json [13:46:12] T384801: db2182 depooled (Errmsg: Error Index for table recentchanges is corrupt) - https://phabricator.wikimedia.org/T384801 [13:46:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T384592)', diff saved to https://phabricator.wikimedia.org/P72484 and previous config saved to /var/cache/conftool/dbconfig/20250127-134650-marostegui.json [13:46:55] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [13:46:58] (03PS1) 10Dreamrimmer: Changed default license for Wikinews to CC-BY-4.0 and for fawikinews and svwikinews to CC-BY-SA-4.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114382 (https://phabricator.wikimedia.org/T384614) [13:47:26] (03CR) 10DCausse: "I think the problem is not only affecting search, but all airflow instances on my side I'll never hit the 'Trigger DAG' again and rely on " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114327 (https://phabricator.wikimedia.org/T384805) (owner: 10DCausse) [13:48:12] (03PS1) 10Brouberol: envoy: define an mw-misc service mesh entry [puppet] - 10https://gerrit.wikimedia.org/r/1114383 (https://phabricator.wikimedia.org/T384329) [13:49:48] (03PS2) 
10Brouberol: envoy: define an mw-misc service mesh entry [puppet] - 10https://gerrit.wikimedia.org/r/1114383 (https://phabricator.wikimedia.org/T384329) [13:50:46] (03PS1) 10Andrew Bogott: Updates for cloudcephosd1013: puppet 7 + Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1114384 [13:51:18] (03CR) 10Andrew Bogott: [C:03+2] Updates for cloudcephosd1013: puppet 7 + Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1114384 (owner: 10Andrew Bogott) [13:53:03] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1013.eqiad.wmnet with OS bullseye [13:53:17] (03CR) 10JMeybohm: envoy: define an mw-misc service mesh entry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1114383 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol) [13:53:23] !log gmodena@deploy2002 Started deploy [airflow-dags/search@3c004c1]: syncing artifacts [13:53:25] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [13:53:31] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [13:53:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114382 (https://phabricator.wikimedia.org/T384614) (owner: 10Dreamrimmer) [13:53:45] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1013.eqiad.wmnet with OS bullseye [13:53:52] !log gmodena@deploy2002 Finished deploy [airflow-dags/search@3c004c1]: syncing artifacts (duration: 01m 04s) [13:56:17] (03CR) 10Ssingh: [C:03+1] "Yep, makes sense, nice catch and thanks for fixing it." [puppet] - 10https://gerrit.wikimedia.org/r/1114337 (owner: 10Vgutierrez) [13:58:44] (03CR) 10Ssingh: "Thanks for the patch! 
I propose that we either do this for all single-backend sites (profile::cache::varnish::frontend::single_backend: tr" [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [13:59:47] (03PS3) 10Brouberol: envoy: define an mw-misc service mesh entry [puppet] - 10https://gerrit.wikimedia.org/r/1114383 (https://phabricator.wikimedia.org/T384329) [13:59:55] (03CR) 10Brouberol: envoy: define an mw-misc service mesh entry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1114383 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol) [14:00:13] Lucas_WMDE, Urbanecm, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T1400). [14:00:13] toni_, anzx, and DreamRimmer: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:23] (CR) Michael Große: [C:+1] [Growth] enwiki: Release Add Link to 10% of newcomers [mediawiki-config] - https://gerrit.wikimedia.org/r/1114379 (https://phabricator.wikimedia.org/T384551) (owner: Urbanecm)
[14:00:23] o/
[14:00:33] here
[14:01:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2182 (re)pooling @ 75%: Repooling T384801', diff saved to https://phabricator.wikimedia.org/P72485 and previous config saved to /var/cache/conftool/dbconfig/20250127-140111-root.json
[14:01:16] T384801: db2182 depooled (Errmsg: Error Index for table recentchanges is corrupt) - https://phabricator.wikimedia.org/T384801
[14:01:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P72486 and previous config saved to /var/cache/conftool/dbconfig/20250127-140157-marostegui.json
[14:02:56] I can’t deploy today, sorry
[14:04:52] (PS1) Slyngshede: Upgrade to CAS 7.1 [dns] - https://gerrit.wikimedia.org/r/1114388
[14:06:11] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2025.codfw.wmnet with reason: host reimage
[14:07:23] (CR) JMeybohm: [C:+1] envoy: define an mw-misc service mesh entry [puppet] - https://gerrit.wikimedia.org/r/1114383 (https://phabricator.wikimedia.org/T384329) (owner: Brouberol)
[14:07:26] I can
[14:07:40] (CR) Zabe: [C:+2] Add ios.article_link_interaction stream to config [mediawiki-config] - https://gerrit.wikimedia.org/r/1113996 (https://phabricator.wikimedia.org/T382031) (owner: Tsevener)
[14:07:56] (CR) Zabe: [C:+2] srwiki: add incubator as importsource [mediawiki-config] - https://gerrit.wikimedia.org/r/1114378 (https://phabricator.wikimedia.org/T384069) (owner: Anzx)
[14:08:30] (CR) Zabe: [C:+2] enwiki: temporary lift of IP cap for 31 January and 1 February 2025 [mediawiki-config] - https://gerrit.wikimedia.org/r/1114372 (https://phabricator.wikimedia.org/T384680) (owner: Anzx)
[14:09:11] (03Merged) 10jenkins-bot: Add ios.article_link_interaction stream to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113996 (https://phabricator.wikimedia.org/T382031) (owner: 10Tsevener) [14:09:13] (03Merged) 10jenkins-bot: srwiki: add incubator as importsource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114378 (https://phabricator.wikimedia.org/T384069) (owner: 10Anzx) [14:09:15] (03Merged) 10jenkins-bot: enwiki: temporary lift of IP cap for 31 January and 1 February 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114372 (https://phabricator.wikimedia.org/T384680) (owner: 10Anzx) [14:09:52] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1113996|Add ios.article_link_interaction stream to config (T382031)]], [[gerrit:1114378|srwiki: add incubator as importsource (T384069)]], [[gerrit:1114372|enwiki: temporary lift of IP cap for 31 January and 1 February 2025 (T384680)]] [14:09:59] T382031: Track impressions for article views - https://phabricator.wikimedia.org/T382031 [14:10:00] T384069: Add an import source for "Special:Import" on sr.wiki - https://phabricator.wikimedia.org/T384069 [14:10:00] T384680: Requesting temporary lift of IP cap for 31 January and 1 February 2025 - https://phabricator.wikimedia.org/T384680 [14:10:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2025.codfw.wmnet with reason: host reimage [14:10:38] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1013.eqiad.wmnet with reason: host reimage [14:11:03] (03CR) 10Brouberol: [C:03+2] envoy: define an mw-misc service mesh entry [puppet] - 10https://gerrit.wikimedia.org/r/1114383 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol) [14:12:25] (03CR) 10Bartosz Dziewoński: [C:03+1] "Seems harmless if it helps you." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114351 (owner: 10Gergő Tisza) [14:13:49] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1013.eqiad.wmnet with reason: host reimage [14:14:15] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1114388 (owner: 10Slyngshede) [14:16:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2182 (re)pooling @ 100%: Repooling T384801', diff saved to https://phabricator.wikimedia.org/P72487 and previous config saved to /var/cache/conftool/dbconfig/20250127-141616-root.json [14:16:20] (03CR) 10Ottomata: [C:03+1] "This is really cool. We should add this to all analytics clients, including stat boxes! That way airflow dev envs can use the same URLs " [puppet] - 10https://gerrit.wikimedia.org/r/1114376 (https://phabricator.wikimedia.org/T384329) (owner: 10Btullis) [14:16:21] T384801: db2182 depooled (Errmsg: Error Index for table recentchanges is corrupt) - https://phabricator.wikimedia.org/T384801 [14:17:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P72488 and previous config saved to /var/cache/conftool/dbconfig/20250127-141704-marostegui.json [14:17:49] (03CR) 10Fabfur: liberica: Add katran config settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113961 (https://phabricator.wikimedia.org/T380450) (owner: 10Vgutierrez) [14:18:03] it is slow [14:18:16] np [14:21:39] !log zabe@deploy2002 tsev, zabe, anzx: Backport for [[gerrit:1113996|Add ios.article_link_interaction stream to config (T382031)]], [[gerrit:1114378|srwiki: add incubator as importsource (T384069)]], [[gerrit:1114372|enwiki: temporary lift of IP cap for 31 January and 1 February 2025 (T384680)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:21:42] toni_: anzx: can you test your patches? 
[14:21:45] T382031: Track impressions for article views - https://phabricator.wikimedia.org/T382031 [14:21:45] T384069: Add an import source for "Special:Import" on sr.wiki - https://phabricator.wikimedia.org/T384069 [14:21:46] T384680: Requesting temporary lift of IP cap for 31 January and 1 February 2025 - https://phabricator.wikimedia.org/T384680 [14:21:52] zabe: import source looks ok, nothing to test on throttle [14:22:36] looks good to me [14:22:39] alright [14:22:43] !log zabe@deploy2002 tsev, zabe, anzx: Continuing with sync [14:24:49] DreamRimmer: around? [14:24:58] yes [14:25:34] (03PS25) 10Bking: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson) [14:27:12] DreamRimmer: the rfc etc states pretty specific dates for the switch (1st of Feb / 30th of Jan). would you say it is okay to already do it today? [14:28:09] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host parse[1018-1024].eqiad.wmnet [14:28:18] (03CR) 10Kamila Součková: [C:03+2] wikikube: rename parse10[18-24] to wikikube-worker* [puppet] - 10https://gerrit.wikimedia.org/r/1114354 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková) [14:28:37] I don't see any issue [14:29:22] (03CR) 10Vgutierrez: [C:03+2] hiera: Fix lvs::realserver::pools config for text and upload [puppet] - 10https://gerrit.wikimedia.org/r/1114337 (owner: 10Vgutierrez) [14:31:00] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1013.eqiad.wmnet with OS bullseye [14:31:54] (03PS1) 10Filippo Giunchedi: base: absent check_microcode [puppet] - 10https://gerrit.wikimedia.org/r/1114391 (https://phabricator.wikimedia.org/T350694) [14:31:55] (03CR) 10Zabe: [C:03+2] Changed default license for Wikinews to CC-BY-4.0 and for fawikinews and svwikinews to CC-BY-SA-4.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114382 
(https://phabricator.wikimedia.org/T384614) (owner: 10Dreamrimmer) [14:32:03] (03CR) 10Zabe: [C:03+2] Increase revision-slots cache expiry back to default for most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114060 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [14:32:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T384592)', diff saved to https://phabricator.wikimedia.org/P72489 and previous config saved to /var/cache/conftool/dbconfig/20250127-143211-marostegui.json [14:32:16] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [14:32:20] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host parse[1018-1024].eqiad.wmnet [14:32:26] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1245.eqiad.wmnet with reason: Maintenance [14:32:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2025.codfw.wmnet with OS bookworm [14:32:52] (03Merged) 10jenkins-bot: Changed default license for Wikinews to CC-BY-4.0 and for fawikinews and svwikinews to CC-BY-SA-4.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114382 (https://phabricator.wikimedia.org/T384614) (owner: 10Dreamrimmer) [14:32:59] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10497129 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2025.codfw.wmnet with OS bookworm completed: - ganeti202... 
[14:33:07] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1018 to wikikube-worker1159 [14:33:27] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [14:33:40] (03Merged) 10jenkins-bot: Increase revision-slots cache expiry back to default for most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114060 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [14:34:02] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission db2140.codfw.wmnet - https://phabricator.wikimedia.org/T384480#10497132 (10Marostegui) a:05FCeratto-WMF→03None [14:34:27] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv [14:34:27] e - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:34:27] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv [14:34:27] e - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:35:30] (03PS1) 10Fabfur: hiera: enable haproxykafka on all datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1114392 (https://phabricator.wikimedia.org/T378578) [14:35:50] (03CR) 10CI reject: [V:04-1] hiera: enable haproxykafka on all datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1114392 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [14:35:53] (03PS2) 10Scott French: mw-on-k8s: aggregate remaining alerts by release name [alerts] - 10https://gerrit.wikimedia.org/r/1114018 (https://phabricator.wikimedia.org/T384532) [14:35:57] (03CR) 10Fabfur: [C:04-1] "Do not merge until 28/01/2025" [puppet] - 10https://gerrit.wikimedia.org/r/1114392 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [14:36:36] (03PS2) 10Fabfur: hiera: enable haproxykafka on all datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1114392 (https://phabricator.wikimedia.org/T378578) [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet [14:36:58] (03CR) 10CI reject: [V:04-1] hiera: enable haproxykafka on all datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1114392 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [14:37:27] !log jelto@cumin1002 START - Cookbook sre.gitlab.reboot-runner rolling reboot on A:gitlab-runner [14:37:29] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1018 to wikikube-worker1159 - kamila@cumin1002" [14:37:38] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1019 to wikikube-worker1160 [14:37:44] 
!log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1018 to wikikube-worker1159 - kamila@cumin1002" [14:37:44] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:37:44] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1159 [14:37:45] (03PS3) 10Fabfur: hiera: enable haproxykafka on all datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1114392 (https://phabricator.wikimedia.org/T378578) [14:37:58] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [14:38:01] for some reason the number of left k8s nodes is increasing [14:38:04] curious [14:38:34] (03CR) 10Fabfur: [C:04-1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114392 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [14:38:46] (03CR) 10Klausman: [C:03+1] "Thank you!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114016 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [14:38:47] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1159 [14:39:25] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1018 to wikikube-worker1159 [14:39:43] ok, aborting [14:40:19] !log zabe@deploy2002 Started scap sync-world: T384614 T183490 [14:40:25] T384614: Change of default license for Wikinews to CC-BY-4.0 and for fawikinews and svwikinews to CC-BY-SA-4.0 on January 30, 2025 - https://phabricator.wikimedia.org/T384614 [14:40:25] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [14:41:44] (03CR) 10Xcollazo: "Excuse my ignorance, but will this also allow us to hit endpoints like "https://noc.wikimedia.org/conf/dblists/open.dblist" ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1114376 (https://phabricator.wikimedia.org/T384329) (owner: 10Btullis) [14:43:09] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1019 to wikikube-worker1160 - kamila@cumin1002" [14:43:25] DreamRimmer: can you test your change? [14:43:30] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1020 to wikikube-worker1161 [14:43:35] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1019 to wikikube-worker1160 - kamila@cumin1002" [14:43:35] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:43:36] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1160 [14:43:41] checking [14:43:50] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [14:44:02] !log zabe@deploy2002 zabe: T384614 T183490 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:44:40] FIRING: [3x] KubernetesRsyslogDown: rsyslog on parse1021:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:45:11] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1160 [14:45:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet [14:45:33] look good to me [14:45:37] alright [14:45:38] !log zabe@deploy2002 zabe: Continuing with sync [14:45:50] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1019 to wikikube-worker1160 [14:47:21] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: 
Renaming parse1020 to wikikube-worker1161 - kamila@cumin1002" [14:47:33] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1021 to wikikube-worker1162 [14:47:37] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1020 to wikikube-worker1161 - kamila@cumin1002" [14:47:37] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:47:37] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1161 [14:47:53] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [14:48:43] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1161 [14:49:22] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1020 to wikikube-worker1161 [14:49:40] FIRING: [3x] KubernetesRsyslogDown: rsyslog on parse1022:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:50:05] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [14:50:11] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [14:50:13] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090529 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson) [14:50:20] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[345] - https://phabricator.wikimedia.org/T384838 (10RobH) 03NEW [14:50:47] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [14:51:26] !log kamila@cumin1002 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1021 to wikikube-worker1162 - kamila@cumin1002" [14:51:35] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [14:51:38] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1022 to wikikube-worker1163 [14:51:38] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [14:51:43] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1021 to wikikube-worker1162 - kamila@cumin1002" [14:51:43] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:51:43] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1162 [14:51:46] sorry, connection dropped. Looks good in prod, thanks for deploying zabe [14:51:58] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [14:52:09] yw [14:52:12] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[345] - https://phabricator.wikimedia.org/T384838#10497221 (10RobH) a:03MoritzMuehlenhoff @MoritzMuehlenhoff, I didn't want to hold up the ordering of parent task T382898 so I've escalated that (with Joanna's approval) and... 
[14:52:49] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1162 [14:52:53] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [14:52:58] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [14:53:01] (03PS1) 10Giuseppe Lavagetto: Revert "webperf: Restrict access to Envoy port" [puppet] - 10https://gerrit.wikimedia.org/r/1114393 [14:53:28] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1021 to wikikube-worker1162 [14:53:42] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10497248 (10Papaul) @JMeybohm can we do this today? if not please let me know when will be a good d... 
[14:54:01] (03PS26) 10Bking: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson) [14:54:11] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090529 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson) [14:54:39] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4863/console" [puppet] - 10https://gerrit.wikimedia.org/r/1103318 (owner: 10Muehlenhoff) [14:55:24] (03PS27) 10Bking: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson) [14:55:41] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1022 to wikikube-worker1163 - kamila@cumin1002" [14:55:44] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1023 to wikikube-worker1164 [14:55:54] (03PS1) 10Phuedx: testwiki: Enable MetricsPlatform experiment enrollment overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114396 (https://phabricator.wikimedia.org/T384728) [14:55:56] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1022 to wikikube-worker1163 - kamila@cumin1002" [14:55:57] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:55:57] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1163 [14:56:03] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [14:56:11] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4864/co" [puppet] - 10https://gerrit.wikimedia.org/r/1114393 (owner: 10Giuseppe Lavagetto) [14:56:27] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T383620#10497264 (10kamila) a:03VRiley-WMF [14:56:57] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[3-8] - https://phabricator.wikimedia.org/T384838#10497267 (10RobH) [14:57:26] !log zabe@deploy2002 sync-world aborted: T384614 T183490 (duration: 17m 07s) [14:57:29] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1163 [14:57:32] T384614: Change of default license for Wikinews to CC-BY-4.0 and for fawikinews and svwikinews to CC-BY-SA-4.0 on January 30, 2025 - https://phabricator.wikimedia.org/T384614 [14:57:32] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [14:57:52] (03CR) 10Máté Szabó: [C:03+1] Revert "webperf: Restrict access to Envoy port" [puppet] - 10https://gerrit.wikimedia.org/r/1114393 (owner: 10Giuseppe Lavagetto) [14:58:08] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1022 to wikikube-worker1163 [14:58:29] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090529 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson) [14:58:56] (03PS2) 10Giuseppe Lavagetto: Revert "webperf: Restrict access to Envoy port" [puppet] - 10https://gerrit.wikimedia.org/r/1114393 (https://phabricator.wikimedia.org/T384836) [14:59:16] ok, it formally aborted, but it reached all k8s nodes [14:59:35] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1023 to wikikube-worker1164 - 
kamila@cumin1002" [14:59:40] (03CR) 10Giuseppe Lavagetto: [C:03+2] Revert "webperf: Restrict access to Envoy port" [puppet] - 10https://gerrit.wikimedia.org/r/1114393 (https://phabricator.wikimedia.org/T384836) (owner: 10Giuseppe Lavagetto) [14:59:43] zabe: did the k8s production update part time out? [14:59:50] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1024 to wikikube-worker1165 [14:59:56] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1023 to wikikube-worker1164 - kamila@cumin1002" [14:59:56] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:59:56] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1164 [15:00:01] if so, I have a theory as to why, which I'll follow up on shortly [15:00:11] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [15:00:12] swfrench-wmf and effie: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki infrastructure (UTC afternoon, one off) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T1500). 
[15:01:17] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1164 [15:01:56] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1023 to wikikube-worker1164 [15:02:31] (03CR) 10Tiziano Fogli: [C:03+1] thanos: send sigkill as needed to stateless components [puppet] - 10https://gerrit.wikimedia.org/r/1114336 (https://phabricator.wikimedia.org/T383570) (owner: 10Filippo Giunchedi) [15:03:12] (03PS1) 10Hnowlan: fc-list: update font list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114398 (https://phabricator.wikimedia.org/T280718) [15:03:44] I'm here, and will get started in the next 10-15 minutes [15:04:01] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1024 to wikikube-worker1165 - kamila@cumin1002" [15:04:35] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1024 to wikikube-worker1165 - kamila@cumin1002" [15:04:35] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:04:36] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1165 [15:06:01] swfrench-wmf: not sure, the number of left k8s nodes went basically to almost 0 and then starting growing again [15:06:22] so my patches are probably not 100% deployed, but maybe like 98+% [15:06:33] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1165 [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:04] if you 
want to, I can revert them, but on the other hand I would prefer if we could just try to fix that with another sync [15:07:13] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1024 to wikikube-worker1165 [15:07:13] zabe: got it, thank you! yeah, I think that would be consistent with a timeout for one specific subset of the k8s deployments. indeed you're right though that the primary ones are fully updated. [15:07:19] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1159.eqiad.wmnet wikikube-worker1160.eqiad.wmnet wikikube-worker1161.eqiad.wmnet wikikube-worker1162.eqiad.wmnet wikikube-worker1163.eqiad.wmnet wikikube-worker1164.eqiad.wmnet wikikube-worker1165.eqiad.wmnet on all recursors [15:07:23] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1159.eqiad.wmnet wikikube-worker1160.eqiad.wmnet wikikube-worker1161.eqiad.wmnet wikikube-worker1162.eqiad.wmnet wikikube-worker1163.eqiad.wmnet wikikube-worker1164.eqiad.wmnet wikikube-worker1165.eqiad.wmnet on all recursors [15:07:45] zabe: yeah, no need to revert - I'll take it from here :) [15:07:57] okay thanks:) [15:08:37] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:08:40] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114352 (https://phabricator.wikimedia.org/T373027) (owner: 10Vgutierrez) [15:09:38] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-ctrl[1002-1003].eqiad.wmnet [15:09:48] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - 
https://phabricator.wikimedia.org/T379717#10497343 (10ops-monitoring-bot) depool host wikikube-ctrl[1002-1003].eqiad.wmnet by jayme@cumin1002... [15:09:52] !log jayme@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl[1002-1003].eqiad.wmnet with reason: Depooled via sre.k8s.pool-depool-node [15:09:55] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-ctrl[1002-1003].eqiad.wmnet [15:10:00] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db2157.codfw.wmnet with reason: Maintenance [15:10:01] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10497344 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin1... [15:10:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2157 (T384592)', diff saved to https://phabricator.wikimedia.org/P72490 and previous config saved to /var/cache/conftool/dbconfig/20250127-151007-marostegui.json [15:10:12] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [15:10:28] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1160.eqiad.wmnet with OS bookworm [15:10:31] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1160 [15:10:31] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1160 [15:10:34] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1161.eqiad.wmnet with OS bookworm [15:10:37] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1161 [15:10:37] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1161 [15:10:45] 
!log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1162.eqiad.wmnet with OS bookworm [15:10:49] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1162 [15:10:49] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1162 [15:10:59] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1163.eqiad.wmnet with OS bookworm [15:11:02] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1163 [15:11:02] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1163 [15:11:03] !log taavi@cumin1002 START - Cookbook sre.dns.netbox [15:11:04] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1164.eqiad.wmnet with OS bookworm [15:11:07] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1164 [15:11:07] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1164 [15:11:08] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1165.eqiad.wmnet with OS bookworm [15:11:11] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1165 [15:11:11] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1165 [15:11:34] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1159.eqiad.wmnet with OS bookworm [15:11:38] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1159 [15:11:38] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1159 [15:12:27] (03PS1) 10Scott French: mw-(api-ext|web): temporarily increase maxUnavailable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114400 (https://phabricator.wikimedia.org/T383845) [15:14:22] (03PS1) 10AikoChou: 
ml-services: update reference-quality storage uri [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114401 (https://phabricator.wikimedia.org/T384172) [15:15:01] !log taavi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add v6 cloud-private address for cloudlb2003-dev - taavi@cumin1002" [15:15:05] !log taavi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add v6 cloud-private address for cloudlb2003-dev - taavi@cumin1002" [15:15:05] !log taavi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:15:14] (03PS28) 10Bking: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson) [15:15:22] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090529 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson) [15:15:24] PROBLEM - Etcd cluster health on wikikube-ctrl1004 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [15:15:34] PROBLEM - Etcd cluster health on wikikube-ctrl1001 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [15:16:27] this is expected [15:16:41] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.reboot-runner (exit_code=0) rolling reboot on A:gitlab-runner [15:16:41] jayme: thank you - was just about to ask :) [15:16:48] although not anticipated [15:16:58] (03CR) 10Hnowlan: [C:03+1] mw-(api-ext|web): temporarily increase maxUnavailable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114400 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [15:18:32] !log taavi@cumin1002 START - Cookbook sre.dns.wipe-cache cloudlb2003-dev.private.codfw.wikimedia.cloud on all recursors [15:18:35] !log taavi@cumin1002 END (PASS) - Cookbook 
sre.dns.wipe-cache (exit_code=0) cloudlb2003-dev.private.codfw.wikimedia.cloud on all recursors [15:18:51] FIRING: [2x] KubernetesAPINotScrapable: k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [15:19:27] (CR) Scott French: [C:+2] mw-(api-ext|web): temporarily increase maxUnavailable [deployment-charts] - https://gerrit.wikimedia.org/r/1114400 (https://phabricator.wikimedia.org/T383845) (owner: Scott French) [15:19:28] swfrench-wmf: I'm mistaken... [15:19:32] FIRING: KubernetesCalicoDown: wikikube-worker1304.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1304.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:20:12] FIRING: [4x] ProbeDown: Service wikikube-ctrl1001:6443 has failed probes (http_eqiad_kube_apiserver_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:20:22] !incidents [15:20:23] 5636 (UNACKED) [4x] ProbeDown sre (probes/custom eqiad) [15:20:23] 5635 (RESOLVED) db2182 (paged)/MariaDB Replica SQL: s7 (paged) [15:20:23] 5634 (RESOLVED) db1241 (paged)/MariaDB Replica SQL: s4 (paged) [15:20:29] !ack 5636 [15:20:30] 5636 (ACKED) [4x] ProbeDown sre (probes/custom eqiad) [15:20:33] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [15:20:35] swfrench-wmf: this is me [15:20:39] shit [15:20:50] <_joe_> something's paging [15:20:52] (Merged) jenkins-bot: mw-(api-ext|web): temporarily increase maxUnavailable [deployment-charts] - 
https://gerrit.wikimedia.org/r/1114400 (https://phabricator.wikimedia.org/T383845) (owner: Scott French) [15:21:01] <_joe_> jayme: is that you? [15:21:09] jayme: thanks! yeah, let me know how I can help. in the meantime, I'm starting to look in parallel [15:21:25] yeah it's me [15:21:33] <_joe_> swfrench-wmf: isn't this just the kube controller going down? [15:21:34] not etcd is blocking [15:22:05] <_joe_> jayme: come again? [15:22:10] ah, yeah I assumed this was the etcd issue causing that [15:22:12] ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#10497387 (VRiley-WMF) [15:22:58] <_joe_> yeah sorry I was not looking at IRC at the moment [15:23:23] here [15:23:27] <_joe_> do we need an incident doc? [15:23:35] <_joe_> I don't think so, right? [15:23:52] I took down ctrl nodes in eqiad, expecting 2 remaining to be okay...they are about to be back [15:23:53] (PS5) Bking: opensearch: Introduce resource for keystore values [puppet] - https://gerrit.wikimedia.org/r/1091325 (https://phabricator.wikimedia.org/T380752) (owner: Ebernhardson) [15:23:56] <_joe_> swfrench-wmf: I would prepare to depool eqiad services [15:24:06] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1091325 (https://phabricator.wikimedia.org/T380752) (owner: Ebernhardson) [15:24:14] RECOVERY - Host ripe-atlas-eqiad is UP: PING WARNING - Packet loss = 50%, RTA = 0.62 ms [15:24:18] yes please [15:24:25] but give it another minute [15:25:04] _joe_: jayme: ack, yeah I will not touch that yet, but will start sorting the logistics [15:25:24] RECOVERY - Etcd cluster health on wikikube-ctrl1004 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [15:25:32] \o/ [15:25:34] RECOVERY - Etcd cluster health on wikikube-ctrl1001 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [15:25:46] <_joe_> this happened 
because I logged into one node [15:25:51] <_joe_> etcd fears me [15:25:52] <_joe_> :P [15:26:02] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1160.eqiad.wmnet with reason: host reimage [15:26:04] :) [15:26:10] RESOLVED: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1277:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:26:25] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1161.eqiad.wmnet with reason: host reimage [15:26:33] alright, I see API operations succeeding again [15:26:36] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1162.eqiad.wmnet with reason: host reimage [15:26:45] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1165.eqiad.wmnet with reason: host reimage [15:26:52] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1163.eqiad.wmnet with reason: host reimage [15:27:01] <_joe_> yeah crisis averted [15:27:05] swfrench-wmf: _joe_: should be good [15:27:17] jayme: great, thank you! [15:27:19] <_joe_> to be clear, I wasn't suggesting to already depool eqiad, but just to be ready to :) [15:27:36] well...100% my fault so please don't thank me :| [15:27:38] curious ... how did taking down a single control plane node do that? 
[15:27:42] (CR) Bking: [C:+2] opensearch: Introduce resource for keystore values [puppet] - https://gerrit.wikimedia.org/r/1091325 (https://phabricator.wikimedia.org/T380752) (owner: Ebernhardson) [15:27:54] (CR) DCausse: [C:+1] "let's figure out how to do proper sanity checks in a separate ticket" [puppet] - https://gerrit.wikimedia.org/r/1091325 (https://phabricator.wikimedia.org/T380752) (owner: Ebernhardson) [15:27:58] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1159.eqiad.wmnet with reason: host reimage [15:28:07] _joe_: yeah, totally - it was a good moment to start considering the "checklist" so to speak, though :) [15:28:29] (CR) Ssingh: [C:+1] wmflib,pybal: Add scheduler_flag support [puppet] - https://gerrit.wikimedia.org/r/1114352 (https://phabricator.wikimedia.org/T373027) (owner: Vgutierrez) [15:28:51] RESOLVED: [2x] KubernetesAPINotScrapable: k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [15:29:32] RESOLVED: KubernetesCalicoDown: wikikube-worker1304.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1304.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:29:47] (CR) Bking: [C:+2] opensearch: Add resource to define cross-cluster settings [puppet] - https://gerrit.wikimedia.org/r/1091326 (https://phabricator.wikimedia.org/T380752) (owner: Ebernhardson) [15:29:55] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:30:01] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring 
[15:30:02] (CR) Bking: [C:+2] opensearch: Add resource to log busy threads [puppet] - https://gerrit.wikimedia.org/r/1091327 (https://phabricator.wikimedia.org/T380752) (owner: Ebernhardson) [15:30:12] RESOLVED: [4x] ProbeDown: Service wikikube-ctrl1001:6443 has failed probes (http_eqiad_kube_apiserver_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:30:23] (CR) Ottomata: [C:+2] beta EventStreamConfig - set eventgate hoist_fields_from_http_headers [mediawiki-config] - https://gerrit.wikimedia.org/r/1114149 (https://phabricator.wikimedia.org/T382173) (owner: Ottomata) [15:30:32] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [15:30:37] PROBLEM - Host ripe-atlas-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [15:30:42] !log taavi@cumin1002 START - Cookbook sre.dns.netbox [15:31:06] (Merged) jenkins-bot: beta EventStreamConfig - set eventgate hoist_fields_from_http_headers [mediawiki-config] - https://gerrit.wikimedia.org/r/1114149 (https://phabricator.wikimedia.org/T382173) (owner: Ottomata) [15:32:25] PROBLEM - Etcd cluster health on wikikube-ctrl1004 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [15:32:39] PROBLEM - Etcd cluster health on wikikube-ctrl1001 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [15:32:50] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1160.eqiad.wmnet with reason: host reimage [15:32:51] the heck... [15:32:57] swfrench-wmf: problem still [15:33:15] swfrench-wmf: i just merged a mw-config change in InitialiseSettings-labs.php. 
[15:33:15] I don't need to scap deploy it in prod at all, but was going to do so just for good practice. Yall look busy(!) so perhaps I should skip this step? [15:33:45] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:33:51] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.205 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:34:14] jayme: ack, thanks - holding. lemme know if you need more hands. [15:34:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T384592)', diff saved to https://phabricator.wikimedia.org/P72491 and previous config saved to /var/cache/conftool/dbconfig/20250127-153435-marostegui.json [15:34:40] ottomata: it would be great if next time you would do it during a mediawiki backport window, since, in theory, we would be using this one for an infra deployment [15:34:41] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [15:34:50] swfrench-wmf: I'm in touch with dcops ... 
the nodes had network cables switched [15:35:12] FIRING: [4x] ProbeDown: Service wikikube-ctrl1001:6443 has failed probes (http_eqiad_kube_apiserver_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:35:17] jayme: ah, that's fun [15:35:22] !incidents [15:35:23] 5637 (UNACKED) [4x] ProbeDown sre (probes/custom eqiad) [15:35:23] 5636 (RESOLVED) [4x] ProbeDown sre (probes/custom eqiad) [15:35:24] 5635 (RESOLVED) db2182 (paged)/MariaDB Replica SQL: s7 (paged) [15:35:24] 5634 (RESOLVED) db1241 (paged)/MariaDB Replica SQL: s4 (paged) [15:35:27] !log taavi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add v6 cloud-private address for cloudlb2002-dev - taavi@cumin1002" [15:35:29] !ack 5637 [15:35:30] 5637 (ACKED) [4x] ProbeDown sre (probes/custom eqiad) [15:35:31] !log taavi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add v6 cloud-private address for cloudlb2002-dev - taavi@cumin1002" [15:35:31] !log taavi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:35:47] !log taavi@cumin1002 START - Cookbook sre.dns.wipe-cache cloudlb2002-dev.private.codfw.wikimedia.cloud on all recursors [15:35:51] !log taavi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudlb2002-dev.private.codfw.wikimedia.cloud on all recursors [15:35:53] effie: i'm sorry! you are right. I can revert if you prefer! I proceeded since it was just beta, but then asked in slack and realized good practice is to deploy -labs.php files in prod too, even though they are not used there. 
[15:36:25] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1161.eqiad.wmnet with reason: host reimage [15:36:27] is the latest alert same issue as previous one? [15:36:46] <_joe_> ottomata: wait for our green light, then deploy [15:36:51] FIRING: [3x] KubernetesAPINotScrapable: k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [15:37:01] _joe_: okay, will wait. ty [15:37:03] topranks: it sounds like at least in part? though I'm not sure about the details [15:37:40] topranks: yes [15:37:55] <_joe_> jayme: let us know what's going on / if we can help [15:38:19] jayme: is dc ops taking action that would resolve this, or do we need to do something? [15:38:33] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [15:39:00] dcops is aware and trying to fix [15:39:16] not sure why we lost connectivity again to ctrl1003 [15:39:23] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1162.eqiad.wmnet with reason: host reimage [15:39:55] jayme: ack, thanks [15:42:41] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1159.eqiad.wmnet with reason: host reimage [15:43:10] FIRING: [4x] KubernetesRsyslogDown: rsyslog on wikikube-worker1078:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:43:25] RECOVERY - Etcd cluster health on wikikube-ctrl1004 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [15:43:41] RECOVERY - Etcd cluster health on wikikube-ctrl1001 is OK: The etcd server is 
healthy https://wikitech.wikimedia.org/wiki/Etcd [15:44:14] * swfrench-wmf is cautiously optimistic [15:44:51] k8s api calls working in eqiad [15:44:55] alright, API operations are back [15:45:01] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#10497449 (10VRiley-WMF) [15:45:12] RESOLVED: [4x] ProbeDown: Service wikikube-ctrl1001:6443 has failed probes (http_eqiad_kube_apiserver_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:45:49] ctrl1002 is back, 1003 still unreachable [15:45:50] jayme: do you need coordination assistance? e.g., would a doc help here (I can IC) [15:45:53] (03PS29) 10Bking: Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson) [15:45:59] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#10497468 (10VRiley-WMF) [15:46:01] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1165.eqiad.wmnet with reason: host reimage [15:46:08] swfrench-wmf: no, thanks. Should be "good" now [15:46:43] (03PS1) 10Ottomata: beta wgEventStreams - set hoist_fields_from_http_headers on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114409 (https://phabricator.wikimedia.org/T382173) [15:46:51] RESOLVED: [3x] KubernetesAPINotScrapable: k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [15:47:22] jayme: got it, thank you! so 1 of 3 nodes is still unavailable, presumably due to a network issue IIUC? 
[15:48:06] yes, but that one is back up now as well [15:48:10] RESOLVED: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1078:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:48:31] 06SRE, 10Deployments, 06Release-Engineering-Team, 13Patch-For-Review: Reduce automatic messages on #wikimedia-operations - https://phabricator.wikimedia.org/T384804#10497480 (10Dzahn) Whatever we end up doing, let's resist the temptation to create yet another "-feed" channel (that few look at) because that... [15:48:33] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [15:48:34] awesome [15:48:49] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [15:49:29] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1163.eqiad.wmnet with reason: host reimage [15:49:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72493 and previous config saved to /var/cache/conftool/dbconfig/20250127-154933-root.json [15:49:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P72494 and previous config saved to /var/cache/conftool/dbconfig/20250127-154942-marostegui.json [15:50:42] swfrench-wmf: etcd is all happy again [15:51:22] jayme: awesome, thank you for confirming! 
[15:51:31] FIRING: [5x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [15:51:39] e.ffie and I will venture a backport deployment shortly, then [15:52:24] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt frnetmon1002 - vriley@cumin1002" [15:52:29] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt frnetmon1002 - vriley@cumin1002" [15:52:29] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:53:07] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1160.eqiad.wmnet with OS bookworm [15:54:15] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1161.eqiad.wmnet with OS bookworm [15:56:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2194 T384807', diff saved to https://phabricator.wikimedia.org/P72495 and previous config saved to /var/cache/conftool/dbconfig/20250127-155613-marostegui.json [15:56:18] T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807 [15:56:31] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2194.codfw.wmnet [15:57:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113566 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [15:58:10] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1162.eqiad.wmnet with OS bookworm [15:58:38] (03Merged) 10jenkins-bot: Enroll 0.1% of client sessions in PHP 8.1 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113566 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [16:00:10] !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1113566|Enroll 0.1% of client sessions in PHP 8.1 (T383845)]] [16:00:14] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [16:00:31] RECOVERY - Disk space on ml-lab1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops [16:00:43] ottomata: we will deploy your patch too as we are scap backporting [16:01:35] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2194.codfw.wmnet [16:01:52] (03CR) 10Klausman: [V:03+2 C:03+2] kserve: add support for PSS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114012 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [16:01:56] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1159.eqiad.wmnet with OS bookworm [16:02:31] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2194.codfw.wmnet with reason: Index rebuild [16:02:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P72496 and previous config saved to /var/cache/conftool/dbconfig/20250127-160237-root.json [16:02:51] PROBLEM - Host ganeti2020 is DOWN: PING CRITICAL - Packet loss = 100% [16:03:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2194', diff saved to https://phabricator.wikimedia.org/P72497 and previous config saved to /var/cache/conftool/dbconfig/20250127-160300-marostegui.json [16:04:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72498 and 
previous config saved to /var/cache/conftool/dbconfig/20250127-160438-root.json [16:04:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P72499 and previous config saved to /var/cache/conftool/dbconfig/20250127-160449-marostegui.json [16:05:01] !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1113566|Enroll 0.1% of client sessions in PHP 8.1 (T383845)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:05:19] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1165.eqiad.wmnet with OS bookworm [16:06:06] (03Merged) 10jenkins-bot: kserve: add support for PSS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114012 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [16:07:45] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2194.codfw.wmnet with reason: Index rebuild T384807 [16:07:49] T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807 [16:08:35] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1164.eqiad.wmnet with OS bookworm [16:08:37] FIRING: ProbeDown: Service ganeti2020:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:08:51] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1163.eqiad.wmnet with OS bookworm [16:09:47] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1164.eqiad.wmnet with OS bookworm [16:09:51] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1164 [16:09:52] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan 
(exit_code=0) for host wikikube-worker1164 [16:11:41] !log swfrench@deploy2002 swfrench: Continuing with sync [16:12:38] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2020 [16:12:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2020 [16:13:11] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2025.codfw.wmnet to cluster codfw and group D [16:15:59] RECOVERY - Host ganeti2020 is UP: PING OK - Packet loss = 0%, RTA = 30.49 ms [16:16:07] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2025.codfw.wmnet to cluster codfw and group D [16:18:21] !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1113566|Enroll 0.1% of client sessions in PHP 8.1 (T383845)]] (duration: 18m 11s) [16:18:24] (03PS1) 10Klausman: admin_ng/values/ml-staging: add cluster_group [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114414 (https://phabricator.wikimedia.org/T369493) [16:18:26] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [16:18:38] RESOLVED: ProbeDown: Service ganeti2020:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:19:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 1%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P72501 and previous config saved to /var/cache/conftool/dbconfig/20250127-161932-root.json [16:19:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72502 and previous config saved to /var/cache/conftool/dbconfig/20250127-161944-root.json [16:19:57] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T384592)', diff saved to https://phabricator.wikimedia.org/P72503 and previous config saved to /var/cache/conftool/dbconfig/20250127-161956-marostegui.json [16:20:01] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [16:20:03] (03PS1) 10Fabfur: hiera: enable haproxykafka on codfw [puppet] - 10https://gerrit.wikimedia.org/r/1114415 (https://phabricator.wikimedia.org/T378578) [16:20:12] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db2171.codfw.wmnet with reason: Maintenance [16:20:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2171 (T384592)', diff saved to https://phabricator.wikimedia.org/P72504 and previous config saved to /var/cache/conftool/dbconfig/20250127-162018-marostegui.json [16:21:05] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114415 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [16:23:33] (03PS1) 10Fabfur: hiera: enable haproxykafka on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1114417 (https://phabricator.wikimedia.org/T378578) [16:25:28] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1164.eqiad.wmnet with reason: host reimage [16:26:07] (03Abandoned) 10Fabfur: hiera: enable haproxykafka on all datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1114392 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [16:28:56] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1164.eqiad.wmnet with reason: host reimage [16:30:05] jan_drewniak: Time to do the Wikimedia Portals Update deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T1630). 
[16:30:24] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1114417 (https://phabricator.wikimedia.org/T378578) (owner: Fabfur) [16:30:40] swfrench-wm.f and I will be using the Wikimedia Portals Update deploy window folks [16:30:40] (CR) Fabfur: [C:-1] "Do not merge before 28/01/2025" [puppet] - https://gerrit.wikimedia.org/r/1114415 (https://phabricator.wikimedia.org/T378578) (owner: Fabfur) [16:30:59] (CR) Fabfur: [C:-1] "Do not merge before 28/01/2025" [puppet] - https://gerrit.wikimedia.org/r/1114417 (https://phabricator.wikimedia.org/T378578) (owner: Fabfur) [16:31:29] (CR) Herron: [C:+1] thanos: send sigkill as needed to stateless components [puppet] - https://gerrit.wikimedia.org/r/1114336 (https://phabricator.wikimedia.org/T383570) (owner: Filippo Giunchedi) [16:31:34] effie: okay thanks, I have another one i didn't merge [16:31:43] (meetings started anyway) [16:31:46] (Abandoned) Klausman: admin_ng/values/ml-staging: add cluster_group [deployment-charts] - https://gerrit.wikimedia.org/r/1114414 (https://phabricator.wikimedia.org/T369493) (owner: Klausman) [16:32:02] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [16:32:18] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. 
[16:32:48] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-ctrl[1002-1003].eqiad.wmnet [16:32:50] !log jayme@cumin1002 START - Cookbook sre.hosts.remove-downtime for wikikube-ctrl[1002-1003].eqiad.wmnet [16:32:51] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-ctrl[1002-1003].eqiad.wmnet [16:32:52] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-ctrl[1002-1003].eqiad.wmnet [16:34:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 2%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P72505 and previous config saved to /var/cache/conftool/dbconfig/20250127-163437-root.json [16:34:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72506 and previous config saved to /var/cache/conftool/dbconfig/20250127-163449-root.json [16:35:10] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudgw1003.eqiad.wmnet with OS bookworm [16:39:55] (03CR) 10Klausman: [V:03+2 C:03+2] knative-serving: add support for PSS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114016 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [16:40:50] (03PS1) 10Elukey: services: update Kartotherian's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114420 (https://phabricator.wikimedia.org/T384530) [16:41:28] (03PS2) 10Elukey: services: update Kartotherian's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114420 (https://phabricator.wikimedia.org/T384530) [16:42:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T384592)', diff saved to https://phabricator.wikimedia.org/P72507 and previous config saved to /var/cache/conftool/dbconfig/20250127-164231-marostegui.json [16:42:37] T384592: Add normalization columns 
to categorylinks table - https://phabricator.wikimedia.org/T384592 [16:43:56] (03CR) 10Elukey: [C:03+2] services: update Kartotherian's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114420 (https://phabricator.wikimedia.org/T384530) (owner: 10Elukey) [16:43:59] (03Merged) 10jenkins-bot: knative-serving: add support for PSS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114016 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [16:44:39] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [16:45:25] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [16:46:48] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2075'] [16:47:22] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-be2075'] [16:48:01] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1164.eqiad.wmnet with OS bookworm [16:49:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 3%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P72508 and previous config saved to /var/cache/conftool/dbconfig/20250127-164942-root.json [16:49:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72509 and previous config saved to /var/cache/conftool/dbconfig/20250127-164955-root.json [16:52:18] !jouncebot nowandnext [16:52:18] a Python reminder bot for deployments. 
see https://wikitech.wikimedia.org/wiki/Tool:Jouncebot [16:52:35] jouncebot: nowandnext [16:52:35] For the next 0 hour(s) and 7 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T1630) [16:52:35] In 1 hour(s) and 7 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T1800) [16:52:35] In 1 hour(s) and 7 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T1800) [16:52:37] lol [16:52:56] hehe [16:54:45] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [16:56:18] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [16:57:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P72510 and previous config saved to /var/cache/conftool/dbconfig/20250127-165738-marostegui.json [16:58:24] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1159-1165].eqiad.wmnet [16:58:26] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1159-1165].eqiad.wmnet [16:58:32] alright, after a bit of a delay, we're going to ramp the fraction of enrolled traffic up a bit more (still at / below 1% of external web / API traffic) [17:01:01] (03PS2) 10Scott French: Enroll 1% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113567 (https://phabricator.wikimedia.org/T383845) [17:03:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113567 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:03:49] (03Merged) 10jenkins-bot: Enroll 1% of client sessions in PHP 8.1 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1113567 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:04:04] !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1113567|Enroll 1% of client sessions in PHP 8.1 (T383845)]] [17:04:09] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [17:04:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 4%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P72511 and previous config saved to /var/cache/conftool/dbconfig/20250127-170448-root.json [17:05:30] FIRING: ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:09:05] !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1113567|Enroll 1% of client sessions in PHP 8.1 (T383845)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:09:10] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [17:09:21] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [17:09:27] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [17:10:10] !log swfrench@deploy2002 swfrench: Continuing with sync [17:10:30] RESOLVED: [2x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:12:46] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P72512 and previous config saved to /var/cache/conftool/dbconfig/20250127-171245-marostegui.json [17:12:57] (03PS1) 10Elukey: admin_ng: disable PSP mutation for ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114423 (https://phabricator.wikimedia.org/T369493) [17:13:53] (03CR) 10Klausman: [C:03+1] admin_ng: disable PSP mutation for ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114423 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [17:14:45] effie: swfrench-wmf, are you all still deploying? I have another beta only patch to merge. I don't need it deployed in production, but it should go out eventually. [17:14:45] I absolutely can wait if that is better for you [17:15:19] ottomata: thanks for checking! yes, we're still deploying, so that would be great if you could hold. [17:15:55] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:18:06] !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1113567|Enroll 1% of client sessions in PHP 8.1 (T383845)]] (duration: 14m 02s) [17:18:11] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [17:18:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:19:24] (03PS1) 10C. Scott Ananian: Condense wikivoyage configuration options for Parsoid Read Views [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114425 (https://phabricator.wikimedia.org/T365367) [17:19:32] (03CR) 10Clare Ming: "should we enable this for labswiki too?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114396 (https://phabricator.wikimedia.org/T384728) (owner: 10Phuedx) [17:19:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 5%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P72513 and previous config saved to /var/cache/conftool/dbconfig/20250127-171953-root.json [17:22:17] (03PS2) 10C. Scott Ananian: Condense wikivoyage configuration options for Parsoid Read Views [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114425 (https://phabricator.wikimedia.org/T365367) [17:24:29] swfrench-wmf: 👍 ty [17:25:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.143s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:26:27] (03CR) 10Cathal Mooney: [C:03+1] "LGTM, lint error seems unrelated, think we need to add a line telling it to ignore it for that function" [cookbooks] - 10https://gerrit.wikimedia.org/r/1112171 (owner: 10Muehlenhoff) [17:27:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T384592)', diff saved to https://phabricator.wikimedia.org/P72514 and previous config saved to /var/cache/conftool/dbconfig/20250127-172752-marostegui.json [17:27:57] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [17:28:08] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db2178.codfw.wmnet with reason: Maintenance [17:28:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2178 (T384592)', diff saved to https://phabricator.wikimedia.org/P72515 and previous config saved to 
/var/cache/conftool/dbconfig/20250127-172814-marostegui.json [17:28:23] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:30:05] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: update reference-quality storage uri [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114401 (https://phabricator.wikimedia.org/T384172) (owner: 10AikoChou) [17:30:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.143s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:30:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:34:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 10%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P72516 and previous config saved to /var/cache/conftool/dbconfig/20250127-173458-root.json [17:35:42] jouncebot: now [17:35:43] For the next 0 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC afternoon, 2nd attempt) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T1705) [17:35:49] ottomata: we are done, you could use the rest of our window if you want [17:38:30] effie: ty [17:38:35] doing! [17:40:20] (03CR) 10Ottomata: [C:03+2] beta wgEventStreams - set hoist_fields_from_http_headers on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114409 (https://phabricator.wikimedia.org/T382173) (owner: 10Ottomata) [17:41:17] (03CR) 10Subramanya Sastry: [C:03+1] Condense wikivoyage configuration options for Parsoid Read Views [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114425 (https://phabricator.wikimedia.org/T365367) (owner: 10C. 
Scott Ananian) [17:41:33] (03Merged) 10jenkins-bot: beta wgEventStreams - set hoist_fields_from_http_headers on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114409 (https://phabricator.wikimedia.org/T382173) (owner: 10Ottomata) [17:48:10] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [17:48:28] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [17:48:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T384592)', diff saved to https://phabricator.wikimedia.org/P72517 and previous config saved to /var/cache/conftool/dbconfig/20250127-174833-marostegui.json [17:48:39] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [17:49:22] (03PS1) 10Volans: sre.hosts.decommission: fix CI [cookbooks] - 10https://gerrit.wikimedia.org/r/1114432 [17:50:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 25%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P72518 and previous config saved to /var/cache/conftool/dbconfig/20250127-175004-root.json [17:50:12] (03CR) 10Volans: "rebasing on top of I0b9bd18c5c9d606dca49c580075b2aa0e9e9a677 should fix it" [cookbooks] - 10https://gerrit.wikimedia.org/r/1112171 (owner: 10Muehlenhoff) [17:54:40] (03CR) 10Dzahn: [V:04-1 C:04-1] "Ah yea.. so I made this before there was "profile::tlsproxy::envoy::firewall_src_sets". 
It was an attempt to make it work with an empty (n" [puppet] - 10https://gerrit.wikimedia.org/r/1055491 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [17:55:24] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudgw1003.eqiad.wmnet with OS bookworm [17:55:43] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114433 [17:55:56] jouncebot: nowandnext [17:55:57] For the next 0 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC afternoon, 2nd attempt) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T1705) [17:55:57] In 0 hour(s) and 4 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T1800) [17:55:57] In 0 hour(s) and 4 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T1800) [17:56:20] (03CR) 10Volans: [C:03+2] "self merging, trivial." [cookbooks] - 10https://gerrit.wikimedia.org/r/1114432 (owner: 10Volans) [17:56:54] (03CR) 10Dzahn: [V:04-1 C:04-1] "let's use your patch in this case. it already has more lines and is more current. I will abandon this one in favor of yours." [puppet] - 10https://gerrit.wikimedia.org/r/1055491 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [17:57:05] (03Abandoned) 10Dzahn: ci: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055491 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [18:00:05] swfrench-wmf and effie: Your horoscope predicts another MediaWiki infrastructure (UTC afternoon, 2nd attempt) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T1705). 
[18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T1800) [18:00:05] ryankemper: Your horoscope predicts another Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T1800). [18:00:12] (03PS1) 10Reedy: LicenseParser: Avoid passing null to string functions [extensions/CommonsMetadata] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114435 (https://phabricator.wikimedia.org/T384853) [18:00:24] (03CR) 10Reedy: [C:03+2] LicenseParser: Avoid passing null to string functions [extensions/CommonsMetadata] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114435 (https://phabricator.wikimedia.org/T384853) (owner: 10Reedy) [18:01:13] ah, interesting side effect of overlapping deployment windows :) [18:02:19] no further deployments planned on our end, but it looks like R.eedy is preparing cherrypicks to backport for the couple of deprecation errors we've seen [18:02:20] (03Merged) 10jenkins-bot: sre.hosts.decommission: fix CI [cookbooks] - 10https://gerrit.wikimedia.org/r/1114432 (owner: 10Volans) [18:03:13] more thank happy to see those move during the window if ready [18:03:16] *than [18:03:32] (03Merged) 10jenkins-bot: LicenseParser: Avoid passing null to string functions [extensions/CommonsMetadata] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114435 (https://phabricator.wikimedia.org/T384853) (owner: 10Reedy) [18:03:34] (03PS3) 10Muehlenhoff: sre.hosts.reimage: Add link to the help text for move-vlan [cookbooks] - 10https://gerrit.wikimedia.org/r/1112171 [18:03:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P72520 and previous config saved to /var/cache/conftool/dbconfig/20250127-180341-marostegui.json [18:05:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @
50%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P72521 and previous config saved to /var/cache/conftool/dbconfig/20250127-180509-root.json [18:05:40] (03PS14) 10Dzahn: releases: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055494 (https://phabricator.wikimedia.org/T370677) [18:07:02] (03PS15) 10Dzahn: releases: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055494 (https://phabricator.wikimedia.org/T370677) [18:08:28] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114437 [18:08:31] (03PS1) 10Dzahn: Revert "gerrit: block alibaba Cloud IPs" [puppet] - 10https://gerrit.wikimedia.org/r/1114438 [18:08:51] (03CR) 10BCornwall: [C:03+1] Upgrade to CAS 7.1 [dns] - 10https://gerrit.wikimedia.org/r/1114388 (owner: 10Slyngshede) [18:09:02] Gets the fixed noise out of the way... especially when I'd probably expect more like that from the same underlying PHP functions [18:09:10] (03CR) 10Dzahn: "Just created this because I saw someone added a TODO to revert this. 
Do you (still) think we should revert it now or just keep it as addit" [puppet] - 10https://gerrit.wikimedia.org/r/1114438 (owner: 10Dzahn) [18:10:01] !log tchin@deploy2002 Started deploy [airflow-dags/analytics@c49f40b]: Deploying airflow for T357684 [18:10:06] T357684: Dashboard and alerting of data quality metrics for wmf_content.mediawiki_content_history_v1 - https://phabricator.wikimedia.org/T357684 [18:10:39] !log tchin@deploy2002 Finished deploy [airflow-dags/analytics@c49f40b]: Deploying airflow for T357684 (duration: 01m 01s) [18:13:08] (03PS1) 10Daimona Eaytoy: prod: Enable $wgCampaignEventsEnableEventTopics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114440 (https://phabricator.wikimedia.org/T380818) [18:13:38] (03CR) 10Clare Ming: [C:03+1] testwiki: Enable MetricsPlatform experiment enrollment overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114396 (https://phabricator.wikimedia.org/T384728) (owner: 10Phuedx) [18:13:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114440 (https://phabricator.wikimedia.org/T380818) (owner: 10Daimona Eaytoy) [18:13:50] (03CR) 10CI reject: [V:04-1] prod: Enable $wgCampaignEventsEnableEventTopics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114440 (https://phabricator.wikimedia.org/T380818) (owner: 10Daimona Eaytoy) [18:14:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114396 (https://phabricator.wikimedia.org/T384728) (owner: 10Phuedx) [18:14:08] (03CR) 10Dzahn: [C:04-1] "so.. 
rebased and "just switch it" currently fails in this manner, FYI:" [puppet] - 10https://gerrit.wikimedia.org/r/1055494 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [18:15:01] (03CR) 10Dzahn: [C:04-1] "so it's still requiring / looking for ferm related resources" [puppet] - 10https://gerrit.wikimedia.org/r/1055494 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [18:16:43] !log reedy@deploy2002 Synchronized php-1.44.0-wmf.13/extensions/CommonsMetadata/: T384853 T384854 (duration: 10m 45s) [18:16:49] T384853: PHP Deprecated: trim(): Passing null to parameter #1 ($string) of type string is deprecated - https://phabricator.wikimedia.org/T384853 [18:16:49] T384854: PHP Deprecated: strtolower(): Passing null to parameter #1 ($string) of type string is deprecated - https://phabricator.wikimedia.org/T384854 [18:17:15] (03PS2) 10Daimona Eaytoy: prod: Enable $wgCampaignEventsEnableEventTopics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114440 (https://phabricator.wikimedia.org/T380818) [18:18:23] (03PS2) 10BCornwall: controol: rm ats-be services cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/1114074 [18:18:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P72522 and previous config saved to /var/cache/conftool/dbconfig/20250127-181847-marostegui.json [18:20:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 75%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P72523 and previous config saved to /var/cache/conftool/dbconfig/20250127-182014-root.json [18:20:15] (03CR) 10BCornwall: "Thanks, makes sense! 
I've updated the PS" [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [18:20:51] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudgw1004.eqiad.wmnet with OS bookworm [18:23:09] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4866/console" [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [18:23:34] (03PS1) 10Dzahn: gerrit: remove UA-based blocking of some old bots/spiders [puppet] - 10https://gerrit.wikimedia.org/r/1114442 [18:24:32] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4867/console" [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [18:26:27] (03PS3) 10Raymond Ndibe: [toolforge::harbor] use latest thirdparty/docker [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720) [18:26:48] (03CR) 10CI reject: [V:04-1] [toolforge::harbor] use latest thirdparty/docker [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720) (owner: 10Raymond Ndibe) [18:26:50] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4868/console" [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [18:27:03] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudgw1004.eqiad.wmnet with OS bookworm [18:27:48] (03PS3) 10Raymond Ndibe: [toolforge::harbor] upgrade harbor v2.10.1 ---> v2.12.2 [puppet] - 10https://gerrit.wikimedia.org/r/1113871 (https://phabricator.wikimedia.org/T358225) [18:28:35] (03CR) 10Hashar: "Some of those crawlers still have hit (Baidu, Sogou, bingbot). 
I revisited them some months ago :)" [puppet] - 10https://gerrit.wikimedia.org/r/1114442 (owner: 10Dzahn) [18:30:23] (03PS4) 10Raymond Ndibe: [toolforge::harbor] use latest thirdparty/docker [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720) [18:30:44] (03CR) 10Dzahn: "digging back further, these are PRE 2012/2013" [puppet] - 10https://gerrit.wikimedia.org/r/1114442 (owner: 10Dzahn) [18:30:54] (03PS2) 10Dzahn: gerrit: remove UA-based blocking of some old bots/spiders [puppet] - 10https://gerrit.wikimedia.org/r/1114442 [18:31:54] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [18:33:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T384592)', diff saved to https://phabricator.wikimedia.org/P72524 and previous config saved to /var/cache/conftool/dbconfig/20250127-183355-marostegui.json [18:34:00] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [18:34:10] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db2192.codfw.wmnet with reason: Maintenance [18:34:17] (03PS4) 10Raymond Ndibe: [toolforge::harbor] upgrade harbor v2.10.1 ---> v2.12.2 [puppet] - 10https://gerrit.wikimedia.org/r/1113871 (https://phabricator.wikimedia.org/T358225) [18:34:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2192 (T384592)', diff saved to https://phabricator.wikimedia.org/P72525 and previous config saved to /var/cache/conftool/dbconfig/20250127-183417-marostegui.json [18:35:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 100%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P72526 and previous config saved to /var/cache/conftool/dbconfig/20250127-183519-root.json [18:35:27] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt 
pay-lb1001 - vriley@cumin1002" [18:35:31] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt pay-lb1001 - vriley@cumin1002" [18:35:32] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:36:20] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1114146 (owner: 10TrainBranchBot) [18:44:31] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1114445 [18:46:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P72527 and previous config saved to /var/cache/conftool/dbconfig/20250127-184642-root.json [18:48:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2194', diff saved to https://phabricator.wikimedia.org/P72528 and previous config saved to /var/cache/conftool/dbconfig/20250127-184839-marostegui.json [18:51:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T384592)', diff saved to https://phabricator.wikimedia.org/P72529 and previous config saved to /var/cache/conftool/dbconfig/20250127-185104-marostegui.json [18:51:09] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [18:52:11] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2025.01.11 - 2025.01.31): decommission cloudelastic100[5-6] - https://phabricator.wikimedia.org/T380937#10498376 (10Papaul) [18:52:33] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2025.01.11 - 2025.01.31): decommission cloudelastic100[5-6] - https://phabricator.wikimedia.org/T380937#10498379 (10Papaul) 05Open→03Resolved a:03Papaul complete [18:56:12] !log vriley@cumin1002 START - Cookbook 
sre.dns.netbox [18:56:33] 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T384645#10498403 (10Papaul) 05Open→03Resolved a:03Papaul fixed [18:57:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72530 and previous config saved to /var/cache/conftool/dbconfig/20250127-185715-root.json [18:59:08] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1114445 (owner: 10TrainBranchBot) [18:59:42] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt pay-lb1002~ - vriley@cumin1002" [18:59:46] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt pay-lb1002~ - vriley@cumin1002" [18:59:46] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:00:02] (03PS1) 10Mstyles: security-landing-page: deploying update [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114446 (https://phabricator.wikimedia.org/T383098) [19:00:43] (03CR) 10Ssingh: "Your change looks good but I think we will need to update one more thing and additionally, check that we are not referencing ats-be anywhe" [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [19:06:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P72531 and previous config saved to /var/cache/conftool/dbconfig/20250127-190611-marostegui.json [19:08:37] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:12:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72532 and previous config saved to /var/cache/conftool/dbconfig/20250127-191220-root.json [19:13:35] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [19:15:34] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#10498484 (10VRiley-WMF) [19:16:52] 06SRE, 10Deployments, 06Release-Engineering-Team, 13Patch-For-Review: Reduce automatic messages on #wikimedia-operations - https://phabricator.wikimedia.org/T384804#10498487 (10Quiddity) Side-note in case it helps anyone (and probably only potentially helps IRCCloud users): I've been using some custom-CSS... [19:17:11] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding puppetserver2004 to codfw - jhancock@cumin2002" [19:17:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding puppetserver2004 to codfw - jhancock@cumin2002" [19:17:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:18:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:18:32] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:19:27] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:21:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to 
https://phabricator.wikimedia.org/P72533 and previous config saved to /var/cache/conftool/dbconfig/20250127-192118-marostegui.json [19:21:50] (03PS1) 10Jforrester: MemcachedBagOStuff: Null coalescing $component [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114448 (https://phabricator.wikimedia.org/T384858) [19:23:36] (03PS1) 10Ottomata: beta wgEventStreams - test Opt out of collecting user-agent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114449 (https://phabricator.wikimedia.org/T382173) [19:24:39] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:25:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:25:45] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:26:14] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10498501 (10Jhancock.wm) [19:26:59] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10498503 (10Jhancock.wm) provisioning failing. will check bios settings later and then try again. 
[19:27:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72534 and previous config saved to /var/cache/conftool/dbconfig/20250127-192725-root.json [19:28:02] (CR) Ottomata: [C:+2] beta wgEventStreams - test Opt out of collecting user-agent [mediawiki-config] - https://gerrit.wikimedia.org/r/1114449 (https://phabricator.wikimedia.org/T382173) (owner: Ottomata) [19:28:53] (Merged) jenkins-bot: beta wgEventStreams - test Opt out of collecting user-agent [mediawiki-config] - https://gerrit.wikimedia.org/r/1114449 (https://phabricator.wikimedia.org/T382173) (owner: Ottomata) [19:36:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T384592)', diff saved to https://phabricator.wikimedia.org/P72535 and previous config saved to /var/cache/conftool/dbconfig/20250127-193625-marostegui.json [19:36:30] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [19:36:41] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db2201.codfw.wmnet with reason: Maintenance [19:42:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72536 and previous config saved to /var/cache/conftool/dbconfig/20250127-194231-root.json [19:50:14] ops-eqiad, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti105[3-8] - https://phabricator.wikimedia.org/T384838#10498575 (RobH) [19:51:31] FIRING: [5x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [19:52:09] ops-codfw, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[46-51] -
https://phabricator.wikimedia.org/T384838#10498581 (RobH) [19:57:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72537 and previous config saved to /var/cache/conftool/dbconfig/20250127-195736-root.json [19:59:34] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db2211.codfw.wmnet with reason: Maintenance [19:59:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2211 (T384592)', diff saved to https://phabricator.wikimedia.org/P72538 and previous config saved to /var/cache/conftool/dbconfig/20250127-195939-marostegui.json [19:59:45] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [20:02:16] (PS1) Jforrester: SimpleCaptcha: Don't look up captcha if no ID was given [extensions/ConfirmEdit] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1114454 (https://phabricator.wikimedia.org/T384858) [20:06:37] ops-eqiad, SRE, cloud-services-team, DC-Ops: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10498642 (RobH) After Andrew pinged about this today in IRC, I can see on the system it has the alarms on idrac: System Inlet Temperature 35 °C (95.0 °F) w... [20:08:27] (CR) Hashar: "You can look at the accesslog via https://logstash.wikimedia.org/app/dashboards#/view/825c5c80-8aef-11eb-8ab2-63c7f3b019fc and filtering o" [puppet] - https://gerrit.wikimedia.org/r/1114442 (owner: Dzahn) [20:13:55] (PS1) TheAnarcat: dump backtrace on exception, on --trace [software/cumin] - https://gerrit.wikimedia.org/r/1114456 (https://phabricator.wikimedia.org/T384539) [20:18:08] ops-eqiad, SRE, cloud-services-team, DC-Ops: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10498659 (RobH) >>!
In T383723#10498642, @RobH wrote: > After Andrew pinged about this today in IRC, I can see on the system it has the alarms on idrac: Sys... [20:18:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T384592)', diff saved to https://phabricator.wikimedia.org/P72539 and previous config saved to /var/cache/conftool/dbconfig/20250127-201832-marostegui.json [20:20:35] (CR) Fabfur: [C:+1] controol: rm ats-be services cache nodes [puppet] - https://gerrit.wikimedia.org/r/1114074 (owner: BCornwall) [20:21:34] SRE, Commons, MediaWiki-Uploading, Traffic: HTTP 503 error when uploading images on Wikimedia Commons - https://phabricator.wikimedia.org/T383274#10498666 (RLazarus) [20:26:06] SRE, Data-Engineering, Observability-Logging, Patch-For-Review: Add x-analytics nocookie=1 and x-tls-sess to webrequest-sampled-live stream - https://phabricator.wikimedia.org/T383900#10498690 (RLazarus) [20:28:21] (CR) SBassett: [C:+2] security-landing-page: deploying update [deployment-charts] - https://gerrit.wikimedia.org/r/1114446 (https://phabricator.wikimedia.org/T383098) (owner: Mstyles) [20:28:42] ops-eqiad, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T384869 (phaultfinder) NEW [20:29:37] (Merged) jenkins-bot: security-landing-page: deploying update [deployment-charts] - https://gerrit.wikimedia.org/r/1114446 (https://phabricator.wikimedia.org/T383098) (owner: Mstyles) [20:33:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P72540 and previous config saved to /var/cache/conftool/dbconfig/20250127-203339-marostegui.json [20:36:37] SRE, MW-on-K8s, serviceops: mwgrep cannot be used from a deployment host - https://phabricator.wikimedia.org/T384764#10498755 (RLazarus) This isn't working because it was never upgraded to Python 3.
(`reload` was a built-in function in Python 2, moved to `importlib` in 3.) The mwmaint hosts are still... [20:36:44] !log power down logging-hd1005 for maintenance [20:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:27] PROBLEM - Host logging-hd1005 is DOWN: PING CRITICAL - Packet loss = 100% [20:39:15] SRE, SRE-Unowned, Deployments, Release-Engineering-Team, Patch-For-Review: Reduce automatic messages on #wikimedia-operations - https://phabricator.wikimedia.org/T384804#10498758 (RLazarus) [20:45:50] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host logging-hd1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:45:59] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host logging-hd1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:48:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P72541 and previous config saved to /var/cache/conftool/dbconfig/20250127-204846-marostegui.json [20:50:49] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host logging-hd1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:56:22] ops-eqiad, SRE, collaboration-services, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T377607#10498829 (VRiley-WMF) IP address was not setup on the managment port. Reran the cookbook and it set it in place. This should be good to go. [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T2100). [21:00:05] cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:53] hi ! i will self-deploy [21:01:11] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logging-hd1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:01:46] (PS2) Phuedx: testwiki: Enable MetricsPlatform experiment enrollment overrides [mediawiki-config] - https://gerrit.wikimedia.org/r/1114396 (https://phabricator.wikimedia.org/T384728) [21:03:24] (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1114396 (https://phabricator.wikimedia.org/T384728) (owner: Phuedx) [21:03:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T384592)', diff saved to https://phabricator.wikimedia.org/P72542 and previous config saved to /var/cache/conftool/dbconfig/20250127-210353-marostegui.json [21:03:58] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [21:04:06] (Merged) jenkins-bot: testwiki: Enable MetricsPlatform experiment enrollment overrides [mediawiki-config] - https://gerrit.wikimedia.org/r/1114396 (https://phabricator.wikimedia.org/T384728) (owner: Phuedx) [21:04:09] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db2223.codfw.wmnet with reason: Maintenance [21:04:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2223 (T384592)', diff saved to https://phabricator.wikimedia.org/P72543 and previous config saved to /var/cache/conftool/dbconfig/20250127-210415-marostegui.json [21:04:26] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1114396|testwiki: Enable MetricsPlatform experiment enrollment overrides (T384728)]] [21:04:30] T384728: Enable MetricsPlatform overrides on testwiki - https://phabricator.wikimedia.org/T384728 [21:05:09] RECOVERY - Host logging-hd1005 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [21:08:07] !log
cjming@deploy2002 phuedx, cjming: Backport for [[gerrit:1114396|testwiki: Enable MetricsPlatform experiment enrollment overrides (T384728)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:08:13] !log cjming@deploy2002 phuedx, cjming: Continuing with sync [21:09:44] ops-eqiad, SRE, collaboration-services, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T377607#10498863 (VRiley-WMF) Open→Resolved [21:12:57] (CR) Gergő Tisza: "It just seems like there is a set of checks we'd need to repeat every time we add a new extension, or after certain changes in existing ex" [mediawiki-config] - https://gerrit.wikimedia.org/r/1114351 (owner: Gergő Tisza) [21:14:56] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1114396|testwiki: Enable MetricsPlatform experiment enrollment overrides (T384728)]] (duration: 10m 30s) [21:15:01] T384728: Enable MetricsPlatform overrides on testwiki - https://phabricator.wikimedia.org/T384728 [21:15:39] i'll hang out for a bit in case anyone else shows up -- then close the backport window [21:26:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T384592)', diff saved to https://phabricator.wikimedia.org/P72544 and previous config saved to /var/cache/conftool/dbconfig/20250127-212656-marostegui.json [21:27:01] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [21:27:20] ops-eqiad, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T384869#10498919 (VRiley-WMF) Open→Resolved a:VRiley-WMF Reseated power supply. It shows that it's normal now.
[21:29:35] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10498923 (phaultfinder) [21:33:15] !log end of UTC late backport window [21:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P72546 and previous config saved to /var/cache/conftool/dbconfig/20250127-214203-marostegui.json [21:57:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P72547 and previous config saved to /var/cache/conftool/dbconfig/20250127-215710-marostegui.json [22:00:04] Reedy, sbassett, Maryum, and manfredi: OwO what's this, a deployment window?? Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T2200). nyaa~ [22:12:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T384592)', diff saved to https://phabricator.wikimedia.org/P72548 and previous config saved to /var/cache/conftool/dbconfig/20250127-221217-marostegui.json [22:12:33] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db2228.codfw.wmnet with reason: Maintenance [22:12:48] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2186.codfw.wmnet with reason: Maintenance [22:12:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2228 (T384592)', diff saved to https://phabricator.wikimedia.org/P72549 and previous config saved to /var/cache/conftool/dbconfig/20250127-221255-marostegui.json [22:29:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T384592)', diff saved to https://phabricator.wikimedia.org/P72550 and previous config saved to
/var/cache/conftool/dbconfig/20250127-222947-marostegui.json [22:29:53] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [22:31:38] ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10499096 (VRiley-WMF) Is this okay to be closed? [22:37:10] (CR) Btullis: [V:+1 C:+2] Add the service_proxy to an-launcher1002 [puppet] - https://gerrit.wikimedia.org/r/1114376 (https://phabricator.wikimedia.org/T384329) (owner: Btullis) [22:38:03] (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/1114336 (https://phabricator.wikimedia.org/T383570) (owner: Filippo Giunchedi) [22:38:28] !log mstyles@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [22:39:10] !log mstyles@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [22:39:36] !log mstyles@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [22:39:50] !log mstyles@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [22:40:21] !log mstyles@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [22:40:47] !log mstyles@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [22:41:10] !log mstyles@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [22:41:12] !log mstyles@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [22:41:20] !log mstyles@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [22:41:23] !log mstyles@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [22:44:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P72552 and previous config saved to
/var/cache/conftool/dbconfig/20250127-224455-marostegui.json [22:45:36] !log pt1979@cumin1002 START - Cookbook sre.hosts.dhcp for host cloudgw1003.eqiad.wmnet [22:46:11] !log pt1979@cumin1002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host cloudgw1003.eqiad.wmnet [22:46:20] !log pt1979@cumin1002 START - Cookbook sre.hosts.dhcp for host cloudgw1003.eqiad.wmnet [22:51:17] (CR) Btullis: [V:+1 C:+2] "We added support for this today - in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1114383" [puppet] - https://gerrit.wikimedia.org/r/1114376 (https://phabricator.wikimedia.org/T384329) (owner: Btullis) [22:55:35] (PS1) Btullis: Add the mw-misc service_proxy listener to an-launcher1002 [puppet] - https://gerrit.wikimedia.org/r/1114470 (https://phabricator.wikimedia.org/T384329) [22:57:02] (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4870/co" [puppet] - https://gerrit.wikimedia.org/r/1114470 (https://phabricator.wikimedia.org/T384329) (owner: Btullis) [23:00:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P72553 and previous config saved to /var/cache/conftool/dbconfig/20250127-230002-marostegui.json [23:02:28] (Abandoned) Reedy: Improve error summary [extensions/UploadWizard] (wmf/1.44.0-wmf.11) - https://gerrit.wikimedia.org/r/1110757 (https://phabricator.wikimedia.org/T381333) (owner: Reedy) [23:02:31] (Abandoned) Reedy: Fix UW error summary [extensions/UploadWizard] (wmf/1.44.0-wmf.11) - https://gerrit.wikimedia.org/r/1110755 (https://phabricator.wikimedia.org/T383182) (owner: Reedy) [23:06:50] (CR) Btullis: [V:+1 C:+2] Add the mw-misc service_proxy listener to an-launcher1002 [puppet] - https://gerrit.wikimedia.org/r/1114470 (https://phabricator.wikimedia.org/T384329) (owner: Btullis) [23:08:38]
FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:14:19] (CR) Urbanecm: [C:-1] Add configurable MinimumTasksPerTopic (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1113984 (https://phabricator.wikimedia.org/T383714) (owner: Cyndywikime) [23:15:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T384592)', diff saved to https://phabricator.wikimedia.org/P72554 and previous config saved to /var/cache/conftool/dbconfig/20250127-231509-marostegui.json [23:15:14] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [23:22:35] ops-eqiad, SRE, Cloud-VPS, DC-Ops, cloud-services-team (Hardware): Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10499217 (VRiley-WMF) The servers were getting the IP address from private 1-C and private 1-D, and not from th... [23:51:31] FIRING: [5x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer