[00:03:37] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_gnmic.service on netflow1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:10:40] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 633.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:19:45] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:23:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:28:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:33:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:38:03] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1114146
[00:38:03] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1114146 (owner: TrainBranchBot)
[00:39:45] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:43:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:48:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:53:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:53:32] <wikibugs>	 (CR) CI reject: [V:-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1114146 (owner: TrainBranchBot)
[00:54:46] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:03:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:04:46] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:08:31] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1114147
[01:08:31] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1114147 (owner: TrainBranchBot)
[01:09:45] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:13:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:24:46] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:27:14] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1114147 (owner: TrainBranchBot)
[01:28:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:33:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:38:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:40:39] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1096-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[02:03:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:04:45] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:14:46] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:18:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:18:40] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:23:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:28:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:33:30] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:34:46] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:45] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:48:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:49:46] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:01:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:08:37] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:18:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:23:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:38:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:39:46] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:46:36] <wikibugs>	 (PS1) Ottomata: beta EventStreamConfig - set eventgate hoist_fields_from_http_headers [mediawiki-config] - https://gerrit.wikimedia.org/r/1114149 (https://phabricator.wikimedia.org/T382173)
[03:51:31] <jinxer-wm>	 FIRING: [5x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[03:53:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:58:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:03:38] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_gnmic.service on netflow1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:04:46] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:08:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:24:46] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:29:46] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:03:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:04:45] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:40:39] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1096-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[05:48:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:53:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:56:57] <wikibugs>	 SRE, DBA: db2182 depooled (Errmsg: Error Index for table recentchanges is corrupt) - https://phabricator.wikimedia.org/T384801#10495668 (Marostegui) a:Marostegui
[06:08:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:09:46] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:13:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:22:46] <wikibugs>	 (PS1) Marostegui: db2182: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1114258 (https://phabricator.wikimedia.org/T384801)
[06:23:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:23:38] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1241.eqiad.wmnet with reason: Index rebuild + upgrade
[06:24:42] <wikibugs>	 (CR) Marostegui: [C:+2] db2182: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1114258 (https://phabricator.wikimedia.org/T384801) (owner: Marostegui)
[06:24:43] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[06:28:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:33:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:43:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:44:46] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:48:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:49:45] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:54:00] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db2204.codfw.wmnet with reason: Maintenance
[07:08:37] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:18:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:19:46] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:20:50] <wikibugs>	 (PS1) Slyngshede: Failover IDP before reboot [dns] - https://gerrit.wikimedia.org/r/1114266
[07:23:08] <wikibugs>	 (CR) Slyngshede: [C:+2] Failover IDP before reboot [dns] - https://gerrit.wikimedia.org/r/1114266 (owner: Slyngshede)
[07:23:19] <logmsgbot>	 !log slyngshede@dns1004 START - running authdns-update
[07:25:09] <logmsgbot>	 !log slyngshede@dns1004 END - running authdns-update
[07:28:02] <hashar>	 jouncebot: nowandnext
[07:28:02] <jouncebot>	 For the next 0 hour(s) and 31 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250126T0800)
[07:28:02] <jouncebot>	 In 0 hour(s) and 31 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T0800)
[07:28:30] <hashar>	 hmm
[07:29:46] <jinxer-wm>	 RESOLVED: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:33:16] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:33:16] <wikibugs>	 (PS1) Samtar: IS: Enable wgUseCodexSpecialBlock on prod test.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1114324 (https://phabricator.wikimedia.org/T377121)
[07:33:16] <wikibugs>	 (CR) Samtar: [C:-2] "Do not merge: Blocked on T377121" [mediawiki-config] - https://gerrit.wikimedia.org/r/1114324 (https://phabricator.wikimedia.org/T377121) (owner: Samtar)
[07:35:48] <moritzm>	 !log installing tomcat10 security updates
[07:35:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:24] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1159.eqiad.wmnet with reason: Maintenance
[07:40:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1159 (T384592)', diff saved to https://phabricator.wikimedia.org/P72439 and previous config saved to /var/cache/conftool/dbconfig/20250127-074030-marostegui.json
[07:40:35] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[07:42:32] <wikibugs>	 SRE, DBA, Patch-For-Review: db2182 depooled (Errmsg: Error Index for table recentchanges is corrupt) - https://phabricator.wikimedia.org/T384801#10495691 (Marostegui)
[07:42:53] <wikibugs>	 SRE, DBA, Patch-For-Review: db2182 depooled (Errmsg: Error Index for table recentchanges is corrupt) - https://phabricator.wikimedia.org/T384801#10495693 (Marostegui) p:Triage→Medium The host was upgraded and the tables are now being rebuilt.
[07:43:16] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:44:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp2004.wikimedia.org
[07:45:20] <icinga-wm>	 PROBLEM - Disk space on grafana2001 is CRITICAL: DISK CRITICAL - free space: / 465MiB (3% inode=36%): /tmp 465MiB (3% inode=36%): /var/tmp 465MiB (3% inode=36%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops
[07:46:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp2004.wikimedia.org
[07:51:31] <jinxer-wm>	 FIRING: [5x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[07:52:50] <wikibugs>	 (PS1) Muehlenhoff: Failover back to idp2004 [dns] - https://gerrit.wikimedia.org/r/1114325
[07:57:24] <wikibugs>	 (CR) Slyngshede: [C:+1] "Looks good." [dns] - https://gerrit.wikimedia.org/r/1114325 (owner: Muehlenhoff)
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T0800).
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:01:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T384592)', diff saved to https://phabricator.wikimedia.org/P72440 and previous config saved to /var/cache/conftool/dbconfig/20250127-080112-marostegui.json
[08:01:18] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[08:02:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet
[08:02:32] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10495702 (ops-monitoring-bot) Draining ganeti2020.codfw.wmnet of running VMs
[08:03:37] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_gnmic.service on netflow1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:05:20] <icinga-wm>	 RECOVERY - Disk space on grafana2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops
[08:08:32] <wikibugs>	 SRE, Deployments, Release-Engineering-Team: Reduce automatic messages on #wikimedia-operations - https://phabricator.wikimedia.org/T384804 (hashar) NEW
[08:12:33] <moritzm>	 !log installing rsync regression updates on bullseye
[08:12:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P72441 and previous config saved to /var/cache/conftool/dbconfig/20250127-081619-marostegui.json
[08:21:29] <wikibugs>	 (PS1) DCausse: airflow: enable show_trigger_form_if_no_params [deployment-charts] - https://gerrit.wikimedia.org/r/1114327 (https://phabricator.wikimedia.org/T384805)
[08:23:27] <wikibugs>	 (PS1) Marostegui: mariadb: Remove es1023 [puppet] - https://gerrit.wikimedia.org/r/1114328 (https://phabricator.wikimedia.org/T384679)
[08:24:13] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts es1023.eqiad.wmnet
[08:26:18] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Remove es1023 [puppet] - https://gerrit.wikimedia.org/r/1114328 (https://phabricator.wikimedia.org/T384679) (owner: Marostegui)
[08:26:50] <wikibugs>	 (CR) Volans: k8s.pool-depool-node: Add support to downtime/remove downtime (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1114000 (owner: JMeybohm)
[08:30:03] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[08:30:30] <wikibugs>	 (CR) Gmodena: [C:+1] "LGTM." [deployment-charts] - https://gerrit.wikimedia.org/r/1114327 (https://phabricator.wikimedia.org/T384805) (owner: DCausse)
[08:30:54] <wikibugs>	 (PS1) Fabfur: hiera: add haproxykafka to esams [puppet] - https://gerrit.wikimedia.org/r/1114329 (https://phabricator.wikimedia.org/T378578)
[08:31:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P72442 and previous config saved to /var/cache/conftool/dbconfig/20250127-083126-marostegui.json
[08:32:07] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114329 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur)
[08:34:40] <wikibugs>	 (03CR) 10DCausse: "I agree but seems like none of our dags are using those, I would suggest to add this config while we agree and migrate existing DAGs to th" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114327 (https://phabricator.wikimedia.org/T384805) (owner: 10DCausse)
[08:37:10] <wikibugs>	 (03CR) 10Brouberol: "Nicely done! Do you want to test this on airflow-test-k8s before we merge, to make sure this does what we want?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114327 (https://phabricator.wikimedia.org/T384805) (owner: 10DCausse)
[08:37:50] <wikibugs>	 (03PS2) 10Fabfur: hiera: add haproxykafka to esams [puppet] - 10https://gerrit.wikimedia.org/r/1114329 (https://phabricator.wikimedia.org/T378578)
[08:38:04] <wikibugs>	 (03CR) 10DCausse: "sure! if possible please let me know how to do this, thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114327 (https://phabricator.wikimedia.org/T384805) (owner: 10DCausse)
[08:41:22] <wikibugs>	 06SRE, 06DBA, 13Patch-For-Review: db2182 depooled (Errmsg: Error Index for table recentchanges is corrupt) - https://phabricator.wikimedia.org/T384801#10495761 (10Marostegui)
[08:41:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1241 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72443 and previous config saved to /var/cache/conftool/dbconfig/20250127-084145-root.json
[08:42:01] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1023.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[08:42:16] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1023.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[08:42:16] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:42:17] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es1023.eqiad.wmnet
[08:42:34] <moritzm>	 !log installing gtk+3.0 security updates
[08:42:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:53] <wikibugs>	 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1023.eqiad.wmnet - https://phabricator.wikimedia.org/T384679#10495768 (10Marostegui) a:05Marostegui→03None
[08:43:09] <wikibugs>	 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1023.eqiad.wmnet - https://phabricator.wikimedia.org/T384679#10495773 (10Marostegui) This is ready for #dc-ops
[08:43:46] <wikibugs>	 (03PS1) 10Volans: netbox: use asctime in the logs [puppet] - 10https://gerrit.wikimedia.org/r/1114331 (https://phabricator.wikimedia.org/T379072)
[08:44:05] <wikibugs>	 (03PS2) 10Volans: netbox: use asctime in the logs [puppet] - 10https://gerrit.wikimedia.org/r/1114331 (https://phabricator.wikimedia.org/T379072)
[08:44:43] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: sre.netbox.update-extras hits KeyError with logging - https://phabricator.wikimedia.org/T379072#10495782 (10Volans) I've sent the above patch that I think should fix the issue.
[08:46:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T384592)', diff saved to https://phabricator.wikimedia.org/P72444 and previous config saved to /var/cache/conftool/dbconfig/20250127-084633-marostegui.json
[08:46:38] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[08:46:49] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[08:47:06] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[08:47:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1161 (T384592)', diff saved to https://phabricator.wikimedia.org/P72445 and previous config saved to /var/cache/conftool/dbconfig/20250127-084713-marostegui.json
[08:48:02] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114329 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur)
[08:48:05] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "Really nice!" [puppet] - 10https://gerrit.wikimedia.org/r/1114331 (https://phabricator.wikimedia.org/T379072) (owner: 10Volans)
[08:48:15] <wikibugs>	 (03PS1) 10Marostegui: rebuild_tables.sh: Add linter [software] - 10https://gerrit.wikimedia.org/r/1114332 (https://phabricator.wikimedia.org/T384799)
[08:48:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1223', diff saved to https://phabricator.wikimedia.org/P72446 and previous config saved to /var/cache/conftool/dbconfig/20250127-084857-marostegui.json
[08:49:12] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1223.eqiad.wmnet
[08:49:32] <marostegui>	 !log Upgrade db1223 T384807
[08:49:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:36] <stashbot>	 T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[08:50:41] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1223.eqiad.wmnet with reason: Index rebuild + upgrade
[08:51:53] <wikibugs>	 (03CR) 10Marostegui: "FYI guys!" [software] - 10https://gerrit.wikimedia.org/r/1114332 (https://phabricator.wikimedia.org/T384799) (owner: 10Marostegui)
[08:51:55] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] rebuild_tables.sh: Add linter [software] - 10https://gerrit.wikimedia.org/r/1114332 (https://phabricator.wikimedia.org/T384799) (owner: 10Marostegui)
[08:52:20] <wikibugs>	 (03Merged) 10jenkins-bot: rebuild_tables.sh: Add linter [software] - 10https://gerrit.wikimedia.org/r/1114332 (https://phabricator.wikimedia.org/T384799) (owner: 10Marostegui)
[08:54:45] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1223.eqiad.wmnet
[08:55:01] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] "hieradata/hosts/cp3066.yaml is no longer needed" [puppet] - 10https://gerrit.wikimedia.org/r/1114329 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur)
[08:55:52] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.11 point update - https://phabricator.wikimedia.org/T373795#10495816 (10MoritzMuehlenhoff)
[08:56:01] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1223.eqiad.wmnet with reason: Index rebuild
[08:56:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1241 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72447 and previous config saved to /var/cache/conftool/dbconfig/20250127-085650-root.json
[08:56:57] <moritzm>	 !log installing net-tools bugfix updates on bullseye
[08:57:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:08] <wikibugs>	 (03PS1) 10Marostegui: rebuild_tables.sh: Add downtime [software] - 10https://gerrit.wikimedia.org/r/1114333 (https://phabricator.wikimedia.org/T382842)
[08:58:00] <wikibugs>	 (03CR) 10Marostegui: "FYI" [software] - 10https://gerrit.wikimedia.org/r/1114333 (https://phabricator.wikimedia.org/T382842) (owner: 10Marostegui)
[08:58:45] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] rebuild_tables.sh: Add downtime [software] - 10https://gerrit.wikimedia.org/r/1114333 (https://phabricator.wikimedia.org/T382842) (owner: 10Marostegui)
[08:59:54] <wikibugs>	 (03Merged) 10jenkins-bot: rebuild_tables.sh: Add downtime [software] - 10https://gerrit.wikimedia.org/r/1114333 (https://phabricator.wikimedia.org/T382842) (owner: 10Marostegui)
[09:00:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Failover back to idp2004 [dns] - 10https://gerrit.wikimedia.org/r/1114325 (owner: 10Muehlenhoff)
[09:00:25] <logmsgbot>	 !log jmm@dns1004 START - running authdns-update
[09:01:20] <wikibugs>	 (03PS3) 10Fabfur: hiera: add haproxykafka to esams [puppet] - 10https://gerrit.wikimedia.org/r/1114329 (https://phabricator.wikimedia.org/T378578)
[09:02:15] <logmsgbot>	 !log jmm@dns1004 END - running authdns-update
[09:07:49] <icinga-wm>	 RECOVERY - Host ms-fe1014 is UP: PING OK - Packet loss = 0%, RTA = 80.20 ms
[09:08:04] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 10observability, and 2 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10495846 (10JMeybohm)
[09:08:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T384592)', diff saved to https://phabricator.wikimedia.org/P72448 and previous config saved to /var/cache/conftool/dbconfig/20250127-090833-marostegui.json
[09:08:38] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[09:11:52] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.11 point update - https://phabricator.wikimedia.org/T373795#10495849 (10MoritzMuehlenhoff)
[09:11:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1241 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72449 and previous config saved to /var/cache/conftool/dbconfig/20250127-091155-root.json
[09:14:13] <icinga-wm>	 PROBLEM - Host ms-fe1014 is DOWN: PING CRITICAL - Packet loss = 100%
[09:14:47] <wikibugs>	 14SRE-Sprint-Week-Sustainability-March2023, 06Data-Persistence-Automations, 06DBA, 13Patch-For-Review, 10Sustainability (Incident Followup): Implement (or refactor) a script to move slaves when the master is not available - https://phabricator.wikimedia.org/T196366#10495866 (10FCeratto-WMF)
[09:16:18] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch an-test-presto1001 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1111180 (owner: 10Muehlenhoff)
[09:23:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P72450 and previous config saved to /var/cache/conftool/dbconfig/20250127-092340-marostegui.json
[09:25:16] <wikibugs>	 (03PS1) 10Filippo Giunchedi: thanos: send sigkill as needed to stateless components [puppet] - 10https://gerrit.wikimedia.org/r/1114336 (https://phabricator.wikimedia.org/T383570)
[09:25:39] <wikibugs>	 (03PS3) 10JMeybohm: k8s.pool-depool-node: Add support to downtime/remove downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1114000
[09:27:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1241 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72451 and previous config saved to /var/cache/conftool/dbconfig/20250127-092701-root.json
[09:27:40] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Neslihan Turan - WMDE - https://phabricator.wikimedia.org/T384017#10495902 (10jcrespo) a:03jcrespo
[09:27:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-test-presto1001.eqiad.wmnet
[09:29:22] <wikibugs>	 (03CR) 10JMeybohm: k8s.pool-depool-node: Add support to downtime/remove downtime (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1114000 (owner: 10JMeybohm)
[09:32:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-presto1001.eqiad.wmnet
[09:32:06] <wikibugs>	 (03CR) 10CI reject: [V:04-1] k8s.pool-depool-node: Add support to downtime/remove downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1114000 (owner: 10JMeybohm)
[09:32:14] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Fix lvs::realserver::pools config for text and upload [puppet] - 10https://gerrit.wikimedia.org/r/1114337
[09:33:45] <wikibugs>	 (03CR) 10Jelto: "two comments in-line" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113980 (owner: 10JMeybohm)
[09:35:16] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114337 (owner: 10Vgutierrez)
[09:38:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P72452 and previous config saved to /var/cache/conftool/dbconfig/20250127-093847-marostegui.json
[09:40:39] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1096-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[09:40:54] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] benthos: add nocookies and tls session metadata [puppet] - 10https://gerrit.wikimedia.org/r/1112248 (https://phabricator.wikimedia.org/T383900) (owner: 10Filippo Giunchedi)
[09:41:16] <wikibugs>	 (03PS4) 10JMeybohm: CI: Fix helm errors hiding behind YAML parser [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113980
[09:41:30] <wikibugs>	 (03CR) 10JMeybohm: CI: Fix helm errors hiding behind YAML parser (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113980 (owner: 10JMeybohm)
[09:42:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1241 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72453 and previous config saved to /var/cache/conftool/dbconfig/20250127-094206-root.json
[09:47:32] <moritzm>	 !log reimaging rpki1001 to bookworm
[09:47:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host rpki1001.eqiad.wmnet with OS bookworm
[09:49:21] <wikibugs>	 (03CR) 10Jcrespo: [C:03+1] No longer import prometheus-mysqld-exporter from bullseye-backports [puppet] - 10https://gerrit.wikimedia.org/r/1111269 (https://phabricator.wikimedia.org/T383557) (owner: 10Muehlenhoff)
[09:49:30] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm as far as I can tell" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113980 (owner: 10JMeybohm)
[09:50:16] <wikibugs>	 (03CR) 10Fabfur: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1114329 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur)
[09:50:27] <wikibugs>	 06SRE, 13Patch-For-Review: Add x-analytics nocookie=1 and x-tls-sess to webrequest-sampled-live stream - https://phabricator.wikimedia.org/T383900#10496038 (10fgiunchedi) `tls_sess` and `nocookies` fields are now part of `webrequest_sampled` topic!
[09:53:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T384592)', diff saved to https://phabricator.wikimedia.org/P72454 and previous config saved to /var/cache/conftool/dbconfig/20250127-095354-marostegui.json
[09:53:59] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[09:54:10] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1185.eqiad.wmnet with reason: Maintenance
[09:54:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1185 (T384592)', diff saved to https://phabricator.wikimedia.org/P72455 and previous config saved to /var/cache/conftool/dbconfig/20250127-095416-marostegui.json
[09:55:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:04:02] <logmsgbot>	 !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on cr[1-2]-magru,cr[1-2]-magru IPv6 with reason: upgrading JunOS on magru core routers
[10:04:26] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[10:05:12] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "forgot to actually hit +1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113979 (owner: 10JMeybohm)
[10:07:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet
[10:09:08] <wikibugs>	 (03PS1) 10Jcrespo: admin: Add neslihanturan to the list of privileged LDAP-only users [puppet] - 10https://gerrit.wikimedia.org/r/1114341 (https://phabricator.wikimedia.org/T384017)
[10:09:27] <wikibugs>	 06SRE, 10Deployments, 06Release-Engineering-Team, 13Patch-For-Review: Reduce automatic messages on #wikimedia-operations - https://phabricator.wikimedia.org/T384804#10496117 (10Volans) >  I believe SRE are instead using their own private channel.  It's `#wikimedia-sre` and it's a public channel (as mention...
[10:09:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test2005.wikimedia.org
[10:11:54] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1114341 (https://phabricator.wikimedia.org/T384017) (owner: 10Jcrespo)
[10:11:59] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore[1007,1009].eqiad.wmnet with reason: Index rebuild + upgrade
[10:13:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test2005.wikimedia.org
[10:14:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T384592)', diff saved to https://phabricator.wikimedia.org/P72456 and previous config saved to /var/cache/conftool/dbconfig/20250127-101401-marostegui.json
[10:14:06] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[10:16:46] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] admin: Add neslihanturan to the list of privileged LDAP-only users [puppet] - 10https://gerrit.wikimedia.org/r/1114341 (https://phabricator.wikimedia.org/T384017) (owner: 10Jcrespo)
[10:16:54] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore[1007,1009].eqiad.wmnet with reason: Index rebuild + upgrade
[10:18:59] <wikibugs>	 06SRE, 10Deployments, 06Release-Engineering-Team, 13Patch-For-Review: Reduce automatic messages on #wikimedia-operations - https://phabricator.wikimedia.org/T384804#10496179 (10hashar)
[10:20:06] <topranks>	 !log installing updated JunOS image on cr2-magru T384774
[10:20:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:10] <stashbot>	 T384774: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774
[10:20:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test2004.wikimedia.org
[10:23:15] <wikibugs>	 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmde for Neslihan Turan - WMDE - https://phabricator.wikimedia.org/T384017#10496200 (10jcrespo) 05Open→03Resolved Your account, @Neslihan_Turan_WMDE, already appears as a member of the NDA and WMDE groups: https://ldap.toolforge....
[10:24:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test2004.wikimedia.org
[10:25:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:26:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1024 T384820', diff saved to https://phabricator.wikimedia.org/P72457 and previous config saved to /var/cache/conftool/dbconfig/20250127-102657-marostegui.json
[10:27:02] <stashbot>	 T384820: decommission es1024.eqiad.wmnet - https://phabricator.wikimedia.org/T384820
[10:27:31] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti2020 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[10:27:43] <wikibugs>	 (03PS1) 10Marostegui: es1024: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1114342 (https://phabricator.wikimedia.org/T384820)
[10:28:03] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114329 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur)
[10:28:29] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] es1024: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1114342 (https://phabricator.wikimedia.org/T384820) (owner: 10Marostegui)
[10:29:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P72458 and previous config saved to /var/cache/conftool/dbconfig/20250127-102908-marostegui.json
[10:29:59] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] hiera: add haproxykafka to esams [puppet] - 10https://gerrit.wikimedia.org/r/1114329 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur)
[10:31:45] <wikibugs>	 (03CR) 10Fabfur: "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1114329 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur)
[10:31:47] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] hiera: add haproxykafka to esams [puppet] - 10https://gerrit.wikimedia.org/r/1114329 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur)
[10:33:44] <logmsgbot>	 !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host rpki1001.eqiad.wmnet with OS bookworm
[10:34:05] <fabfur>	 !log installing haproxykafka on esams (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1114329) (T378578)
[10:34:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:09] <stashbot>	 T378578: Rollout haproxykafka on all hosts - https://phabricator.wikimedia.org/T378578
[10:34:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host rpki1001.eqiad.wmnet with OS bookworm
[10:35:06] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[10:35:47] <wikibugs>	 (03CR) 10Klausman: "Thanks a ton for your work on this!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114012 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey)
[10:36:28] <logmsgbot>	 !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host rpki1001.eqiad.wmnet with OS bookworm
[10:36:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host rpki1001.eqiad.wmnet with OS bookworm
[10:37:46] <logmsgbot>	 !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1171.eqiad.wmnet with reason: reimage
[10:38:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test1004.wikimedia.org
[10:40:07] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[10:40:46] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:40:48] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:42:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test1004.wikimedia.org
[10:42:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet
[10:42:23] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10496316 (10ops-monitoring-bot) Draining ganeti2025.codfw.wmnet of running VMs
[10:43:18] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch ganeti2025 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1114346
[10:43:43] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.hosts.reimage for host db1171.eqiad.wmnet with OS bookworm
[10:43:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet
[10:43:58] <vgutierrez>	 !log testing pybal 1.15.15 in lvs4010
[10:44:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P72459 and previous config saved to /var/cache/conftool/dbconfig/20250127-104415-marostegui.json
[10:47:38] <topranks>	 !log rebooting cr2-magru to complete upgrade T384774
[10:47:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:42] <stashbot>	 T384774: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774
[10:50:22] <icinga-wm>	 PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv6: Idle - wmf_public_asn, AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:50:28] <icinga-wm>	 PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv6: Idle - wmf_public_asn, AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:50:40] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1113478 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[10:50:48] <icinga-wm>	 PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:50:54] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:54:48] <topranks>	 ^^ this is due to cr2-magru rebooting, all ok
[10:54:48] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:54:48] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:56:32] <wikibugs>	 06SRE, 06DBA, 13Patch-For-Review: db2182 depooled (Errmsg: Error Index for table recentchanges is corrupt) - https://phabricator.wikimedia.org/T384801#10496376 (10Marostegui) Tables rebuilt, host catching up.
[10:56:43] <logmsgbot>	 !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on asw1-b[3-4]-magru.mgmt with reason: upgrading JunOS on magru core routers
[10:58:18] <Lucas_WMDE>	 FTR, I probably won’t be able to do the UTC afternoon backport window today
[10:58:23] <icinga-wm>	 RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:58:30] <icinga-wm>	 RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:58:50] <icinga-wm>	 RECOVERY - BFD status on cr2-eqdfw is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:58:54] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:59:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T384592)', diff saved to https://phabricator.wikimedia.org/P72460 and previous config saved to /var/cache/conftool/dbconfig/20250127-105922-marostegui.json
[10:59:27] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[10:59:38] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[10:59:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T384592)', diff saved to https://phabricator.wikimedia.org/P72461 and previous config saved to /var/cache/conftool/dbconfig/20250127-105944-marostegui.json
[10:59:48] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2185.codfw.wmnet
[10:59:59] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix preseed pattern for cloudcephosd hosts [puppet] - 10https://gerrit.wikimedia.org/r/1114349
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T1100)
[11:00:09] <wikibugs>	 (03PS2) 10Muehlenhoff: Fix preseed pattern for cloudcephosd hosts [puppet] - 10https://gerrit.wikimedia.org/r/1114349
[11:04:13] <logmsgbot>	 !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host rpki1001.eqiad.wmnet with OS bookworm
[11:04:26] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2185.codfw.wmnet
[11:08:38] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:09:07] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1114349 (owner: 10Muehlenhoff)
[11:11:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Fix preseed pattern for cloudcephosd hosts [puppet] - 10https://gerrit.wikimedia.org/r/1114349 (owner: 10Muehlenhoff)
[11:14:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet
[11:14:57] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2025.codfw.wmnet
[11:17:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host rpki1001.eqiad.wmnet with OS bookworm
[11:19:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T384592)', diff saved to https://phabricator.wikimedia.org/P72462 and previous config saved to /var/cache/conftool/dbconfig/20250127-111924-marostegui.json
[11:19:29] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[11:19:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster2003.codfw.wmnet to drbd
[11:19:57] <wikibugs>	 (CR) JMeybohm: [C:+1] drivers.py: add container_limits to the Docker driver (1 comment) [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1113477 (owner: Elukey)
[11:20:15] <topranks>	 !log installing updated JunOS image on cr1-magru T384774
[11:20:16] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10496490 (10ops-monitoring-bot) VM kubestagemaster2003.codfw.wmnet switching disk type to drbd
[11:20:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:19] <stashbot>	 T384774: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774
[11:23:54] <wikibugs>	 (03PS1) 10Gergő Tisza: Add machine-readable markings for SUL3 extension denylist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114351
[11:24:06] <wikibugs>	 (03PS1) 10Vgutierrez: wmflib,pybal: Add scheduler_flag support [puppet] - 10https://gerrit.wikimedia.org/r/1114352 (https://phabricator.wikimedia.org/T373027)
[11:24:47] <wikibugs>	 (03CR) 10TChin: [C:03+2] mw-content-history-reconcile-enrich: Add HA storageDir and Ceph egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109448 (https://phabricator.wikimedia.org/T375176) (owner: 10TChin)
[11:25:11] <logmsgbot>	 !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' .
[11:26:10] <wikibugs>	 (03Merged) 10jenkins-bot: mw-content-history-reconcile-enrich: Add HA storageDir and Ceph egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109448 (https://phabricator.wikimedia.org/T375176) (owner: 10TChin)
[11:26:28] <effie>	 jouncebot: now
[11:26:28] <jouncebot>	 For the next 0 hour(s) and 33 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T1100)
[11:27:41] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] mw-(api-ext|web): scale next to 10% of main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114005 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[11:27:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on rpki1001.eqiad.wmnet with reason: host reimage
[11:29:38] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10496529 (10cmooney)
[11:29:38] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] mw-(api-ext|web): scale next to 10% of main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114005 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[11:30:25] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10496532 (10cmooney)
[11:30:45] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] mw-(api-ext|web): scale next to 10% of main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114005 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[11:31:18] <wikibugs>	 (03Merged) 10jenkins-bot: mw-(api-ext|web): scale next to 10% of main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114005 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[11:32:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rpki1001.eqiad.wmnet with reason: host reimage
[11:33:03] <logmsgbot>	 !log tchin@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[11:33:09] <logmsgbot>	 !log tchin@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[11:34:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P72463 and previous config saved to /var/cache/conftool/dbconfig/20250127-113431-marostegui.json
[11:34:46] <topranks>	 !log rebooting cr1-magru to complete upgrade T384774
[11:34:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:50] <stashbot>	 T384774: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774
[11:35:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster2003.codfw.wmnet to drbd
[11:35:41] <icinga-wm>	 PROBLEM - Host kubestagemaster2003 is DOWN: PING CRITICAL - Packet loss = 100%
[11:35:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet
[11:36:18] <moritzm>	 ^ expected, temporarily changing disk image to reimage a ganeti node
[11:36:25] <icinga-wm>	 RECOVERY - Host kubestagemaster2003 is UP: PING OK - Packet loss = 0%, RTA = 30.79 ms
[11:36:26] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10496565 (10ops-monitoring-bot) Draining ganeti2025.codfw.wmnet of running VMs
[11:36:28] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[11:36:57] <jinxer-wm>	 FIRING: KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2003.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[11:37:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet
[11:37:28] <wikibugs>	 (03PS2) 10Vgutierrez: wmflib,pybal: Add scheduler_flag support [puppet] - 10https://gerrit.wikimedia.org/r/1114352 (https://phabricator.wikimedia.org/T373027)
[11:37:45] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:37:45] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:37:47] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10496573 (10MoritzMuehlenhoff)
[11:38:31] <topranks>	 ^^ these are due to cr1-magru reboot 
[11:38:37] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: wmf_auto_restart_gnmic.service on netflow1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:38:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster2003.codfw.wmnet to plain
[11:39:18] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10496584 (10ops-monitoring-bot) VM kubestagemaster2003.codfw.wmnet switching disk type to plain
[11:39:18] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[11:39:19] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1215.eqiad.wmnet
[11:39:31] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[11:39:42] <marostegui>	 !log Upgrade and reboot zarcillo/orchestrator database db1215
[11:39:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:40:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster2003.codfw.wmnet to plain
[11:40:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet
[11:41:07] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10496608 (10ops-monitoring-bot) Draining ganeti2025.codfw.wmnet of running VMs
[11:41:43] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[11:41:57] <jinxer-wm>	 RESOLVED: KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2003.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[11:41:59] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[11:44:35] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114352 (https://phabricator.wikimedia.org/T373027) (owner: 10Vgutierrez)
[11:44:49] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:44:53] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[11:45:06] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1215.eqiad.wmnet
[11:45:49] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:45:58] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[11:46:10] <wikibugs>	 (03PS1) 10Kamila Součková: wikikube: rename parse10[18-24] to wikikube-worker* [puppet] - 10https://gerrit.wikimedia.org/r/1114354 (https://phabricator.wikimedia.org/T365571)
[11:47:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host rpki1001.eqiad.wmnet with OS bookworm
[11:49:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P72465 and previous config saved to /var/cache/conftool/dbconfig/20250127-114938-marostegui.json
[11:50:15] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) (owner: 10Effie Mouzeli)
[11:50:15] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[11:51:31] <jinxer-wm>	 FIRING: [5x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[11:52:14] <wikibugs>	 (03Merged) 10jenkins-bot: mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) (owner: 10Effie Mouzeli)
[11:52:22] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10496659 (10Papaul)
[11:52:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 10%: Repooling T384807', diff saved to https://phabricator.wikimedia.org/P72466 and previous config saved to /var/cache/conftool/dbconfig/20250127-115239-root.json
[11:52:44] <stashbot>	 T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[11:53:22] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10496664 (10Papaul) @Jhancock.wm you can move ganeti2020 anytime today. Once done just ping @MoritzMuehlenhoff ....
[11:53:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:55:35] <wikibugs>	 (03PS1) 10Vgutierrez: service: Add scheduler_flag field to ServiceLVS [software/spicerack] - 10https://gerrit.wikimedia.org/r/1114356 (https://phabricator.wikimedia.org/T373027)
[11:56:30] <logmsgbot>	 !log root@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db1171.eqiad.wmnet with OS bookworm
[11:58:00] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.hosts.reimage for host db1171.eqiad.wmnet with OS bookworm
[12:02:47] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti2020 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1113963 (owner: 10Muehlenhoff)
[12:04:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T384592)', diff saved to https://phabricator.wikimedia.org/P72467 and previous config saved to /var/cache/conftool/dbconfig/20250127-120445-marostegui.json
[12:04:50] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[12:05:01] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1210.eqiad.wmnet with reason: Maintenance
[12:05:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1210 (T384592)', diff saved to https://phabricator.wikimedia.org/P72468 and previous config saved to /var/cache/conftool/dbconfig/20250127-120507-marostegui.json
[12:06:39] <wikibugs>	 (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/1114015 (https://phabricator.wikimedia.org/T373020) (owner: 10Vgutierrez)
[12:07:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 25%: Repooling T384807', diff saved to https://phabricator.wikimedia.org/P72469 and previous config saved to /var/cache/conftool/dbconfig/20250127-120744-root.json
[12:07:49] <stashbot>	 T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[12:08:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:08:58] <moritzm>	 !log installing git-lfs security updates
[12:09:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:39] <wikibugs>	 (03CR) 10Vgutierrez: "`swift` and `swift-https` services are the only services defined on `hieradata/common/service.yaml` from the LVS PoV. `swift-https` LVS se" [puppet] - 10https://gerrit.wikimedia.org/r/1114015 (https://phabricator.wikimedia.org/T373020) (owner: 10Vgutierrez)
[12:12:41] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1171.eqiad.wmnet with reason: host reimage
[12:15:27] <effie>	 !jouncebot now
[12:15:27] <wm-bot>	 a Python reminder bot for deployments. see https://wikitech.wikimedia.org/wiki/Tool:Jouncebot
[12:15:32] <effie>	 !jouncebot next
[12:15:32] <wm-bot>	 a Python reminder bot for deployments. see https://wikitech.wikimedia.org/wiki/Tool:Jouncebot
[12:16:39] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1171.eqiad.wmnet with reason: host reimage
[12:17:22] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114365
[12:18:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2227 T384807', diff saved to https://phabricator.wikimedia.org/P72470 and previous config saved to /var/cache/conftool/dbconfig/20250127-121843-marostegui.json
[12:18:48] <stashbot>	 T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[12:19:01] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2227.codfw.wmnet
[12:21:45] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM, this can be merged anytime as the new property has a default value" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1114356 (https://phabricator.wikimedia.org/T373027) (owner: 10Vgutierrez)
[12:22:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 50%: Repooling T384807', diff saved to https://phabricator.wikimedia.org/P72471 and previous config saved to /var/cache/conftool/dbconfig/20250127-122249-root.json
[12:23:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T384592)', diff saved to https://phabricator.wikimedia.org/P72472 and previous config saved to /var/cache/conftool/dbconfig/20250127-122320-marostegui.json
[12:23:25] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[12:25:02] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2227.codfw.wmnet
[12:27:04] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] No longer import prometheus-mysqld-exporter from bullseye-backports [puppet] - 10https://gerrit.wikimedia.org/r/1111269 (https://phabricator.wikimedia.org/T383557) (owner: 10Muehlenhoff)
[12:29:12] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2227.codfw.wmnet with reason: Index rebuild
[12:31:21] <wikibugs>	 10ops-magru, 06Infrastructure-Foundations, 10netops: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774#10496715 (10RobH)
[12:37:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 75%: Repooling T384807', diff saved to https://phabricator.wikimedia.org/P72473 and previous config saved to /var/cache/conftool/dbconfig/20250127-123754-root.json
[12:38:00] <stashbot>	 T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[12:38:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P72474 and previous config saved to /var/cache/conftool/dbconfig/20250127-123827-marostegui.json
[12:39:15] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1171.eqiad.wmnet with OS bookworm
[12:50:39] <jinxer-wm>	 RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1096-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[12:53:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 100%: Repooling T384807', diff saved to https://phabricator.wikimedia.org/P72476 and previous config saved to /var/cache/conftool/dbconfig/20250127-125301-root.json
[12:53:05] <wikibugs>	 (03PS1) 10Marostegui: Revert "db2182: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1114375
[12:53:06] <stashbot>	 T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[12:53:10] <wikibugs>	 (03CR) 10Marostegui: [C:04-2] "Not yet" [puppet] - 10https://gerrit.wikimedia.org/r/1114375 (owner: 10Marostegui)
[12:53:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P72477 and previous config saved to /var/cache/conftool/dbconfig/20250127-125334-marostegui.json
[12:54:05] <wikibugs>	 (03PS1) 10Btullis: Add the service_proxy to an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/1114376 (https://phabricator.wikimedia.org/T384329)
[12:54:26] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add the service_proxy to an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/1114376 (https://phabricator.wikimedia.org/T384329) (owner: 10Btullis)
[12:55:27] <wikibugs>	 (03PS2) 10Btullis: Add the service_proxy to an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/1114376 (https://phabricator.wikimedia.org/T384329)
[12:56:55] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4862/co" [puppet] - 10https://gerrit.wikimedia.org/r/1114376 (https://phabricator.wikimedia.org/T384329) (owner: 10Btullis)
[12:57:31] <wikibugs>	 (03PS2) 10Anzx: srwiki: add incubator as importsource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114378 (https://phabricator.wikimedia.org/T384069)
[12:57:50] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114378 (https://phabricator.wikimedia.org/T384069) (owner: 10Anzx)
[12:58:12] <wikibugs>	 06SRE, 06Data-Engineering, 06Data-Engineering-Icebox, 10observability, and 2 others: Retire udp2log: onboard its producers and consumers to the logging pipeline - https://phabricator.wikimedia.org/T205856#10496789 (10fgiunchedi) Something that occurred to me while looking at {T366710}: with mw-to-k8s we ar...
[13:00:30] <icinga-wm>	 PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 1200MiB (0% inode=94%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops
[13:01:33] <wikibugs>	 (03PS3) 10Anzx: enwiki: temporary lift of IP cap for 31 January and 1 February 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114372 (https://phabricator.wikimedia.org/T384680)
[13:01:55] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114372 (https://phabricator.wikimedia.org/T384680) (owner: 10Anzx)
[13:07:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet
[13:08:24] <wikibugs>	 (CR) Elukey: drivers.py: add container_limits to the Docker driver (1 comment) [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1113477 (owner: Elukey)
[13:08:32] <wikibugs>	 (03CR) 10Elukey: [C:03+2] drivers.py: add container_limits to the Docker driver [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1113477 (owner: 10Elukey)
[13:08:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T384592)', diff saved to https://phabricator.wikimedia.org/P72478 and previous config saved to /var/cache/conftool/dbconfig/20250127-130841-marostegui.json
[13:08:46] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[13:08:46] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[13:10:36] <logmsgbot>	 !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1013.eqiad.wmnet with OS bullseye
[13:10:57] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] dsh: empty scap proxy list [puppet] - 10https://gerrit.wikimedia.org/r/1112714 (https://phabricator.wikimedia.org/T384196) (owner: 10Hnowlan)
[13:11:00] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1013.eqiad.wmnet with OS bullseye
[13:12:26] <wikibugs>	 (03Merged) 10jenkins-bot: drivers.py: add container_limits to the Docker driver [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1113477 (owner: 10Elukey)
[13:13:21] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10496815 (10elukey)
[13:13:30] <logmsgbot>	 !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2025.codfw.wmnet with reason: remove from cluster for reimage
[13:13:34] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10496817 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=4302b551-98b7-475e-9fb4-959f5c56a6cc) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(...
[13:14:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti2025 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1114346 (owner: 10Muehlenhoff)
[13:15:08] <wikibugs>	 (03CR) 10Marostegui: Revert "db2182: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1114375 (owner: 10Marostegui)
[13:15:09] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "db2182: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1114375 (owner: 10Marostegui)
[13:15:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2182 (re)pooling @ 10%: Repooling T384801', diff saved to https://phabricator.wikimedia.org/P72479 and previous config saved to /var/cache/conftool/dbconfig/20250127-131554-root.json
[13:15:59] <stashbot>	 T384801: db2182 depooled (Errmsg: Error Index for table recentchanges is corrupt) - https://phabricator.wikimedia.org/T384801
[13:16:11] <wikibugs>	 SRE, DBA, Patch-For-Review: db2182 depooled (Errmsg: Error Index for table recentchanges is corrupt) - https://phabricator.wikimedia.org/T384801#10496825 (Marostegui) Open→Resolved Host being repooled automatically.
[13:18:50] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2140.codfw.wmnet
[13:23:44] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.dns.netbox
[13:23:54] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] "🥳" [puppet] - 10https://gerrit.wikimedia.org/r/1114354 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková)
[13:25:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2025.codfw.wmnet with OS bookworm
[13:26:02] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10496842 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2025.codfw.wmnet with OS bookworm
[13:26:44] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1013.eqiad.wmnet with reason: host reimage
[13:27:06] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[13:27:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1157 T384807', diff saved to https://phabricator.wikimedia.org/P72480 and previous config saved to /var/cache/conftool/dbconfig/20250127-132710-marostegui.json
[13:27:15] <stashbot>	 T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[13:27:32] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1157.eqiad.wmnet
[13:27:33] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2140.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1002"
[13:28:00] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1230.eqiad.wmnet with reason: Maintenance
[13:28:00] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2140.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1002"
[13:28:01] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:28:01] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2140.codfw.wmnet
[13:28:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1230 (T384592)', diff saved to https://phabricator.wikimedia.org/P72481 and previous config saved to /var/cache/conftool/dbconfig/20250127-132806-marostegui.json
[13:28:14] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[13:28:41] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+1] site.pp, db2140.yaml: remove db2140 [puppet] - 10https://gerrit.wikimedia.org/r/1113967 (https://phabricator.wikimedia.org/T384480) (owner: 10Federico Ceratto)
[13:28:58] <wikibugs>	 (03PS2) 10Federico Ceratto: site.pp, db2140.yaml: remove db2140 [puppet] - 10https://gerrit.wikimedia.org/r/1113967 (https://phabricator.wikimedia.org/T384480)
[13:31:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2182 (re)pooling @ 25%: Repooling T384801', diff saved to https://phabricator.wikimedia.org/P72482 and previous config saved to /var/cache/conftool/dbconfig/20250127-133059-root.json
[13:31:06] <stashbot>	 T384801: db2182 depooled (Errmsg: Error Index for table recentchanges is corrupt) - https://phabricator.wikimedia.org/T384801
[13:31:18] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] site.pp, db2140.yaml: remove db2140 [puppet] - 10https://gerrit.wikimedia.org/r/1113967 (https://phabricator.wikimedia.org/T384480) (owner: 10Federico Ceratto)
[13:32:06] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[13:32:25] <moritzm>	 !log installing runc security updates on bullseye
[13:32:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:31] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1013.eqiad.wmnet with reason: host reimage
[13:34:04] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1157.eqiad.wmnet
[13:34:53] <wikibugs>	 (03PS1) 10Urbanecm: [Growth] enwiki: Release Add Link to 10% of newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114379 (https://phabricator.wikimedia.org/T384551)
[13:34:54] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1157.eqiad.wmnet with reason: Index rebuild
[13:35:46] <federico3>	 !log Removing db2140 from zarcillo T384480
[13:35:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:50] <stashbot>	 T384480: decommission db2140.codfw.wmnet - https://phabricator.wikimedia.org/T384480
[13:36:32] <wikibugs>	 (03PS1) 10TChin: mw-content-history-reconcile-enrich: Use full kafka test fqdn [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114380 (https://phabricator.wikimedia.org/T375176)
[13:38:24] <wikibugs>	 10ops-magru, 06DC-Ops: hw troubleshooting: Power supply failure (PSU) for cp7001.magru.wmnet and cp7006.magru.wmnet - https://phabricator.wikimedia.org/T381446#10496896 (10RobH) Summary of case updates since 22nd: * Dell opened the case and requested the TSR which I couldn't attach due to it being 22MB, so the...
[13:39:03] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] kserve: add support for PSS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114012 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey)
[13:39:38] <wikibugs>	 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission db2140.codfw.wmnet - https://phabricator.wikimedia.org/T384480#10496900 (10FCeratto-WMF)
[13:40:47] <wikibugs>	 (03CR) 10Btullis: "If it's a temporary workaround, could we not add it to the search instance alone?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114327 (https://phabricator.wikimedia.org/T384805) (owner: 10DCausse)
[13:41:19] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] mw-content-history-reconcile-enrich: Use full kafka test fqdn [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114380 (https://phabricator.wikimedia.org/T375176) (owner: 10TChin)
[13:41:38] <wikibugs>	 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission db2140.codfw.wmnet - https://phabricator.wikimedia.org/T384480#10496903 (10FCeratto-WMF) The host is ready for the DC-Ops team to decommission.
[13:43:21] <wikibugs>	 (03CR) 10TChin: [C:03+2] mw-content-history-reconcile-enrich: Use full kafka test fqdn [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114380 (https://phabricator.wikimedia.org/T375176) (owner: 10TChin)
[13:43:42] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] "looks reasonable. The audit log should tell if you missed anything." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114016 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey)
[13:44:00] <wikibugs>	 (03PS1) 10Filippo Giunchedi: query_service: clean up icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/1114381 (https://phabricator.wikimedia.org/T358029)
[13:44:36] <wikibugs>	 (03Merged) 10jenkins-bot: mw-content-history-reconcile-enrich: Use full kafka test fqdn [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114380 (https://phabricator.wikimedia.org/T375176) (owner: 10TChin)
[13:46:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2182 (re)pooling @ 50%: Repooling T384801', diff saved to https://phabricator.wikimedia.org/P72483 and previous config saved to /var/cache/conftool/dbconfig/20250127-134605-root.json
[13:46:12] <stashbot>	 T384801: db2182 depooled (Errmsg: Error Index for table recentchanges is corrupt) - https://phabricator.wikimedia.org/T384801
[13:46:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T384592)', diff saved to https://phabricator.wikimedia.org/P72484 and previous config saved to /var/cache/conftool/dbconfig/20250127-134650-marostegui.json
[13:46:55] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[13:46:58] <wikibugs>	 (03PS1) 10Dreamrimmer: Changed default license for Wikinews to CC-BY-4.0 and for fawikinews and svwikinews to CC-BY-SA-4.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114382 (https://phabricator.wikimedia.org/T384614)
[13:47:26] <wikibugs>	 (03CR) 10DCausse: "I think the problem is not only affecting search, but all airflow instances on my side I'll never hit the 'Trigger DAG' again and rely on " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114327 (https://phabricator.wikimedia.org/T384805) (owner: 10DCausse)
[13:48:12] <wikibugs>	 (03PS1) 10Brouberol: envoy: define an mw-misc service mesh entry [puppet] - 10https://gerrit.wikimedia.org/r/1114383 (https://phabricator.wikimedia.org/T384329)
[13:49:48] <wikibugs>	 (03PS2) 10Brouberol: envoy: define an mw-misc service mesh entry [puppet] - 10https://gerrit.wikimedia.org/r/1114383 (https://phabricator.wikimedia.org/T384329)
[13:50:46] <wikibugs>	 (03PS1) 10Andrew Bogott: Updates for cloudcephosd1013: puppet 7 + Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1114384
[13:51:18] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Updates for cloudcephosd1013: puppet 7 + Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1114384 (owner: 10Andrew Bogott)
[13:53:03] <logmsgbot>	 !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1013.eqiad.wmnet with OS bullseye
[13:53:17] <wikibugs>	 (03CR) 10JMeybohm: envoy: define an mw-misc service mesh entry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1114383 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol)
[13:53:23] <logmsgbot>	 !log gmodena@deploy2002 Started deploy [airflow-dags/search@3c004c1]: syncing artifacts
[13:53:25] <logmsgbot>	 !log tchin@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[13:53:31] <logmsgbot>	 !log tchin@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[13:53:35] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114382 (https://phabricator.wikimedia.org/T384614) (owner: 10Dreamrimmer)
[13:53:45] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1013.eqiad.wmnet with OS bullseye
[13:53:52] <logmsgbot>	 !log gmodena@deploy2002 Finished deploy [airflow-dags/search@3c004c1]: syncing artifacts (duration: 01m 04s)
[13:56:17] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Yep, makes sense, nice catch and thanks for fixing it." [puppet] - 10https://gerrit.wikimedia.org/r/1114337 (owner: 10Vgutierrez)
[13:58:44] <wikibugs>	 (03CR) 10Ssingh: "Thanks for the patch! I propose that we either do this for all single-backend sites (profile::cache::varnish::frontend::single_backend: tr" [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall)
[13:59:47] <wikibugs>	 (03PS3) 10Brouberol: envoy: define an mw-misc service mesh entry [puppet] - 10https://gerrit.wikimedia.org/r/1114383 (https://phabricator.wikimedia.org/T384329)
[13:59:55] <wikibugs>	 (03CR) 10Brouberol: envoy: define an mw-misc service mesh entry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1114383 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol)
[14:00:13] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T1400).
[14:00:13] <jouncebot>	 toni_, anzx, and DreamRimmer: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:23] <wikibugs>	 (03CR) 10Michael Große: [C:03+1] [Growth] enwiki: Release Add Link to 10% of newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114379 (https://phabricator.wikimedia.org/T384551) (owner: 10Urbanecm)
[14:00:23] <anzx>	 o/
[14:00:33] <toni_>	 here
[14:01:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2182 (re)pooling @ 75%: Repooling T384801', diff saved to https://phabricator.wikimedia.org/P72485 and previous config saved to /var/cache/conftool/dbconfig/20250127-140111-root.json
[14:01:16] <stashbot>	 T384801: db2182 depooled (Errmsg: Error Index for table recentchanges is corrupt) - https://phabricator.wikimedia.org/T384801
[14:01:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P72486 and previous config saved to /var/cache/conftool/dbconfig/20250127-140157-marostegui.json
[14:02:56] <Lucas_WMDE>	 I can’t deploy today, sorry
[14:04:52] <wikibugs>	 (03PS1) 10Slyngshede: Upgrade to CAS 7.1 [dns] - 10https://gerrit.wikimedia.org/r/1114388
[14:06:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2025.codfw.wmnet with reason: host reimage
[14:07:23] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] envoy: define an mw-misc service mesh entry [puppet] - 10https://gerrit.wikimedia.org/r/1114383 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol)
[14:07:26] <zabe>	 I can
[14:07:40] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Add ios.article_link_interaction stream to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113996 (https://phabricator.wikimedia.org/T382031) (owner: 10Tsevener)
[14:07:56] <wikibugs>	 (03CR) 10Zabe: [C:03+2] srwiki: add incubator as importsource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114378 (https://phabricator.wikimedia.org/T384069) (owner: 10Anzx)
[14:08:30] <wikibugs>	 (03CR) 10Zabe: [C:03+2] enwiki: temporary lift of IP cap for 31 January and 1 February 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114372 (https://phabricator.wikimedia.org/T384680) (owner: 10Anzx)
[14:09:11] <wikibugs>	 (03Merged) 10jenkins-bot: Add ios.article_link_interaction stream to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113996 (https://phabricator.wikimedia.org/T382031) (owner: 10Tsevener)
[14:09:13] <wikibugs>	 (03Merged) 10jenkins-bot: srwiki: add incubator as importsource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114378 (https://phabricator.wikimedia.org/T384069) (owner: 10Anzx)
[14:09:15] <wikibugs>	 (03Merged) 10jenkins-bot: enwiki: temporary lift of IP cap for 31 January and 1 February 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114372 (https://phabricator.wikimedia.org/T384680) (owner: 10Anzx)
[14:09:52] <logmsgbot>	 !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1113996|Add ios.article_link_interaction stream to config (T382031)]], [[gerrit:1114378|srwiki: add incubator as importsource (T384069)]], [[gerrit:1114372|enwiki: temporary lift of IP cap for 31 January and 1 February 2025 (T384680)]]
[14:09:59] <stashbot>	 T382031: Track impressions for article views - https://phabricator.wikimedia.org/T382031
[14:10:00] <stashbot>	 T384069: Add an import source for "Special:Import" on sr.wiki - https://phabricator.wikimedia.org/T384069
[14:10:00] <stashbot>	 T384680: Requesting temporary lift of IP cap for 31 January and 1 February 2025 - https://phabricator.wikimedia.org/T384680
[14:10:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2025.codfw.wmnet with reason: host reimage
[14:10:38] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1013.eqiad.wmnet with reason: host reimage
[14:11:03] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] envoy: define an mw-misc service mesh entry [puppet] - 10https://gerrit.wikimedia.org/r/1114383 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol)
[14:12:25] <wikibugs>	 (03CR) 10Bartosz Dziewoński: [C:03+1] "Seems harmless if it helps you." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114351 (owner: 10Gergő Tisza)
[14:13:49] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1013.eqiad.wmnet with reason: host reimage
[14:14:15] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1114388 (owner: 10Slyngshede)
[14:16:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2182 (re)pooling @ 100%: Repooling T384801', diff saved to https://phabricator.wikimedia.org/P72487 and previous config saved to /var/cache/conftool/dbconfig/20250127-141616-root.json
[14:16:20] <wikibugs>	 (03CR) 10Ottomata: [C:03+1] "This is really cool.  We should add this to all analytics clients, including stat boxes!  That way airflow dev envs can use the same URLs " [puppet] - 10https://gerrit.wikimedia.org/r/1114376 (https://phabricator.wikimedia.org/T384329) (owner: 10Btullis)
[14:16:21] <stashbot>	 T384801: db2182 depooled (Errmsg: Error Index for table recentchanges is corrupt) - https://phabricator.wikimedia.org/T384801
[14:17:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P72488 and previous config saved to /var/cache/conftool/dbconfig/20250127-141704-marostegui.json
[14:17:49] <wikibugs>	 (03CR) 10Fabfur: liberica: Add katran config settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113961 (https://phabricator.wikimedia.org/T380450) (owner: 10Vgutierrez)
[14:18:03] <zabe>	 it is slow
[14:18:16] <toni_>	 np
[14:21:39] <logmsgbot>	 !log zabe@deploy2002 tsev, zabe, anzx: Backport for [[gerrit:1113996|Add ios.article_link_interaction stream to config (T382031)]], [[gerrit:1114378|srwiki: add incubator as importsource (T384069)]], [[gerrit:1114372|enwiki: temporary lift of IP cap for 31 January and 1 February 2025 (T384680)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:21:42] <zabe>	 toni_: anzx: can you test your patches?
[14:21:45] <stashbot>	 T382031: Track impressions for article views - https://phabricator.wikimedia.org/T382031
[14:21:45] <stashbot>	 T384069: Add an import source for "Special:Import" on sr.wiki - https://phabricator.wikimedia.org/T384069
[14:21:46] <stashbot>	 T384680: Requesting temporary lift of IP cap for 31 January and 1 February 2025 - https://phabricator.wikimedia.org/T384680
[14:21:52] <anzx>	 zabe: import source looks ok, nothing to test on throttle 
[14:22:36] <toni_>	 looks good to me
[14:22:39] <zabe>	 alright
[14:22:43] <logmsgbot>	 !log zabe@deploy2002 tsev, zabe, anzx: Continuing with sync
[14:24:49] <zabe>	 DreamRimmer: around?
[14:24:58] <DreamRimmer>	 yes
[14:25:34] <wikibugs>	 (03PS25) 10Bking: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson)
[14:27:12] <zabe>	 DreamRimmer: the rfc etc states pretty specific dates for the switch (1st of Feb / 30th of Jan). would you say it is okay to already do it today?
[14:28:09] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host parse[1018-1024].eqiad.wmnet
[14:28:18] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] wikikube: rename parse10[18-24] to wikikube-worker* [puppet] - 10https://gerrit.wikimedia.org/r/1114354 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková)
[14:28:37] <DreamRimmer>	 I don't see any issue
[14:29:22] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera: Fix lvs::realserver::pools config for text and upload [puppet] - 10https://gerrit.wikimedia.org/r/1114337 (owner: 10Vgutierrez)
[14:31:00] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1013.eqiad.wmnet with OS bullseye
[14:31:54] <wikibugs>	 (03PS1) 10Filippo Giunchedi: base: absent check_microcode [puppet] - 10https://gerrit.wikimedia.org/r/1114391 (https://phabricator.wikimedia.org/T350694)
[14:31:55] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Changed default license for Wikinews to CC-BY-4.0 and for fawikinews and svwikinews to CC-BY-SA-4.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114382 (https://phabricator.wikimedia.org/T384614) (owner: 10Dreamrimmer)
[14:32:03] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Increase revision-slots cache expiry back to default for most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114060 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe)
[14:32:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T384592)', diff saved to https://phabricator.wikimedia.org/P72489 and previous config saved to /var/cache/conftool/dbconfig/20250127-143211-marostegui.json
[14:32:16] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[14:32:20] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host parse[1018-1024].eqiad.wmnet
[14:32:26] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1245.eqiad.wmnet with reason: Maintenance
[14:32:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2025.codfw.wmnet with OS bookworm
[14:32:52] <wikibugs>	 (03Merged) 10jenkins-bot: Changed default license for Wikinews to CC-BY-4.0 and for fawikinews and svwikinews to CC-BY-SA-4.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114382 (https://phabricator.wikimedia.org/T384614) (owner: 10Dreamrimmer)
[14:32:59] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10497129 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2025.codfw.wmnet with OS bookworm completed: - ganeti202...
[14:33:07] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1018 to wikikube-worker1159
[14:33:27] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[14:33:40] <wikibugs>	 (03Merged) 10jenkins-bot: Increase revision-slots cache expiry back to default for most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114060 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe)
[14:34:02] <wikibugs>	 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission db2140.codfw.wmnet - https://phabricator.wikimedia.org/T384480#10497132 (10Marostegui) a:05FCeratto-WMF→03None
[14:34:27] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv
[14:34:27] <icinga-wm>	 e - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:34:27] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv
[14:34:27] <icinga-wm>	 e - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:35:30] <wikibugs>	 (03PS1) 10Fabfur: hiera: enable haproxykafka on all datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1114392 (https://phabricator.wikimedia.org/T378578)
[14:35:50] <wikibugs>	 (03CR) 10CI reject: [V:04-1] hiera: enable haproxykafka on all datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1114392 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur)
[14:35:53] <wikibugs>	 (03PS2) 10Scott French: mw-on-k8s: aggregate remaining alerts by release name [alerts] - 10https://gerrit.wikimedia.org/r/1114018 (https://phabricator.wikimedia.org/T384532)
[14:35:57] <wikibugs>	 (03CR) 10Fabfur: [C:04-1] "Do not merge until 28/01/2025" [puppet] - 10https://gerrit.wikimedia.org/r/1114392 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur)
[14:36:36] <wikibugs>	 (03PS2) 10Fabfur: hiera: enable haproxykafka on all datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1114392 (https://phabricator.wikimedia.org/T378578)
[14:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:36:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet
[14:36:58] <wikibugs>	 (03CR) 10CI reject: [V:04-1] hiera: enable haproxykafka on all datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1114392 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur)
[14:37:27] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.gitlab.reboot-runner rolling reboot on A:gitlab-runner
[14:37:29] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1018 to wikikube-worker1159 - kamila@cumin1002"
[14:37:38] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1019 to wikikube-worker1160
[14:37:44] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1018 to wikikube-worker1159 - kamila@cumin1002"
[14:37:44] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:37:44] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1159
[14:37:45] <wikibugs>	 (03PS3) 10Fabfur: hiera: enable haproxykafka on all datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1114392 (https://phabricator.wikimedia.org/T378578)
[14:37:58] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[14:38:01] <zabe>	 for some reason the number of k8s nodes left is increasing
[14:38:04] <zabe>	 curious
[14:38:34] <wikibugs>	 (03CR) 10Fabfur: [C:04-1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114392 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur)
[14:38:46] <wikibugs>	 (03CR) 10Klausman: [C:03+1] "Thank you!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114016 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey)
[14:38:47] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1159
[14:39:25] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1018 to wikikube-worker1159
[14:39:43] <zabe>	 ok, aborting
[14:40:19] <logmsgbot>	 !log zabe@deploy2002 Started scap sync-world: T384614 T183490
[14:40:25] <stashbot>	 T384614: Change of default license for Wikinews to CC-BY-4.0 and for fawikinews and svwikinews to CC-BY-SA-4.0 on January 30, 2025 - https://phabricator.wikimedia.org/T384614
[14:40:25] <stashbot>	 T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490
[14:41:44] <wikibugs>	 (03CR) 10Xcollazo: "Excuse my ignorance, but will this also allow us to hit endpoints like "https://noc.wikimedia.org/conf/dblists/open.dblist" ?" [puppet] - 10https://gerrit.wikimedia.org/r/1114376 (https://phabricator.wikimedia.org/T384329) (owner: 10Btullis)
[14:43:09] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1019 to wikikube-worker1160 - kamila@cumin1002"
[14:43:25] <zabe>	 DreamRimmer: can you test your change?
[14:43:30] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1020 to wikikube-worker1161
[14:43:35] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1019 to wikikube-worker1160 - kamila@cumin1002"
[14:43:35] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:43:36] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1160
[14:43:41] <DreamRimmer>	 checking
[14:43:50] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[14:44:02] <logmsgbot>	 !log zabe@deploy2002 zabe: T384614 T183490 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:44:40] <jinxer-wm>	 FIRING: [3x] KubernetesRsyslogDown: rsyslog on parse1021:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[14:45:11] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1160
[14:45:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet
[14:45:33] <DreamRimmer>	 looks good to me
[14:45:37] <zabe>	 alright
[14:45:38] <logmsgbot>	 !log zabe@deploy2002 zabe: Continuing with sync
[14:45:50] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1019 to wikikube-worker1160
[14:47:21] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1020 to wikikube-worker1161 - kamila@cumin1002"
[14:47:33] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1021 to wikikube-worker1162
[14:47:37] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1020 to wikikube-worker1161 - kamila@cumin1002"
[14:47:37] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:47:37] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1161
[14:47:53] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[14:48:43] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1161
[14:49:22] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1020 to wikikube-worker1161
[14:49:40] <jinxer-wm>	 FIRING: [3x] KubernetesRsyslogDown: rsyslog on parse1022:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[14:50:05] <logmsgbot>	 !log tchin@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[14:50:11] <logmsgbot>	 !log tchin@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[14:50:13] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090529 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson)
[14:50:20] <wikibugs>	 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[345] - https://phabricator.wikimedia.org/T384838 (10RobH) 03NEW
[14:50:47] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[14:51:26] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1021 to wikikube-worker1162 - kamila@cumin1002"
[14:51:35] <logmsgbot>	 !log tchin@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[14:51:38] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1022 to wikikube-worker1163
[14:51:38] <logmsgbot>	 !log tchin@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[14:51:43] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1021 to wikikube-worker1162 - kamila@cumin1002"
[14:51:43] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:51:43] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1162
[14:51:46] <toni_>	 sorry, connection dropped. Looks good in prod, thanks for deploying zabe
[14:51:58] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[14:52:09] <zabe>	 yw
[14:52:12] <wikibugs>	 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[345] - https://phabricator.wikimedia.org/T384838#10497221 (10RobH) a:03MoritzMuehlenhoff @MoritzMuehlenhoff,  I didn't want to hold up the ordering of parent task T382898 so I've escalated that (with Joanna's approval) and...
[14:52:49] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1162
[14:52:53] <logmsgbot>	 !log tchin@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[14:52:58] <logmsgbot>	 !log tchin@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[14:53:01] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Revert "webperf: Restrict access to Envoy port" [puppet] - 10https://gerrit.wikimedia.org/r/1114393
[14:53:28] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1021 to wikikube-worker1162
[14:53:42] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10497248 (10Papaul) @JMeybohm can we do this today? if not please let me know when will be a good d...
[14:54:01] <wikibugs>	 (03PS26) 10Bking: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson)
[14:54:11] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090529 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson)
[14:54:39] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4863/console" [puppet] - 10https://gerrit.wikimedia.org/r/1103318 (owner: 10Muehlenhoff)
[14:55:24] <wikibugs>	 (03PS27) 10Bking: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson)
[14:55:41] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1022 to wikikube-worker1163 - kamila@cumin1002"
[14:55:44] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1023 to wikikube-worker1164
[14:55:54] <wikibugs>	 (03PS1) 10Phuedx: testwiki: Enable MetricsPlatform experiment enrollment overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114396 (https://phabricator.wikimedia.org/T384728)
[14:55:56] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1022 to wikikube-worker1163 - kamila@cumin1002"
[14:55:57] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:55:57] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1163
[14:56:03] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[14:56:11] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4864/co" [puppet] - 10https://gerrit.wikimedia.org/r/1114393 (owner: 10Giuseppe Lavagetto)
[14:56:27] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T383620#10497264 (10kamila) a:03VRiley-WMF
[14:56:57] <wikibugs>	 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[3-8] - https://phabricator.wikimedia.org/T384838#10497267 (10RobH)
[14:57:26] <logmsgbot>	 !log zabe@deploy2002 sync-world aborted: T384614 T183490 (duration: 17m 07s)
[14:57:29] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1163
[14:57:32] <stashbot>	 T384614: Change of default license for Wikinews to CC-BY-4.0 and for fawikinews and svwikinews to CC-BY-SA-4.0 on January 30, 2025 - https://phabricator.wikimedia.org/T384614
[14:57:32] <stashbot>	 T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490
[14:57:52] <wikibugs>	 (03CR) 10Máté Szabó: [C:03+1] Revert "webperf: Restrict access to Envoy port" [puppet] - 10https://gerrit.wikimedia.org/r/1114393 (owner: 10Giuseppe Lavagetto)
[14:58:08] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1022 to wikikube-worker1163
[14:58:29] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090529 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson)
[14:58:56] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Revert "webperf: Restrict access to Envoy port" [puppet] - 10https://gerrit.wikimedia.org/r/1114393 (https://phabricator.wikimedia.org/T384836)
[14:59:16] <zabe>	 ok, it formally aborted, but it reached all k8s nodes
[14:59:35] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1023 to wikikube-worker1164 - kamila@cumin1002"
[14:59:40] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+2] Revert "webperf: Restrict access to Envoy port" [puppet] - 10https://gerrit.wikimedia.org/r/1114393 (https://phabricator.wikimedia.org/T384836) (owner: 10Giuseppe Lavagetto)
[14:59:43] <swfrench-wmf>	 zabe: did the k8s production update part time out?
[14:59:50] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1024 to wikikube-worker1165
[14:59:56] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1023 to wikikube-worker1164 - kamila@cumin1002"
[14:59:56] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:59:56] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1164
[15:00:01] <swfrench-wmf>	 if so, I have a theory as to why, which I'll follow up on shortly
[15:00:11] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[15:00:12] <jouncebot>	 swfrench-wmf and effie: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki infrastructure (UTC afternoon, one off) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T1500).
[15:01:17] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1164
[15:01:56] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1023 to wikikube-worker1164
[15:02:31] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] thanos: send sigkill as needed to stateless components [puppet] - 10https://gerrit.wikimedia.org/r/1114336 (https://phabricator.wikimedia.org/T383570) (owner: 10Filippo Giunchedi)
[15:03:12] <wikibugs>	 (03PS1) 10Hnowlan: fc-list: update font list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114398 (https://phabricator.wikimedia.org/T280718)
[15:03:44] <swfrench-wmf>	 I'm here, and will get started in the next 10-15 minutes
[15:04:01] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1024 to wikikube-worker1165 - kamila@cumin1002"
[15:04:35] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1024 to wikikube-worker1165 - kamila@cumin1002"
[15:04:35] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:04:36] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1165
[15:06:01] <zabe>	 swfrench-wmf: not sure, the number of remaining k8s nodes went to almost 0 and then started growing again
[15:06:22] <zabe>	 so my patches are probably not 100% deployed, but maybe like 98+%
[15:06:33] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1165
[15:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:07:04] <zabe>	 if you want to, I can revert them, but on the other hand I would prefer if we could just try to fix that with another sync
[15:07:13] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1024 to wikikube-worker1165
[15:07:13] <swfrench-wmf>	 zabe: got it, thank you! yeah, I think that would be consistent with a timeout for one specific subset of the k8s deployments. indeed you're right though that the primary ones are fully updated.
[15:07:19] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1159.eqiad.wmnet wikikube-worker1160.eqiad.wmnet wikikube-worker1161.eqiad.wmnet wikikube-worker1162.eqiad.wmnet wikikube-worker1163.eqiad.wmnet wikikube-worker1164.eqiad.wmnet wikikube-worker1165.eqiad.wmnet on all recursors
[15:07:23] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1159.eqiad.wmnet wikikube-worker1160.eqiad.wmnet wikikube-worker1161.eqiad.wmnet wikikube-worker1162.eqiad.wmnet wikikube-worker1163.eqiad.wmnet wikikube-worker1164.eqiad.wmnet wikikube-worker1165.eqiad.wmnet on all recursors
[15:07:45] <swfrench-wmf>	 zabe: yeah, no need to revert - I'll take it from here :)
[15:07:57] <zabe>	 okay thanks:)
[15:08:37] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:08:40] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114352 (https://phabricator.wikimedia.org/T373027) (owner: 10Vgutierrez)
[15:09:38] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-ctrl[1002-1003].eqiad.wmnet
[15:09:48] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10497343 (10ops-monitoring-bot) depool host wikikube-ctrl[1002-1003].eqiad.wmnet by jayme@cumin1002...
[15:09:52] <logmsgbot>	 !log jayme@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl[1002-1003].eqiad.wmnet with reason: Depooled via sre.k8s.pool-depool-node
[15:09:55] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-ctrl[1002-1003].eqiad.wmnet
[15:10:00] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db2157.codfw.wmnet with reason: Maintenance
[15:10:01] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10497344 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin1...
[15:10:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2157 (T384592)', diff saved to https://phabricator.wikimedia.org/P72490 and previous config saved to /var/cache/conftool/dbconfig/20250127-151007-marostegui.json
[15:10:12] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[15:10:28] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1160.eqiad.wmnet with OS bookworm
[15:10:31] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1160
[15:10:31] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1160
[15:10:34] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1161.eqiad.wmnet with OS bookworm
[15:10:37] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1161
[15:10:37] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1161
[15:10:45] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1162.eqiad.wmnet with OS bookworm
[15:10:49] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1162
[15:10:49] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1162
[15:10:59] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1163.eqiad.wmnet with OS bookworm
[15:11:02] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1163
[15:11:02] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1163
[15:11:03] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.dns.netbox
[15:11:04] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1164.eqiad.wmnet with OS bookworm
[15:11:07] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1164
[15:11:07] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1164
[15:11:08] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1165.eqiad.wmnet with OS bookworm
[15:11:11] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1165
[15:11:11] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1165
[15:11:34] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1159.eqiad.wmnet with OS bookworm
[15:11:38] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1159
[15:11:38] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1159
[15:12:27] <wikibugs>	 (03PS1) 10Scott French: mw-(api-ext|web): temporarily increase maxUnavailable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114400 (https://phabricator.wikimedia.org/T383845)
[15:14:22] <wikibugs>	 (03PS1) 10AikoChou: ml-services: update reference-quality storage uri [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114401 (https://phabricator.wikimedia.org/T384172)
[15:15:01] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add v6 cloud-private address for cloudlb2003-dev - taavi@cumin1002"
[15:15:05] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add v6 cloud-private address for cloudlb2003-dev - taavi@cumin1002"
[15:15:05] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:15:14] <wikibugs>	 (03PS28) 10Bking: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson)
[15:15:22] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090529 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson)
[15:15:24] <icinga-wm>	 PROBLEM - Etcd cluster health on wikikube-ctrl1004 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[15:15:34] <icinga-wm>	 PROBLEM - Etcd cluster health on wikikube-ctrl1001 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[15:16:27] <jayme>	 this is expected
[15:16:41] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.reboot-runner (exit_code=0) rolling reboot on A:gitlab-runner
[15:16:41] <swfrench-wmf>	 jayme: thank you - was just about to ask :)
[15:16:48] <jayme>	 although not anticipated
[15:16:58] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] mw-(api-ext|web): temporarily increase maxUnavailable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114400 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[15:18:32] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.dns.wipe-cache cloudlb2003-dev.private.codfw.wikimedia.cloud on all recursors
[15:18:35] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudlb2003-dev.private.codfw.wikimedia.cloud on all recursors
[15:18:51] <jinxer-wm>	 FIRING: [2x] KubernetesAPINotScrapable: k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[15:19:27] <wikibugs>	 (03CR) 10Scott French: [C:03+2] mw-(api-ext|web): temporarily increase maxUnavailable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114400 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[15:19:28] <jayme>	 swfrench-wmf: I'm mistaken...
[15:19:32] <jinxer-wm>	 FIRING: KubernetesCalicoDown: wikikube-worker1304.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1304.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[15:20:12] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service wikikube-ctrl1001:6443 has failed probes (http_eqiad_kube_apiserver_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:20:22] <swfrench-wmf>	 !incidents
[15:20:23] <sirenbot>	 5636 (UNACKED)  [4x] ProbeDown sre (probes/custom eqiad)
[15:20:23] <sirenbot>	 5635 (RESOLVED)  db2182 (paged)/MariaDB Replica SQL: s7 (paged)
[15:20:23] <sirenbot>	 5634 (RESOLVED)  db1241 (paged)/MariaDB Replica SQL: s4 (paged)
[15:20:29] <swfrench-wmf>	 !ack 5636
[15:20:30] <sirenbot>	 5636 (ACKED)  [4x] ProbeDown sre (probes/custom eqiad)
[15:20:33] <jinxer-wm>	 FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[15:20:35] <jayme>	 swfrench-wmf: this is me
[15:20:39] <jayme>	 shit
[15:20:50] <_joe_>	 something's paging
[15:20:52] <wikibugs>	 (03Merged) 10jenkins-bot: mw-(api-ext|web): temporarily increase maxUnavailable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114400 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[15:21:01] <_joe_>	 jayme: is that you?
[15:21:09] <swfrench-wmf>	 jayme: thanks! yeah, let me know how I can help. in the meantime, I'm starting to look in parallel
[15:21:25] <jayme>	 yeah it's me
[15:21:33] <_joe_>	 swfrench-wmf: isn't this just the kube controller going down?
[15:21:34] <jayme>	 not etcd is blocking
[15:22:05] <_joe_>	 jayme: come again?
[15:22:10] <swfrench-wmf>	 ah, yeah I assumed this was the etcd issue causing that
[15:22:12] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#10497387 (10VRiley-WMF)
[15:22:58] <_joe_>	 yeah sorry I was not looking at IRC at the moment
[15:23:23] <mutante>	 here
[15:23:27] <_joe_>	 do we need an incident doc? 
[15:23:35] <_joe_>	 I don't think so, right?
[15:23:52] <jayme>	 I took down ctrl nodes in eqiad, expecting 2 remaining to be okay...they are about to be back
[15:23:53] <wikibugs>	 (03PS5) 10Bking: opensearch: Introduce resource for keystore values [puppet] - 10https://gerrit.wikimedia.org/r/1091325 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson)
[15:23:56] <_joe_>	 swfrench-wmf: I would prepare to depool eqiad services
[15:24:06] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1091325 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson)
[15:24:14] <icinga-wm>	 RECOVERY - Host ripe-atlas-eqiad is UP: PING WARNING - Packet loss = 50%, RTA = 0.62 ms
[15:24:18] <jayme>	 yes please
[15:24:25] <jayme>	 but give it another minute
[15:25:04] <swfrench-wmf>	 _joe_: jayme: ack, yeah I will not touch that yet, but will start sorting the logistics
[15:25:24] <icinga-wm>	 RECOVERY - Etcd cluster health on wikikube-ctrl1004 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd
[15:25:32] <swfrench-wmf>	 \o/
[15:25:34] <icinga-wm>	 RECOVERY - Etcd cluster health on wikikube-ctrl1001 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd
[15:25:46] <_joe_>	 this happened because I logged into one node
[15:25:51] <_joe_>	 etcd fears me
[15:25:52] <_joe_>	 :P
[15:26:02] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1160.eqiad.wmnet with reason: host reimage
[15:26:04] <swfrench-wmf>	 :)
[15:26:10] <jinxer-wm>	 RESOLVED: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1277:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[15:26:25] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1161.eqiad.wmnet with reason: host reimage
[15:26:33] <swfrench-wmf>	 alright, I see API operations succeeding again
[15:26:36] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1162.eqiad.wmnet with reason: host reimage
[15:26:45] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1165.eqiad.wmnet with reason: host reimage
[15:26:52] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1163.eqiad.wmnet with reason: host reimage
[15:27:01] <_joe_>	 yeah crisis averted
[15:27:05] <jayme>	 swfrench-wmf: _joe_: should be good
[15:27:17] <swfrench-wmf>	 jayme: great, thank you!
[15:27:19] <_joe_>	 to be clear, I wasn't suggesting to already depool eqiad, but just to be ready to :)
[15:27:36] <jayme>	 well...100% my fault so please don't thank me :|
[15:27:38] <swfrench-wmf>	 curious ... how did taking down a single control plane node do that?
[15:27:42] <wikibugs>	 (03CR) 10Bking: [C:03+2] opensearch: Introduce resource for keystore values [puppet] - 10https://gerrit.wikimedia.org/r/1091325 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson)
[15:27:54] <wikibugs>	 (03CR) 10DCausse: [C:03+1] "let's figure out how to do proper sanity checks in a separate ticket" [puppet] - 10https://gerrit.wikimedia.org/r/1091325 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson)
[15:27:58] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1159.eqiad.wmnet with reason: host reimage
[15:28:07] <swfrench-wmf>	 _joe_: yeah, totally - it was a good moment to start considering the "checklist" so to speak, though :)
[15:28:29] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] wmflib,pybal: Add scheduler_flag support [puppet] - 10https://gerrit.wikimedia.org/r/1114352 (https://phabricator.wikimedia.org/T373027) (owner: 10Vgutierrez)
[15:28:51] <jinxer-wm>	 RESOLVED: [2x] KubernetesAPINotScrapable: k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[15:29:32] <jinxer-wm>	 RESOLVED: KubernetesCalicoDown: wikikube-worker1304.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1304.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[15:29:47] <wikibugs>	 (03CR) 10Bking: [C:03+2] opensearch: Add resource to define cross-cluster settings [puppet] - 10https://gerrit.wikimedia.org/r/1091326 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson)
[15:29:55] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:30:01] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:30:02] <wikibugs>	 (03CR) 10Bking: [C:03+2] opensearch: Add resource to log busy threads [puppet] - 10https://gerrit.wikimedia.org/r/1091327 (https://phabricator.wikimedia.org/T380752) (owner: 10Ebernhardson)
[15:30:12] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service wikikube-ctrl1001:6443 has failed probes (http_eqiad_kube_apiserver_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:30:23] <wikibugs>	 (03CR) 10Ottomata: [C:03+2] beta EventStreamConfig - set eventgate hoist_fields_from_http_headers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114149 (https://phabricator.wikimedia.org/T382173) (owner: 10Ottomata)
[15:30:32] <jinxer-wm>	 RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[15:30:37] <icinga-wm>	 PROBLEM - Host ripe-atlas-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[15:30:42] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.dns.netbox
[15:31:06] <wikibugs>	 (03Merged) 10jenkins-bot: beta EventStreamConfig - set eventgate hoist_fields_from_http_headers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114149 (https://phabricator.wikimedia.org/T382173) (owner: 10Ottomata)
[15:32:25] <icinga-wm>	 PROBLEM - Etcd cluster health on wikikube-ctrl1004 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[15:32:39] <icinga-wm>	 PROBLEM - Etcd cluster health on wikikube-ctrl1001 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[15:32:50] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1160.eqiad.wmnet with reason: host reimage
[15:32:51] <jayme>	 the heck...
[15:32:57] <jayme>	 swfrench-wmf: problem still
[15:33:15] <ottomata>	 swfrench-wmf: i just merged a mw-config change in InitialiseSettings-labs.php.  
[15:33:15] <ottomata>	 I don't need to scap deploy it in prod at all, but was going to do so just for good practice.  Y'all look busy(!) so perhaps I should skip this step?
[15:33:45] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:33:51] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.205 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:34:14] <swfrench-wmf>	 jayme: ack, thanks - holding. lemme know if you need more hands.
[15:34:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T384592)', diff saved to https://phabricator.wikimedia.org/P72491 and previous config saved to /var/cache/conftool/dbconfig/20250127-153435-marostegui.json
[15:34:40] <effie>	 ottomata: it would be great if next time you would do it during a mediawiki backport window, since, in theory, we would be using this one for an infra deployment 
[15:34:41] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[15:34:50] <jayme>	 swfrench-wmf:  I'm in touch with dcops ... the nodes had network cables switched
[15:35:12] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service wikikube-ctrl1001:6443 has failed probes (http_eqiad_kube_apiserver_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:35:17] <swfrench-wmf>	 jayme: ah, that's fun
[15:35:22] <swfrench-wmf>	 !incidents
[15:35:23] <sirenbot>	 5637 (UNACKED)  [4x] ProbeDown sre (probes/custom eqiad)
[15:35:23] <sirenbot>	 5636 (RESOLVED)  [4x] ProbeDown sre (probes/custom eqiad)
[15:35:24] <sirenbot>	 5635 (RESOLVED)  db2182 (paged)/MariaDB Replica SQL: s7 (paged)
[15:35:24] <sirenbot>	 5634 (RESOLVED)  db1241 (paged)/MariaDB Replica SQL: s4 (paged)
[15:35:27] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add v6 cloud-private address for cloudlb2002-dev - taavi@cumin1002"
[15:35:29] <swfrench-wmf>	 !ack 5637
[15:35:30] <sirenbot>	 5637 (ACKED)  [4x] ProbeDown sre (probes/custom eqiad)
[15:35:31] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add v6 cloud-private address for cloudlb2002-dev - taavi@cumin1002"
[15:35:31] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:35:47] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.dns.wipe-cache cloudlb2002-dev.private.codfw.wikimedia.cloud on all recursors
[15:35:51] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudlb2002-dev.private.codfw.wikimedia.cloud on all recursors
[15:35:53] <ottomata>	 effie:  i'm sorry!  you are right.  I can revert if you prefer!  I proceeded since it was just beta, but then asked in slack and realized good practice is to deploy -labs.php files in prod too, even though they are not used there.
[15:36:25] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1161.eqiad.wmnet with reason: host reimage
[15:36:27] <topranks>	 is the latest alert same issue as previous one?
[15:36:46] <_joe_>	 ottomata: wait for our green light, then deploy
[15:36:51] <jinxer-wm>	 FIRING: [3x] KubernetesAPINotScrapable: k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[15:37:01] <ottomata>	 _joe_:  okay, will wait. ty
[15:37:03] <swfrench-wmf>	 topranks: it sounds like at least in part? though I'm not sure about the details
[15:37:40] <jayme>	 topranks: yes
[15:37:55] <_joe_>	 jayme: let us know what's going on / if we can help
[15:38:19] <swfrench-wmf>	 jayme: is dc ops taking action that would resolve this, or do we need to do something?
[15:38:33] <jinxer-wm>	 FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[15:39:00] <jayme>	 dcops is aware and trying to fix
[15:39:16] <jayme>	 not sure why we lost connectivity again to ctrl1003
[15:39:23] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1162.eqiad.wmnet with reason: host reimage
[15:39:55] <swfrench-wmf>	 jayme: ack, thanks
[15:42:41] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1159.eqiad.wmnet with reason: host reimage
[15:43:10] <jinxer-wm>	 FIRING: [4x] KubernetesRsyslogDown: rsyslog on wikikube-worker1078:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[15:43:25] <icinga-wm>	 RECOVERY - Etcd cluster health on wikikube-ctrl1004 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd
[15:43:41] <icinga-wm>	 RECOVERY - Etcd cluster health on wikikube-ctrl1001 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd
[15:44:14] * swfrench-wmf is cautiously optimistic
[15:44:51] <hnowlan>	 k8s api calls working in eqiad 
[15:44:55] <swfrench-wmf>	 alright, API operations are back
[15:45:01] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#10497449 (VRiley-WMF)
[15:45:12] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service wikikube-ctrl1001:6443 has failed probes (http_eqiad_kube_apiserver_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:45:49] <jayme>	 ctrl1002 is back, 1003 still unreachable
[15:45:50] <swfrench-wmf>	 jayme: do you need coordination assistance? e.g., would a doc help here (I can IC)
[15:45:53] <wikibugs>	 (PS29) Bking: Transition relforge to OpenSearch [puppet] - https://gerrit.wikimedia.org/r/1090529 (https://phabricator.wikimedia.org/T380752) (owner: Ebernhardson)
[15:45:59] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#10497468 (VRiley-WMF)
[15:46:01] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1165.eqiad.wmnet with reason: host reimage
[15:46:08] <jayme>	 swfrench-wmf: no, thanks. Should be "good" now
[15:46:43] <wikibugs>	 (PS1) Ottomata: beta wgEventStreams - set hoist_fields_from_http_headers on metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1114409 (https://phabricator.wikimedia.org/T382173)
[15:46:51] <jinxer-wm>	 RESOLVED: [3x] KubernetesAPINotScrapable: k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[15:47:22] <swfrench-wmf>	 jayme: got it, thank you! so 1 of 3 nodes is still unavailable, presumably due to a network issue IIUC?
[15:48:06] <jayme>	 yes, but that one is back up now as well
[15:48:10] <jinxer-wm>	 RESOLVED: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1078:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[15:48:31] <wikibugs>	 SRE, Deployments, Release-Engineering-Team, Patch-For-Review: Reduce automatic messages on #wikimedia-operations - https://phabricator.wikimedia.org/T384804#10497480 (Dzahn) Whatever we end up doing, let's resist the temptation to create yet another "-feed" channel (that few look at) because that...
[15:48:33] <jinxer-wm>	 RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[15:48:34] <swfrench-wmf>	 awesome
[15:48:49] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[15:49:29] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1163.eqiad.wmnet with reason: host reimage
[15:49:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72493 and previous config saved to /var/cache/conftool/dbconfig/20250127-154933-root.json
[15:49:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P72494 and previous config saved to /var/cache/conftool/dbconfig/20250127-154942-marostegui.json
[15:50:42] <jayme>	 swfrench-wmf: etcd is all happy again
[15:51:22] <swfrench-wmf>	 jayme: awesome, thank you for confirming!
[15:51:31] <jinxer-wm>	 FIRING: [5x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[15:51:39] <swfrench-wmf>	 e.ffie and I will venture a backport deployment shortly, then
[15:52:24] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  frnetmon1002 - vriley@cumin1002"
[15:52:29] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  frnetmon1002 - vriley@cumin1002"
[15:52:29] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:53:07] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1160.eqiad.wmnet with OS bookworm
[15:54:15] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1161.eqiad.wmnet with OS bookworm
[15:56:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2194 T384807', diff saved to https://phabricator.wikimedia.org/P72495 and previous config saved to /var/cache/conftool/dbconfig/20250127-155613-marostegui.json
[15:56:18] <stashbot>	 T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[15:56:31] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2194.codfw.wmnet
[15:57:52] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1113566 (https://phabricator.wikimedia.org/T383845) (owner: Scott French)
[15:58:10] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1162.eqiad.wmnet with OS bookworm
[15:58:38] <wikibugs>	 (Merged) jenkins-bot: Enroll 0.1% of client sessions in PHP 8.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1113566 (https://phabricator.wikimedia.org/T383845) (owner: Scott French)
[16:00:10] <logmsgbot>	 !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1113566|Enroll 0.1% of client sessions in PHP 8.1 (T383845)]]
[16:00:14] <stashbot>	 T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[16:00:31] <icinga-wm>	 RECOVERY - Disk space on ml-lab1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops
[16:00:43] <effie>	 ottomata: we will deploy your patch too as we are scap backporting
[16:01:35] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2194.codfw.wmnet
[16:01:52] <wikibugs>	 (CR) Klausman: [V:+2 C:+2] kserve: add support for PSS [deployment-charts] - https://gerrit.wikimedia.org/r/1114012 (https://phabricator.wikimedia.org/T369493) (owner: Elukey)
[16:01:56] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1159.eqiad.wmnet with OS bookworm
[16:02:31] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2194.codfw.wmnet with reason: Index rebuild
[16:02:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P72496 and previous config saved to /var/cache/conftool/dbconfig/20250127-160237-root.json
[16:02:51] <icinga-wm>	 PROBLEM - Host ganeti2020 is DOWN: PING CRITICAL - Packet loss = 100%
[16:03:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2194', diff saved to https://phabricator.wikimedia.org/P72497 and previous config saved to /var/cache/conftool/dbconfig/20250127-160300-marostegui.json
[16:04:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72498 and previous config saved to /var/cache/conftool/dbconfig/20250127-160438-root.json
[16:04:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P72499 and previous config saved to /var/cache/conftool/dbconfig/20250127-160449-marostegui.json
[16:05:01] <logmsgbot>	 !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1113566|Enroll 0.1% of client sessions in PHP 8.1 (T383845)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[16:05:19] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1165.eqiad.wmnet with OS bookworm
[16:06:06] <wikibugs>	 (Merged) jenkins-bot: kserve: add support for PSS [deployment-charts] - https://gerrit.wikimedia.org/r/1114012 (https://phabricator.wikimedia.org/T369493) (owner: Elukey)
[16:07:45] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2194.codfw.wmnet with reason: Index rebuild T384807
[16:07:49] <stashbot>	 T384807: Upgrade and rebuild s3 - https://phabricator.wikimedia.org/T384807
[16:08:35] <logmsgbot>	 !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1164.eqiad.wmnet with OS bookworm
[16:08:37] <jinxer-wm>	 FIRING: ProbeDown: Service ganeti2020:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:08:51] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1163.eqiad.wmnet with OS bookworm
[16:09:47] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1164.eqiad.wmnet with OS bookworm
[16:09:51] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1164
[16:09:52] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1164
[16:11:41] <logmsgbot>	 !log swfrench@deploy2002 swfrench: Continuing with sync
[16:12:38] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2020
[16:12:47] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2020
[16:13:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2025.codfw.wmnet to cluster codfw and group D
[16:15:59] <icinga-wm>	 RECOVERY - Host ganeti2020 is UP: PING OK - Packet loss = 0%, RTA = 30.49 ms
[16:16:07] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2025.codfw.wmnet to cluster codfw and group D
[16:18:21] <logmsgbot>	 !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1113566|Enroll 0.1% of client sessions in PHP 8.1 (T383845)]] (duration: 18m 11s)
[16:18:24] <wikibugs>	 (PS1) Klausman: admin_ng/values/ml-staging: add cluster_group [deployment-charts] - https://gerrit.wikimedia.org/r/1114414 (https://phabricator.wikimedia.org/T369493)
[16:18:26] <stashbot>	 T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[16:18:38] <jinxer-wm>	 RESOLVED: ProbeDown: Service ganeti2020:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:19:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 1%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P72501 and previous config saved to /var/cache/conftool/dbconfig/20250127-161932-root.json
[16:19:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72502 and previous config saved to /var/cache/conftool/dbconfig/20250127-161944-root.json
[16:19:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T384592)', diff saved to https://phabricator.wikimedia.org/P72503 and previous config saved to /var/cache/conftool/dbconfig/20250127-161956-marostegui.json
[16:20:01] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[16:20:03] <wikibugs>	 (PS1) Fabfur: hiera: enable haproxykafka on codfw [puppet] - https://gerrit.wikimedia.org/r/1114415 (https://phabricator.wikimedia.org/T378578)
[16:20:12] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db2171.codfw.wmnet with reason: Maintenance
[16:20:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2171 (T384592)', diff saved to https://phabricator.wikimedia.org/P72504 and previous config saved to /var/cache/conftool/dbconfig/20250127-162018-marostegui.json
[16:21:05] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1114415 (https://phabricator.wikimedia.org/T378578) (owner: Fabfur)
[16:23:33] <wikibugs>	 (PS1) Fabfur: hiera: enable haproxykafka on eqiad [puppet] - https://gerrit.wikimedia.org/r/1114417 (https://phabricator.wikimedia.org/T378578)
[16:25:28] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1164.eqiad.wmnet with reason: host reimage
[16:26:07] <wikibugs>	 (Abandoned) Fabfur: hiera: enable haproxykafka on all datacenters [puppet] - https://gerrit.wikimedia.org/r/1114392 (https://phabricator.wikimedia.org/T378578) (owner: Fabfur)
[16:28:56] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1164.eqiad.wmnet with reason: host reimage
[16:30:05] <jouncebot>	 jan_drewniak: Time to do the Wikimedia Portals Update deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T1630).
[16:30:24] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1114417 (https://phabricator.wikimedia.org/T378578) (owner: Fabfur)
[16:30:40] <effie>	 swfrench-wm.f and I will be using the Wikimedia Portals Update deploy window, folks
[16:30:40] <wikibugs>	 (CR) Fabfur: [C:-1] "Do not merge before 28/01/2025" [puppet] - https://gerrit.wikimedia.org/r/1114415 (https://phabricator.wikimedia.org/T378578) (owner: Fabfur)
[16:30:59] <wikibugs>	 (CR) Fabfur: [C:-1] "Do not merge before 28/01/2025" [puppet] - https://gerrit.wikimedia.org/r/1114417 (https://phabricator.wikimedia.org/T378578) (owner: Fabfur)
[16:31:29] <wikibugs>	 (CR) Herron: [C:+1] thanos: send sigkill as needed to stateless components [puppet] - https://gerrit.wikimedia.org/r/1114336 (https://phabricator.wikimedia.org/T383570) (owner: Filippo Giunchedi)
[16:31:34] <ottomata>	 effie: okay thanks, I have another one i didn't merge
[16:31:43] <ottomata>	 (meetings started anyway)
[16:31:46] <wikibugs>	 (Abandoned) Klausman: admin_ng/values/ml-staging: add cluster_group [deployment-charts] - https://gerrit.wikimedia.org/r/1114414 (https://phabricator.wikimedia.org/T369493) (owner: Klausman)
[16:32:02] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[16:32:18] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[16:32:48] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-ctrl[1002-1003].eqiad.wmnet
[16:32:50] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.remove-downtime for wikikube-ctrl[1002-1003].eqiad.wmnet
[16:32:51] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-ctrl[1002-1003].eqiad.wmnet
[16:32:52] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-ctrl[1002-1003].eqiad.wmnet
[16:34:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 2%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P72505 and previous config saved to /var/cache/conftool/dbconfig/20250127-163437-root.json
[16:34:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72506 and previous config saved to /var/cache/conftool/dbconfig/20250127-163449-root.json
[16:35:10] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudgw1003.eqiad.wmnet with OS bookworm
[16:39:55] <wikibugs>	 (CR) Klausman: [V:+2 C:+2] knative-serving: add support for PSS [deployment-charts] - https://gerrit.wikimedia.org/r/1114016 (https://phabricator.wikimedia.org/T369493) (owner: Elukey)
[16:40:50] <wikibugs>	 (PS1) Elukey: services: update Kartotherian's docker image [deployment-charts] - https://gerrit.wikimedia.org/r/1114420 (https://phabricator.wikimedia.org/T384530)
[16:41:28] <wikibugs>	 (PS2) Elukey: services: update Kartotherian's docker image [deployment-charts] - https://gerrit.wikimedia.org/r/1114420 (https://phabricator.wikimedia.org/T384530)
[16:42:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T384592)', diff saved to https://phabricator.wikimedia.org/P72507 and previous config saved to /var/cache/conftool/dbconfig/20250127-164231-marostegui.json
[16:42:37] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[16:43:56] <wikibugs>	 (CR) Elukey: [C:+2] services: update Kartotherian's docker image [deployment-charts] - https://gerrit.wikimedia.org/r/1114420 (https://phabricator.wikimedia.org/T384530) (owner: Elukey)
[16:43:59] <wikibugs>	 (Merged) jenkins-bot: knative-serving: add support for PSS [deployment-charts] - https://gerrit.wikimedia.org/r/1114016 (https://phabricator.wikimedia.org/T369493) (owner: Elukey)
[16:44:39] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync
[16:45:25] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync
[16:46:48] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2075']
[16:47:22] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-be2075']
[16:48:01] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1164.eqiad.wmnet with OS bookworm
[16:49:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 3%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P72508 and previous config saved to /var/cache/conftool/dbconfig/20250127-164942-root.json
[16:49:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72509 and previous config saved to /var/cache/conftool/dbconfig/20250127-164955-root.json
[16:52:18] <swfrench-wmf>	 !jouncebot nowandnext
[16:52:18] <wm-bot>	 a Python reminder bot for deployments. see https://wikitech.wikimedia.org/wiki/Tool:Jouncebot
[16:52:35] <swfrench-wmf>	 jouncebot: nowandnext
[16:52:35] <jouncebot>	 For the next 0 hour(s) and 7 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T1630)
[16:52:35] <jouncebot>	 In 1 hour(s) and 7 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T1800)
[16:52:35] <jouncebot>	 In 1 hour(s) and 7 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T1800)
[16:52:37] <swfrench-wmf>	 lol
[16:52:56] <Lucas_WMDE>	 hehe
[16:54:45] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[16:56:18] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[16:57:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P72510 and previous config saved to /var/cache/conftool/dbconfig/20250127-165738-marostegui.json
[16:58:24] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1159-1165].eqiad.wmnet
[16:58:26] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1159-1165].eqiad.wmnet
[16:58:32] <swfrench-wmf>	 alright, after a bit of a delay, we're going to ramp the fraction of enrolled traffic up a bit more (still at / below 1% of external web / API traffic)
[17:01:01] <wikibugs>	 (PS2) Scott French: Enroll 1% of client sessions in PHP 8.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1113567 (https://phabricator.wikimedia.org/T383845)
[17:03:03] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1113567 (https://phabricator.wikimedia.org/T383845) (owner: Scott French)
[17:03:49] <wikibugs>	 (Merged) jenkins-bot: Enroll 1% of client sessions in PHP 8.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1113567 (https://phabricator.wikimedia.org/T383845) (owner: Scott French)
[17:04:04] <logmsgbot>	 !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1113567|Enroll 1% of client sessions in PHP 8.1 (T383845)]]
[17:04:09] <stashbot>	 T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[17:04:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 4%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P72511 and previous config saved to /var/cache/conftool/dbconfig/20250127-170448-root.json
[17:05:30] <jinxer-wm>	 FIRING: ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:09:05] <logmsgbot>	 !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1113567|Enroll 1% of client sessions in PHP 8.1 (T383845)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[17:09:10] <stashbot>	 T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[17:09:21] <logmsgbot>	 !log tchin@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[17:09:27] <logmsgbot>	 !log tchin@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[17:10:10] <logmsgbot>	 !log swfrench@deploy2002 swfrench: Continuing with sync
[17:10:30] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:12:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P72512 and previous config saved to /var/cache/conftool/dbconfig/20250127-171245-marostegui.json
[17:12:57] <wikibugs>	 (PS1) Elukey: admin_ng: disable PSP mutation for ml-staging-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1114423 (https://phabricator.wikimedia.org/T369493)
[17:13:53] <wikibugs>	 (CR) Klausman: [C:+1] admin_ng: disable PSP mutation for ml-staging-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1114423 (https://phabricator.wikimedia.org/T369493) (owner: Elukey)
[17:14:45] <ottomata>	 effie: swfrench-wmf, are you all still deploying?  I have another beta only patch to merge.  I don't need it deployed in production, but it should go out eventually.  
[17:14:45] <ottomata>	 I absolutely can wait if that is better for you
[17:15:19] <swfrench-wmf>	 ottomata: thanks for checking! yes, we're still deploying, so that would be great if you could hold.
[17:15:55] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[17:18:06] <logmsgbot>	 !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1113567|Enroll 1% of client sessions in PHP 8.1 (T383845)]] (duration: 14m 02s)
[17:18:11] <stashbot>	 T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[17:18:15] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:19:24] <wikibugs>	 (PS1) C. Scott Ananian: Condense wikivoyage configuration options for Parsoid Read Views [mediawiki-config] - https://gerrit.wikimedia.org/r/1114425 (https://phabricator.wikimedia.org/T365367)
[17:19:32] <wikibugs>	 (CR) Clare Ming: "should we enable this for labswiki too?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1114396 (https://phabricator.wikimedia.org/T384728) (owner: Phuedx)
[17:19:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 5%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P72513 and previous config saved to /var/cache/conftool/dbconfig/20250127-171953-root.json
[17:22:17] <wikibugs>	 (PS2) C. Scott Ananian: Condense wikivoyage configuration options for Parsoid Read Views [mediawiki-config] - https://gerrit.wikimedia.org/r/1114425 (https://phabricator.wikimedia.org/T365367)
[17:24:29] <ottomata>	 swfrench-wmf: 👍 ty
[17:25:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.143s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:26:27] <wikibugs>	 (CR) Cathal Mooney: [C:+1] "LGTM, lint error seems unrelated, think we need to add a line telling it to ignore it for that function" [cookbooks] - https://gerrit.wikimedia.org/r/1112171 (owner: Muehlenhoff)
[17:27:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T384592)', diff saved to https://phabricator.wikimedia.org/P72514 and previous config saved to /var/cache/conftool/dbconfig/20250127-172752-marostegui.json
[17:27:57] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[17:28:08] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db2178.codfw.wmnet with reason: Maintenance
[17:28:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2178 (T384592)', diff saved to https://phabricator.wikimedia.org/P72515 and previous config saved to /var/cache/conftool/dbconfig/20250127-172814-marostegui.json
[17:28:23] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[17:30:05] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+1] ml-services: update reference-quality storage uri [deployment-charts] - https://gerrit.wikimedia.org/r/1114401 (https://phabricator.wikimedia.org/T384172) (owner: AikoChou)
[17:30:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.143s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:30:45] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:34:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 10%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P72516 and previous config saved to /var/cache/conftool/dbconfig/20250127-173458-root.json
[17:35:42] <effie>	 jouncebot: now
[17:35:43] <jouncebot>	 For the next 0 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC afternoon, 2nd attempt) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T1705)
[17:35:49] <effie>	 ottomata: we are done, you could use the rest of our window if you want 
[17:38:30] <ottomata>	 effie:  ty
[17:38:35] <ottomata>	 doing!
[17:40:20] <wikibugs>	 (CR) Ottomata: [C:+2] beta wgEventStreams - set hoist_fields_from_http_headers on metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1114409 (https://phabricator.wikimedia.org/T382173) (owner: Ottomata)
[17:41:17] <wikibugs>	 (CR) Subramanya Sastry: [C:+1] Condense wikivoyage configuration options for Parsoid Read Views [mediawiki-config] - https://gerrit.wikimedia.org/r/1114425 (https://phabricator.wikimedia.org/T365367) (owner: C. Scott Ananian)
[17:41:33] <wikibugs>	 (Merged) jenkins-bot: beta wgEventStreams - set hoist_fields_from_http_headers on metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1114409 (https://phabricator.wikimedia.org/T382173) (owner: Ottomata)
[17:48:10] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync
[17:48:28] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync
[17:48:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T384592)', diff saved to https://phabricator.wikimedia.org/P72517 and previous config saved to /var/cache/conftool/dbconfig/20250127-174833-marostegui.json
[17:48:39] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[17:49:22] <wikibugs>	 (PS1) Volans: sre.hosts.decommission: fix CI [cookbooks] - https://gerrit.wikimedia.org/r/1114432
[17:50:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 25%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P72518 and previous config saved to /var/cache/conftool/dbconfig/20250127-175004-root.json
[17:50:12] <wikibugs>	 (CR) Volans: "rebasing on top of I0b9bd18c5c9d606dca49c580075b2aa0e9e9a677 should fix it" [cookbooks] - https://gerrit.wikimedia.org/r/1112171 (owner: Muehlenhoff)
[17:54:40] <wikibugs>	 (CR) Dzahn: [V:-1 C:-1] "Ah yea.. so I made this before there was "profile::tlsproxy::envoy::firewall_src_sets". It was an attempt to make it work with an empty (n" [puppet] - https://gerrit.wikimedia.org/r/1055491 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn)
[17:55:24] <logmsgbot>	 !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudgw1003.eqiad.wmnet with OS bookworm
[17:55:43] <wikibugs>	 (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114433
[17:55:56] <Reedy>	 jouncebot: nowandnext
[17:55:57] <jouncebot>	 For the next 0 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC afternoon, 2nd attempt) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T1705)
[17:55:57] <jouncebot>	 In 0 hour(s) and 4 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T1800)
[17:55:57] <jouncebot>	 In 0 hour(s) and 4 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T1800)
[17:56:20] <wikibugs>	 (03CR) 10Volans: [C:03+2] "self merging, trivial." [cookbooks] - 10https://gerrit.wikimedia.org/r/1114432 (owner: 10Volans)
[17:56:54] <wikibugs>	 (03CR) 10Dzahn: [V:04-1 C:04-1] "let's use your patch in this case. it already has more lines and is more current. I will abandon this one in favor of yours." [puppet] - 10https://gerrit.wikimedia.org/r/1055491 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn)
[17:57:05] <wikibugs>	 (03Abandoned) 10Dzahn: ci: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055491 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn)
[18:00:05] <jouncebot>	 swfrench-wmf and effie: Your horoscope predicts another MediaWiki infrastructure (UTC afternoon, 2nd attempt) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T1705).
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T1800)
[18:00:05] <jouncebot>	 ryankemper: Your horoscope predicts another Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T1800).
[18:00:12] <wikibugs>	 (03PS1) 10Reedy: LicenseParser: Avoid passing null to string functions [extensions/CommonsMetadata] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114435 (https://phabricator.wikimedia.org/T384853)
[18:00:24] <wikibugs>	 (03CR) 10Reedy: [C:03+2] LicenseParser: Avoid passing null to string functions [extensions/CommonsMetadata] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114435 (https://phabricator.wikimedia.org/T384853) (owner: 10Reedy)
[18:01:13] <swfrench-wmf>	 ah, interesting side effect of overlapping deployment windows :)
[18:02:19] <swfrench-wmf>	 no further deployments planned on our end, but it looks like R.eedy is preparing cherrypicks to backport for the couple of deprecation errors we've seen
[18:02:20] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hosts.decommission: fix CI [cookbooks] - 10https://gerrit.wikimedia.org/r/1114432 (owner: 10Volans)
[18:03:13] <swfrench-wmf>	 more than happy to see those move during the window if ready
[18:03:32] <wikibugs>	 (03Merged) 10jenkins-bot: LicenseParser: Avoid passing null to string functions [extensions/CommonsMetadata] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114435 (https://phabricator.wikimedia.org/T384853) (owner: 10Reedy)
[18:03:34] <wikibugs>	 (03PS3) 10Muehlenhoff: sre.hosts.reimage: Add link to the help text for move-vlan [cookbooks] - 10https://gerrit.wikimedia.org/r/1112171
[18:03:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P72520 and previous config saved to /var/cache/conftool/dbconfig/20250127-180341-marostegui.json
[18:05:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 50%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P72521 and previous config saved to /var/cache/conftool/dbconfig/20250127-180509-root.json
[18:05:40] <wikibugs>	 (03PS14) 10Dzahn: releases: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055494 (https://phabricator.wikimedia.org/T370677)
[18:07:02] <wikibugs>	 (03PS15) 10Dzahn: releases: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055494 (https://phabricator.wikimedia.org/T370677)
[18:08:28] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114437
[18:08:31] <wikibugs>	 (03PS1) 10Dzahn: Revert "gerrit: block alibaba Cloud IPs" [puppet] - 10https://gerrit.wikimedia.org/r/1114438
[18:08:51] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] Upgrade to CAS 7.1 [dns] - 10https://gerrit.wikimedia.org/r/1114388 (owner: 10Slyngshede)
[18:09:02] <Reedy>	 Gets the fixed noise out of the way... especially when I'd probably expect more like that from the same underlying PHP functions
[18:09:10] <wikibugs>	 (03CR) 10Dzahn: "Just created this because I saw someone added a TODO to revert this. Do you (still) think we should revert it now or just keep it as addit" [puppet] - 10https://gerrit.wikimedia.org/r/1114438 (owner: 10Dzahn)
[18:10:01] <logmsgbot>	 !log tchin@deploy2002 Started deploy [airflow-dags/analytics@c49f40b]: Deploying airflow for T357684
[18:10:06] <stashbot>	 T357684: Dashboard and alerting of data quality metrics for wmf_content.mediawiki_content_history_v1 - https://phabricator.wikimedia.org/T357684
[18:10:39] <logmsgbot>	 !log tchin@deploy2002 Finished deploy [airflow-dags/analytics@c49f40b]: Deploying airflow for T357684 (duration: 01m 01s)
[18:13:08] <wikibugs>	 (03PS1) 10Daimona Eaytoy: prod: Enable $wgCampaignEventsEnableEventTopics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114440 (https://phabricator.wikimedia.org/T380818)
[18:13:38] <wikibugs>	 (03CR) 10Clare Ming: [C:03+1] testwiki: Enable MetricsPlatform experiment enrollment overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114396 (https://phabricator.wikimedia.org/T384728) (owner: 10Phuedx)
[18:13:46] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114440 (https://phabricator.wikimedia.org/T380818) (owner: 10Daimona Eaytoy)
[18:13:50] <wikibugs>	 (03CR) 10CI reject: [V:04-1] prod: Enable $wgCampaignEventsEnableEventTopics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114440 (https://phabricator.wikimedia.org/T380818) (owner: 10Daimona Eaytoy)
[18:14:03] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114396 (https://phabricator.wikimedia.org/T384728) (owner: 10Phuedx)
[18:14:08] <wikibugs>	 (03CR) 10Dzahn: [C:04-1] "so.. rebased and "just switch it" currently fails in this manner, FYI:" [puppet] - 10https://gerrit.wikimedia.org/r/1055494 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn)
[18:15:01] <wikibugs>	 (03CR) 10Dzahn: [C:04-1] "so it's still requiring / looking for ferm related resources" [puppet] - 10https://gerrit.wikimedia.org/r/1055494 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn)
[18:16:43] <logmsgbot>	 !log reedy@deploy2002 Synchronized php-1.44.0-wmf.13/extensions/CommonsMetadata/: T384853 T384854 (duration: 10m 45s)
[18:16:49] <stashbot>	 T384853: PHP Deprecated: trim(): Passing null to parameter #1 ($string) of type string is deprecated - https://phabricator.wikimedia.org/T384853
[18:16:49] <stashbot>	 T384854: PHP Deprecated: strtolower(): Passing null to parameter #1 ($string) of type string is deprecated - https://phabricator.wikimedia.org/T384854
[18:17:15] <wikibugs>	 (03PS2) 10Daimona Eaytoy: prod: Enable $wgCampaignEventsEnableEventTopics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114440 (https://phabricator.wikimedia.org/T380818)
[18:18:23] <wikibugs>	 (03PS2) 10BCornwall: controol: rm ats-be services cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/1114074
[18:18:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P72522 and previous config saved to /var/cache/conftool/dbconfig/20250127-181847-marostegui.json
[18:20:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 75%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P72523 and previous config saved to /var/cache/conftool/dbconfig/20250127-182014-root.json
[18:20:15] <wikibugs>	 (03CR) 10BCornwall: "Thanks, makes sense! I've updated the PS" [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall)
[18:20:51] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudgw1004.eqiad.wmnet with OS bookworm
[18:23:09] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4866/console" [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall)
[18:23:34] <wikibugs>	 (03PS1) 10Dzahn: gerrit: remove UA-based blocking of some old bots/spiders [puppet] - 10https://gerrit.wikimedia.org/r/1114442
[18:24:32] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4867/console" [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall)
[18:26:27] <wikibugs>	 (03PS3) 10Raymond Ndibe: [toolforge::harbor] use latest thirdparty/docker [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720)
[18:26:48] <wikibugs>	 (03CR) 10CI reject: [V:04-1] [toolforge::harbor] use latest thirdparty/docker [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720) (owner: 10Raymond Ndibe)
[18:26:50] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4868/console" [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall)
[18:27:03] <logmsgbot>	 !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudgw1004.eqiad.wmnet with OS bookworm
[18:27:48] <wikibugs>	 (03PS3) 10Raymond Ndibe: [toolforge::harbor] upgrade harbor v2.10.1 ---> v2.12.2 [puppet] - 10https://gerrit.wikimedia.org/r/1113871 (https://phabricator.wikimedia.org/T358225)
[18:28:35] <wikibugs>	 (03CR) 10Hashar: "Some of those crawlers still have hit (Baidu, Sogou, bingbot). I revisited them some months ago :)" [puppet] - 10https://gerrit.wikimedia.org/r/1114442 (owner: 10Dzahn)
[18:30:23] <wikibugs>	 (03PS4) 10Raymond Ndibe: [toolforge::harbor] use latest thirdparty/docker [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720)
[18:30:44] <wikibugs>	 (03CR) 10Dzahn: "digging back further, these are PRE 2012/2013" [puppet] - 10https://gerrit.wikimedia.org/r/1114442 (owner: 10Dzahn)
[18:30:54] <wikibugs>	 (03PS2) 10Dzahn: gerrit: remove UA-based blocking of some old bots/spiders [puppet] - 10https://gerrit.wikimedia.org/r/1114442
[18:31:54] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[18:33:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T384592)', diff saved to https://phabricator.wikimedia.org/P72524 and previous config saved to /var/cache/conftool/dbconfig/20250127-183355-marostegui.json
[18:34:00] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[18:34:10] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db2192.codfw.wmnet with reason: Maintenance
[18:34:17] <wikibugs>	 (03PS4) 10Raymond Ndibe: [toolforge::harbor] upgrade harbor v2.10.1 ---> v2.12.2 [puppet] - 10https://gerrit.wikimedia.org/r/1113871 (https://phabricator.wikimedia.org/T358225)
[18:34:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2192 (T384592)', diff saved to https://phabricator.wikimedia.org/P72525 and previous config saved to /var/cache/conftool/dbconfig/20250127-183417-marostegui.json
[18:35:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 100%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P72526 and previous config saved to /var/cache/conftool/dbconfig/20250127-183519-root.json
[18:35:27] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  pay-lb1001 - vriley@cumin1002"
[18:35:31] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  pay-lb1001 - vriley@cumin1002"
[18:35:32] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:36:20] <wikibugs>	 (03Abandoned) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1114146 (owner: 10TrainBranchBot)
[18:44:31] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1114445
[18:46:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P72527 and previous config saved to /var/cache/conftool/dbconfig/20250127-184642-root.json
[18:48:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2194', diff saved to https://phabricator.wikimedia.org/P72528 and previous config saved to /var/cache/conftool/dbconfig/20250127-184839-marostegui.json
[18:51:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T384592)', diff saved to https://phabricator.wikimedia.org/P72529 and previous config saved to /var/cache/conftool/dbconfig/20250127-185104-marostegui.json
[18:51:09] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[18:52:11] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2025.01.11 - 2025.01.31): decommission cloudelastic100[5-6] - https://phabricator.wikimedia.org/T380937#10498376 (10Papaul)
[18:52:33] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2025.01.11 - 2025.01.31): decommission cloudelastic100[5-6] - https://phabricator.wikimedia.org/T380937#10498379 (10Papaul) 05Open→03Resolved a:03Papaul complete
[18:56:12] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[18:56:33] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T384645#10498403 (10Papaul) 05Open→03Resolved a:03Papaul fixed
[18:57:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72530 and previous config saved to /var/cache/conftool/dbconfig/20250127-185715-root.json
[18:59:08] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1114445 (owner: 10TrainBranchBot)
[18:59:42] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  pay-lb1002~ - vriley@cumin1002"
[18:59:46] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  pay-lb1002~ - vriley@cumin1002"
[18:59:46] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:00:02] <wikibugs>	 (03PS1) 10Mstyles: security-landing-page: deploying update [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114446 (https://phabricator.wikimedia.org/T383098)
[19:00:43] <wikibugs>	 (03CR) 10Ssingh: "Your change looks good but I think we will need to update one more thing and additionally, check that we are not referencing ats-be anywhe" [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall)
[19:06:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P72531 and previous config saved to /var/cache/conftool/dbconfig/20250127-190611-marostegui.json
[19:08:37] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:12:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72532 and previous config saved to /var/cache/conftool/dbconfig/20250127-191220-root.json
[19:13:35] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[19:15:34] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#10498484 (10VRiley-WMF)
[19:16:52] <wikibugs>	 06SRE, 10Deployments, 06Release-Engineering-Team, 13Patch-For-Review: Reduce automatic messages on #wikimedia-operations - https://phabricator.wikimedia.org/T384804#10498487 (10Quiddity) Side-note in case it helps anyone (and probably only potentially helps IRCCloud users): I've been using some custom-CSS...
[19:17:11] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding puppetserver2004 to codfw - jhancock@cumin2002"
[19:17:16] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding puppetserver2004 to codfw - jhancock@cumin2002"
[19:17:16] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:18:15] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[19:18:32] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[19:19:27] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[19:21:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P72533 and previous config saved to /var/cache/conftool/dbconfig/20250127-192118-marostegui.json
[19:21:50] <wikibugs>	 (03PS1) 10Jforrester: MemcachedBagOStuff: Null coalescing $component [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114448 (https://phabricator.wikimedia.org/T384858)
[19:23:36] <wikibugs>	 (03PS1) 10Ottomata: beta wgEventStreams - test Opt out of collecting user-agent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114449 (https://phabricator.wikimedia.org/T382173)
[19:24:39] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[19:25:43] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[19:25:45] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[19:26:14] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10498501 (10Jhancock.wm)
[19:26:59] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10498503 (10Jhancock.wm) provisioning failing. will check bios settings later and then try again.
[19:27:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72534 and previous config saved to /var/cache/conftool/dbconfig/20250127-192725-root.json
[19:28:02] <wikibugs>	 (03CR) 10Ottomata: [C:03+2] beta wgEventStreams - test Opt out of collecting user-agent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114449 (https://phabricator.wikimedia.org/T382173) (owner: 10Ottomata)
[19:28:53] <wikibugs>	 (03Merged) 10jenkins-bot: beta wgEventStreams - test Opt out of collecting user-agent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114449 (https://phabricator.wikimedia.org/T382173) (owner: 10Ottomata)
[19:36:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T384592)', diff saved to https://phabricator.wikimedia.org/P72535 and previous config saved to /var/cache/conftool/dbconfig/20250127-193625-marostegui.json
[19:36:30] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[19:36:41] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db2201.codfw.wmnet with reason: Maintenance
[19:42:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72536 and previous config saved to /var/cache/conftool/dbconfig/20250127-194231-root.json
[19:50:14] <wikibugs>	 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[3-8] - https://phabricator.wikimedia.org/T384838#10498575 (10RobH)
[19:51:31] <jinxer-wm>	 FIRING: [5x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[19:52:09] <wikibugs>	 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[46-51] - https://phabricator.wikimedia.org/T384838#10498581 (10RobH)
[19:57:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72537 and previous config saved to /var/cache/conftool/dbconfig/20250127-195736-root.json
[19:59:34] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db2211.codfw.wmnet with reason: Maintenance
[19:59:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2211 (T384592)', diff saved to https://phabricator.wikimedia.org/P72538 and previous config saved to /var/cache/conftool/dbconfig/20250127-195939-marostegui.json
[19:59:45] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[20:02:16] <wikibugs>	 (03PS1) 10Jforrester: SimpleCaptcha: Don't look up captcha if no ID was given [extensions/ConfirmEdit] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114454 (https://phabricator.wikimedia.org/T384858)
[20:06:37] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10498642 (10RobH) After Andrew pinged about this today in IRC, I can see on the system it has the alarms on idrac: System Inlet Temperature  35 °C (95.0 °F) w...
[20:08:27] <wikibugs>	 (03CR) 10Hashar: "You can look at the accesslog via https://logstash.wikimedia.org/app/dashboards#/view/825c5c80-8aef-11eb-8ab2-63c7f3b019fc and filtering o" [puppet] - 10https://gerrit.wikimedia.org/r/1114442 (owner: 10Dzahn)
[20:13:55] <wikibugs>	 (03PS1) 10TheAnarcat: dump backtrace on exception, on --trace [software/cumin] - 10https://gerrit.wikimedia.org/r/1114456 (https://phabricator.wikimedia.org/T384539)
[20:18:08] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10498659 (10RobH) >>! In T383723#10498642, @RobH wrote: > After Andrew pinged about this today in IRC, I can see on the system it has the alarms on idrac: Sys...
[20:18:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T384592)', diff saved to https://phabricator.wikimedia.org/P72539 and previous config saved to /var/cache/conftool/dbconfig/20250127-201832-marostegui.json
[20:20:35] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] controol: rm ats-be services cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall)
[20:21:34] <wikibugs>	 06SRE, 06Commons, 10MediaWiki-Uploading, 06Traffic: HTTP 503 error when uploading images on Wikimedia Commons - https://phabricator.wikimedia.org/T383274#10498666 (10RLazarus)
[20:26:06] <wikibugs>	 06SRE, 06Data-Engineering, 10Observability-Logging, 13Patch-For-Review: Add x-analytics nocookie=1 and x-tls-sess to webrequest-sampled-live stream - https://phabricator.wikimedia.org/T383900#10498690 (10RLazarus)
[20:28:21] <wikibugs>	 (03CR) 10SBassett: [C:03+2] security-landing-page: deploying update [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114446 (https://phabricator.wikimedia.org/T383098) (owner: 10Mstyles)
[20:28:42] <wikibugs>	 10ops-eqiad, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T384869 (10phaultfinder) 03NEW
[20:29:37] <wikibugs>	 (03Merged) 10jenkins-bot: security-landing-page: deploying update [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114446 (https://phabricator.wikimedia.org/T383098) (owner: 10Mstyles)
[20:33:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P72540 and previous config saved to /var/cache/conftool/dbconfig/20250127-203339-marostegui.json
[20:36:37] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops: mwgrep cannot be used from a deployment host - https://phabricator.wikimedia.org/T384764#10498755 (10RLazarus) This isn't working because it was never upgraded to Python 3. (`reload` was a built-in function in Python 2, moved to `importlib` in 3.) The mwmaint hosts are still...
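A minimal sketch of the Python 2 to 3 change mentioned in the comment above, assuming a Python 3 interpreter. The `json` module is only a stand-in for illustration; it is not what mwgrep actually reloads, and this is not the fix applied in T384764.

```python
import builtins
import importlib
import json  # stand-in module for illustration only (an assumption)

# Python 2 spelling, which raises NameError on Python 3:
#     reload(json)

# On Python 3 there is no built-in reload any more.
has_builtin_reload = hasattr(builtins, "reload")

# Python 3 spelling: reload lives in importlib and returns the module object.
reloaded = importlib.reload(json)
same_object = reloaded is json
```

Code written for Python 2 that calls a bare `reload(...)` therefore needs an `importlib` import and the qualified call before it can run on Python 3.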
[20:36:44] <papaul>	 !log power down  logging-hd1005 for maintenance 
[20:36:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:27] <icinga-wm>	 PROBLEM - Host logging-hd1005 is DOWN: PING CRITICAL - Packet loss = 100%
[20:39:15] <wikibugs>	 06SRE, 07SRE-Unowned, 10Deployments, 06Release-Engineering-Team, 13Patch-For-Review: Reduce automatic messages on #wikimedia-operations - https://phabricator.wikimedia.org/T384804#10498758 (10RLazarus)
[20:45:50] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host logging-hd1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[20:45:59] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host logging-hd1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[20:48:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P72541 and previous config saved to /var/cache/conftool/dbconfig/20250127-204846-marostegui.json
[20:50:49] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host logging-hd1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[20:56:22] <wikibugs>	 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T377607#10498829 (10VRiley-WMF) IP address was not set up on the management port. Reran the cookbook and it set it in place. This should be good to go.
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T2100).
[21:00:05] <jouncebot>	 cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:53] <cjming>	 hi ! i will self-deploy
[21:01:11] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logging-hd1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[21:01:46] <wikibugs>	 (03PS2) 10Phuedx: testwiki: Enable MetricsPlatform experiment enrollment overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114396 (https://phabricator.wikimedia.org/T384728)
[21:03:24] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114396 (https://phabricator.wikimedia.org/T384728) (owner: 10Phuedx)
[21:03:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T384592)', diff saved to https://phabricator.wikimedia.org/P72542 and previous config saved to /var/cache/conftool/dbconfig/20250127-210353-marostegui.json
[21:03:58] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[21:04:06] <wikibugs>	 (03Merged) 10jenkins-bot: testwiki: Enable MetricsPlatform experiment enrollment overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114396 (https://phabricator.wikimedia.org/T384728) (owner: 10Phuedx)
[21:04:09] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db2223.codfw.wmnet with reason: Maintenance
[21:04:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2223 (T384592)', diff saved to https://phabricator.wikimedia.org/P72543 and previous config saved to /var/cache/conftool/dbconfig/20250127-210415-marostegui.json
[21:04:26] <logmsgbot>	 !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1114396|testwiki: Enable MetricsPlatform experiment enrollment overrides (T384728)]]
[21:04:30] <stashbot>	 T384728: Enable MetricsPlatform overrides on testwiki - https://phabricator.wikimedia.org/T384728
[21:05:09] <icinga-wm>	 RECOVERY - Host logging-hd1005 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms
[21:08:07] <logmsgbot>	 !log cjming@deploy2002 phuedx, cjming: Backport for [[gerrit:1114396|testwiki: Enable MetricsPlatform experiment enrollment overrides (T384728)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:08:13] <logmsgbot>	 !log cjming@deploy2002 phuedx, cjming: Continuing with sync
[21:09:44] <wikibugs>	 ops-eqiad, SRE, collaboration-services, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T377607#10498863 (VRiley-WMF) Open→Resolved
[21:12:57] <wikibugs>	 (CR) Gergő Tisza: "It just seems like there is a set of checks we'd need to repeat every time we add a new extension, or after certain changes in existing ex" [mediawiki-config] - https://gerrit.wikimedia.org/r/1114351 (owner: Gergő Tisza)
[21:14:56] <logmsgbot>	 !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1114396|testwiki: Enable MetricsPlatform experiment enrollment overrides (T384728)]] (duration: 10m 30s)
[21:15:01] <stashbot>	 T384728: Enable MetricsPlatform overrides on testwiki - https://phabricator.wikimedia.org/T384728
[21:15:39] <cjming>	 i'll hang out for a bit in case anyone else shows up -- then close the backport window
[21:26:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T384592)', diff saved to https://phabricator.wikimedia.org/P72544 and previous config saved to /var/cache/conftool/dbconfig/20250127-212656-marostegui.json
[21:27:01] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[21:27:20] <wikibugs>	 ops-eqiad, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T384869#10498919 (VRiley-WMF) Open→Resolved a:VRiley-WMF Reseated power supply. It shows that it's normal now.
[21:29:35] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10498923 (phaultfinder)
[21:33:15] <cjming>	 !log end of UTC late backport window
[21:33:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P72546 and previous config saved to /var/cache/conftool/dbconfig/20250127-214203-marostegui.json
[21:57:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P72547 and previous config saved to /var/cache/conftool/dbconfig/20250127-215710-marostegui.json
[22:00:04] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: OwO what's this, a deployment window?? Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250127T2200). nyaa~
[22:12:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T384592)', diff saved to https://phabricator.wikimedia.org/P72548 and previous config saved to /var/cache/conftool/dbconfig/20250127-221217-marostegui.json
[22:12:33] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db2228.codfw.wmnet with reason: Maintenance
[22:12:48] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2186.codfw.wmnet with reason: Maintenance
[22:12:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2228 (T384592)', diff saved to https://phabricator.wikimedia.org/P72549 and previous config saved to /var/cache/conftool/dbconfig/20250127-221255-marostegui.json
[22:29:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T384592)', diff saved to https://phabricator.wikimedia.org/P72550 and previous config saved to /var/cache/conftool/dbconfig/20250127-222947-marostegui.json
[22:29:53] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[22:31:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10499096 (VRiley-WMF) Is this okay to be closed?
[22:37:10] <wikibugs>	 (CR) Btullis: [V:+1 C:+2] Add the service_proxy to an-launcher1002 [puppet] - https://gerrit.wikimedia.org/r/1114376 (https://phabricator.wikimedia.org/T384329) (owner: Btullis)
[22:38:03] <wikibugs>	 (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/1114336 (https://phabricator.wikimedia.org/T383570) (owner: Filippo Giunchedi)
[22:38:28] <logmsgbot>	 !log mstyles@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[22:39:10] <logmsgbot>	 !log mstyles@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[22:39:36] <logmsgbot>	 !log mstyles@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[22:39:50] <logmsgbot>	 !log mstyles@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[22:40:21] <logmsgbot>	 !log mstyles@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[22:40:47] <logmsgbot>	 !log mstyles@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[22:41:10] <logmsgbot>	 !log mstyles@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[22:41:12] <logmsgbot>	 !log mstyles@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[22:41:20] <logmsgbot>	 !log mstyles@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[22:41:23] <logmsgbot>	 !log mstyles@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[22:44:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P72552 and previous config saved to /var/cache/conftool/dbconfig/20250127-224455-marostegui.json
[22:45:36] <logmsgbot>	 !log pt1979@cumin1002 START - Cookbook sre.hosts.dhcp for host cloudgw1003.eqiad.wmnet
[22:46:11] <logmsgbot>	 !log pt1979@cumin1002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host cloudgw1003.eqiad.wmnet
[22:46:20] <logmsgbot>	 !log pt1979@cumin1002 START - Cookbook sre.hosts.dhcp for host cloudgw1003.eqiad.wmnet
[22:51:17] <wikibugs>	 (CR) Btullis: [V:+1 C:+2] "We added support for this today - in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1114383" [puppet] - https://gerrit.wikimedia.org/r/1114376 (https://phabricator.wikimedia.org/T384329) (owner: Btullis)
[22:55:35] <wikibugs>	 (PS1) Btullis: Add the mw-misc service_proxy listener to an-launcher1002 [puppet] - https://gerrit.wikimedia.org/r/1114470 (https://phabricator.wikimedia.org/T384329)
[22:57:02] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4870/co" [puppet] - https://gerrit.wikimedia.org/r/1114470 (https://phabricator.wikimedia.org/T384329) (owner: Btullis)
[23:00:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P72553 and previous config saved to /var/cache/conftool/dbconfig/20250127-230002-marostegui.json
[23:02:28] <wikibugs>	 (Abandoned) Reedy: Improve error summary [extensions/UploadWizard] (wmf/1.44.0-wmf.11) - https://gerrit.wikimedia.org/r/1110757 (https://phabricator.wikimedia.org/T381333) (owner: Reedy)
[23:02:31] <wikibugs>	 (Abandoned) Reedy: Fix UW error summary [extensions/UploadWizard] (wmf/1.44.0-wmf.11) - https://gerrit.wikimedia.org/r/1110755 (https://phabricator.wikimedia.org/T383182) (owner: Reedy)
[23:06:50] <wikibugs>	 (CR) Btullis: [V:+1 C:+2] Add the mw-misc service_proxy listener to an-launcher1002 [puppet] - https://gerrit.wikimedia.org/r/1114470 (https://phabricator.wikimedia.org/T384329) (owner: Btullis)
[23:08:38] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:14:19] <wikibugs>	 (CR) Urbanecm: [C:-1] Add configurable MinimumTasksPerTopic (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1113984 (https://phabricator.wikimedia.org/T383714) (owner: Cyndywikime)
[23:15:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T384592)', diff saved to https://phabricator.wikimedia.org/P72554 and previous config saved to /var/cache/conftool/dbconfig/20250127-231509-marostegui.json
[23:15:14] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[23:22:35] <wikibugs>	 ops-eqiad, SRE, Cloud-VPS, DC-Ops, cloud-services-team (Hardware): Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10499217 (VRiley-WMF) The servers were getting the IP address from private 1-C and private 1-D, and not from th...
[23:51:31] <jinxer-wm>	 FIRING: [5x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer