[00:02:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P72318 and previous config saved to /var/cache/conftool/dbconfig/20250124-000200-marostegui.json
[00:05:31] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:17:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P72319 and previous config saved to /var/cache/conftool/dbconfig/20250124-001708-marostegui.json
[00:31:42] <icinga-wm>	 RECOVERY - Disk space on arclamp1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=arclamp1001&var-datasource=eqiad+prometheus/ops
[00:31:48] <icinga-wm>	 RECOVERY - Disk space on arclamp2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=arclamp2001&var-datasource=codfw+prometheus/ops
[00:32:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T384592)', diff saved to https://phabricator.wikimedia.org/P72320 and previous config saved to /var/cache/conftool/dbconfig/20250124-003215-marostegui.json
[00:32:19] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[00:32:31] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2180.codfw.wmnet with reason: Maintenance
[00:32:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T384592)', diff saved to https://phabricator.wikimedia.org/P72321 and previous config saved to /var/cache/conftool/dbconfig/20250124-003237-marostegui.json
[00:34:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:38:40] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1113893
[00:38:40] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1113893 (owner: TrainBranchBot)
[00:41:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T384592)', diff saved to https://phabricator.wikimedia.org/P72322 and previous config saved to /var/cache/conftool/dbconfig/20250124-004102-marostegui.json
[00:41:07] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[00:52:55] <wikibugs>	 (CR) Ladsgroup: "Is this needed? I think we can abandon this. The production has moved on for ten weekly releases." [extensions/Translate] (wmf/1.44.0-wmf.2) - https://gerrit.wikimedia.org/r/1088329 (https://phabricator.wikimedia.org/T379150) (owner: Jforrester)
[00:54:21] <wikibugs>	 (CR) Ladsgroup: "A lot of time has passed, I think you still need this, but want to double check before deploying." [puppet] - https://gerrit.wikimedia.org/r/1072268 (owner: Jforrester)
[00:56:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P72323 and previous config saved to /var/cache/conftool/dbconfig/20250124-005609-marostegui.json
[00:59:08] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1113893 (owner: TrainBranchBot)
[01:01:58] <wikibugs>	 (PS3) Pppery: Update translations [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1104740
[01:08:55] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1113895
[01:08:55] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1113895 (owner: TrainBranchBot)
[01:11:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P72324 and previous config saved to /var/cache/conftool/dbconfig/20250124-011116-marostegui.json
[01:21:40] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10491086 (phaultfinder)
[01:26:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T384592)', diff saved to https://phabricator.wikimedia.org/P72325 and previous config saved to /var/cache/conftool/dbconfig/20250124-012623-marostegui.json
[01:26:28] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[01:26:39] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2193.codfw.wmnet with reason: Maintenance
[01:26:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T384592)', diff saved to https://phabricator.wikimedia.org/P72326 and previous config saved to /var/cache/conftool/dbconfig/20250124-012645-marostegui.json
[01:28:46] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1113895 (owner: TrainBranchBot)
[01:33:46] <icinga-wm>	 PROBLEM - rt.wikimedia.org tls expiry on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[01:34:36] <icinga-wm>	 RECOVERY - rt.wikimedia.org tls expiry on moscovium is OK: OK - Certificate rt.discovery.wmnet will expire on Tue 18 Feb 2025 07:56:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[01:34:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T384592)', diff saved to https://phabricator.wikimedia.org/P72327 and previous config saved to /var/cache/conftool/dbconfig/20250124-013444-marostegui.json
[01:34:49] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[01:49:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P72328 and previous config saved to /var/cache/conftool/dbconfig/20250124-014951-marostegui.json
[02:04:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P72329 and previous config saved to /var/cache/conftool/dbconfig/20250124-020458-marostegui.json
[02:17:42] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:17:58] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:18:40] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 8.788 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:18:56] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53369 bytes in 8.261 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:20:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T384592)', diff saved to https://phabricator.wikimedia.org/P72330 and previous config saved to /var/cache/conftool/dbconfig/20250124-022005-marostegui.json
[02:20:09] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[02:20:21] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2197.codfw.wmnet with reason: Maintenance
[02:31:58] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:32:48] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:34:39] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10491141 (phaultfinder)
[02:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:48:45] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2217.codfw.wmnet with reason: Maintenance
[02:48:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2217 (T384592)', diff saved to https://phabricator.wikimedia.org/P72331 and previous config saved to /var/cache/conftool/dbconfig/20250124-024851-marostegui.json
[02:48:56] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[03:01:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:16:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:16:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T384592)', diff saved to https://phabricator.wikimedia.org/P72332 and previous config saved to /var/cache/conftool/dbconfig/20250124-031655-marostegui.json
[03:17:00] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[03:32:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P72333 and previous config saved to /var/cache/conftool/dbconfig/20250124-033202-marostegui.json
[03:34:43] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10491170 (phaultfinder)
[03:47:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P72334 and previous config saved to /var/cache/conftool/dbconfig/20250124-034709-marostegui.json
[04:02:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T384592)', diff saved to https://phabricator.wikimedia.org/P72335 and previous config saved to /var/cache/conftool/dbconfig/20250124-040216-marostegui.json
[04:02:23] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[04:02:33] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2224.codfw.wmnet with reason: Maintenance
[04:02:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2224 (T384592)', diff saved to https://phabricator.wikimedia.org/P72336 and previous config saved to /var/cache/conftool/dbconfig/20250124-040239-marostegui.json
[04:05:31] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:23:39] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:29:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T384592)', diff saved to https://phabricator.wikimedia.org/P72337 and previous config saved to /var/cache/conftool/dbconfig/20250124-042942-marostegui.json
[04:29:47] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[04:34:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:44:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P72338 and previous config saved to /var/cache/conftool/dbconfig/20250124-044449-marostegui.json
[04:59:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P72339 and previous config saved to /var/cache/conftool/dbconfig/20250124-045955-marostegui.json
[05:15:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T384592)', diff saved to https://phabricator.wikimedia.org/P72340 and previous config saved to /var/cache/conftool/dbconfig/20250124-051503-marostegui.json
[05:15:07] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[05:15:19] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2229.codfw.wmnet with reason: Maintenance
[05:15:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2229 (T384592)', diff saved to https://phabricator.wikimedia.org/P72341 and previous config saved to /var/cache/conftool/dbconfig/20250124-051525-marostegui.json
[05:16:44] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:16:58] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:17:48] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53368 bytes in 0.204 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:19:42] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 7.499 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:20:58] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:21:58] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53368 bytes in 9.858 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:22:07] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[05:22:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:27:06] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[05:27:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:28:39] <jinxer-wm>	 RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[05:35:05] <wikibugs>	 (PS1) Marostegui: Revert "db1222: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1113911
[05:35:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72342 and previous config saved to /var/cache/conftool/dbconfig/20250124-053535-root.json
[05:35:52] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db1222: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1113911 (owner: Marostegui)
[05:42:14] <wikibugs>	 (PS1) Marostegui: Revert "db2189: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1113912
[05:42:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72343 and previous config saved to /var/cache/conftool/dbconfig/20250124-054227-root.json
[05:42:38] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db2189: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1113912 (owner: Marostegui)
[05:43:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T384592)', diff saved to https://phabricator.wikimedia.org/P72344 and previous config saved to /var/cache/conftool/dbconfig/20250124-054335-marostegui.json
[05:43:39] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[05:45:07] <wikibugs>	 (Abandoned) Marostegui: mariadb: productionize db2226 [puppet] - https://gerrit.wikimedia.org/r/1087902 (https://phabricator.wikimedia.org/T373579) (owner: Arnaudb)
[05:47:37] <wikibugs>	 (PS1) Marostegui: backup1002.cnf.erb: Replace es1022 with es1043 [puppet] - https://gerrit.wikimedia.org/r/1113913 (https://phabricator.wikimedia.org/T382569)
[05:49:07] <wikibugs>	 (CR) Marostegui: "This is a NOOP. In any case, es1043 was cloned from es1022, so the data is the same and so are the users. dump user exists on es1043." [puppet] - https://gerrit.wikimedia.org/r/1113913 (https://phabricator.wikimedia.org/T382569) (owner: Marostegui)
[05:49:38] <wikibugs>	 (CR) Marostegui: [C:+2] backup1002.cnf.erb: Replace es1022 with es1043 [puppet] - https://gerrit.wikimedia.org/r/1113913 (https://phabricator.wikimedia.org/T382569) (owner: Marostegui)
[05:50:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72345 and previous config saved to /var/cache/conftool/dbconfig/20250124-055042-root.json
[05:57:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72346 and previous config saved to /var/cache/conftool/dbconfig/20250124-055733-root.json
[05:58:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2229', diff saved to https://phabricator.wikimedia.org/P72347 and previous config saved to /var/cache/conftool/dbconfig/20250124-055842-marostegui.json
[06:04:39] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:05:23] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.196 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:05:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72348 and previous config saved to /var/cache/conftool/dbconfig/20250124-060547-root.json
[06:06:13] <wikibugs>	 (CR) Kevin Bazira: EventStreamConfig: Add mediawiki.article_country_prediction_change stream (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1112451 (https://phabricator.wikimedia.org/T382295) (owner: Kevin Bazira)
[06:07:02] <wikibugs>	 (CR) Kevin Bazira: "as discussed in yesterday's meeting, we will begin by producing the weighted tags stream: https://phabricator.wikimedia.org/T382295#104898" [mediawiki-config] - https://gerrit.wikimedia.org/r/1112451 (https://phabricator.wikimedia.org/T382295) (owner: Kevin Bazira)
[06:12:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72349 and previous config saved to /var/cache/conftool/dbconfig/20250124-061238-root.json
[06:13:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2229', diff saved to https://phabricator.wikimedia.org/P72350 and previous config saved to /var/cache/conftool/dbconfig/20250124-061348-marostegui.json
[06:20:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72351 and previous config saved to /var/cache/conftool/dbconfig/20250124-062052-root.json
[06:27:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72352 and previous config saved to /var/cache/conftool/dbconfig/20250124-062744-root.json
[06:28:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T384592)', diff saved to https://phabricator.wikimedia.org/P72353 and previous config saved to /var/cache/conftool/dbconfig/20250124-062855-marostegui.json
[06:29:01] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[06:35:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72354 and previous config saved to /var/cache/conftool/dbconfig/20250124-063557-root.json
[06:42:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72355 and previous config saved to /var/cache/conftool/dbconfig/20250124-064249-root.json
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250124T0700)
[07:09:39] <jinxer-wm>	 RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[07:16:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:24:42] <Jhs>	 Hi operations folks! Automatic citations in VE are broken on most wikis. Is that worth an emergency deploy on a Friday? https://phabricator.wikimedia.org/T384661
[07:34:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10491304 (phaultfinder)
[07:35:19] <icinga-wm>	 PROBLEM - Disk space on grafana2001 is CRITICAL: DISK CRITICAL - free space: / 543MiB (3% inode=38%): /tmp 543MiB (3% inode=38%): /var/tmp 543MiB (3% inode=38%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops
[07:37:54] <wikibugs>	 (PS1) Mvolz: Revert "Warn if 'preprint', 'dataset', or 'standard' key is missing" [extensions/Citoid] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1113948 (https://phabricator.wikimedia.org/T384661)
[07:39:11] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/Citoid] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1113948 (https://phabricator.wikimedia.org/T384661) (owner: Mvolz)
[07:42:11] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[07:43:17] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1152 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:44:35] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1175 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:47:00] <wikibugs>	 (PS1) Marostegui: mariadb: Remove es1022 [puppet] - https://gerrit.wikimedia.org/r/1113949 (https://phabricator.wikimedia.org/T384566)
[07:47:13] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 244.11 ms
[07:47:39] <Jhs>	 thcipriani: brennen: help! I'd like to do an emergency deploy for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Citoid/+/1113948/ -- context is T384661
[07:47:40] <stashbot>	 T384661: Citoid's automatic reference feature is broken in multiple wikis - https://phabricator.wikimedia.org/T384661
[07:48:39] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts es1022.eqiad.wmnet
[07:48:47] <Jhs>	 (following the template in Deployments/Emergencies. Not sure if it's UBN level, but it is a pretty prominent feature that many rely on, that would stay broken for most languages over the whole weekend if it isn't deployed now.)
[07:49:33] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Remove es1022 [puppet] - https://gerrit.wikimedia.org/r/1113949 (https://phabricator.wikimedia.org/T384566) (owner: Marostegui)
[07:51:50] <logmsgbot>	 !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1225.eqiad.wmnet with reason: os upgrade
[07:53:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2022.codfw.wmnet with OS bookworm
[07:53:50] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10491320 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2022.codfw.wmnet with OS bookworm
[07:54:15] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[07:54:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10491322 (phaultfinder)
[07:55:19] <icinga-wm>	 RECOVERY - Disk space on grafana2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops
[07:55:33] <wikibugs>	 (CR) Jcrespo: [C:+1] backup1002.cnf.erb: Replace es1022 with es1043 [puppet] - https://gerrit.wikimedia.org/r/1113913 (https://phabricator.wikimedia.org/T382569) (owner: Marostegui)
[07:57:53] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1022.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[07:58:45] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1022.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[07:58:45] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:58:46] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es1022.eqiad.wmnet
[07:58:56] <wikibugs>	 ops-eqiad, DBA, DC-Ops, decommission-hardware: decommission es1022.eqiad.wmnet - https://phabricator.wikimedia.org/T384566#10491323 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1002 for hosts: `es1022.eqiad.wmnet` - es1022.eqiad.wmnet (**PASS**)   - Downtimed...
[07:58:58] <wikibugs>	 ops-eqiad, DBA, DC-Ops, decommission-hardware: decommission es1022.eqiad.wmnet - https://phabricator.wikimedia.org/T384566#10491324 (Marostegui) a:Marostegui→None
[07:59:07] <wikibugs>	 ops-eqiad, DBA, DC-Ops, decommission-hardware: decommission es1022.eqiad.wmnet - https://phabricator.wikimedia.org/T384566#10491330 (Marostegui) This is ready for #dc-ops
[08:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250124T0800)
[08:03:17] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1152 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[08:04:47] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.hosts.reimage for host db1225.eqiad.wmnet with OS bookworm
[08:04:52] <wikibugs>	 (03PS1) 10Marostegui: instances.yaml: Remove es1023 [puppet] - 10https://gerrit.wikimedia.org/r/1113950 (https://phabricator.wikimedia.org/T384679)
[08:05:31] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:06:08] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove es1023 [puppet] - 10https://gerrit.wikimedia.org/r/1113950 (https://phabricator.wikimedia.org/T384679) (owner: 10Marostegui)
[08:07:37] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1175 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[08:08:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es1044 to es5 master', diff saved to https://phabricator.wikimedia.org/P72356 and previous config saved to /var/cache/conftool/dbconfig/20250124-080804-root.json
[08:08:43] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[08:08:43] <marostegui>	 !log Remove es1023 from es5 eqiad dbmaint T384679
[08:08:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:47] <stashbot>	 T384679: decommission es1023.eqiad.wmnet - https://phabricator.wikimedia.org/T384679
[08:10:24] <wikibugs>	 (03PS1) 10Marostegui: es1023: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1113951 (https://phabricator.wikimedia.org/T384679)
[08:10:51] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] es1023: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1113951 (https://phabricator.wikimedia.org/T384679) (owner: 10Marostegui)
[08:11:03] <logmsgbot>	 !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1216.eqiad.wmnet with reason: os upgrade
[08:18:41] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[08:18:57] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 228.76 ms
[08:21:34] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1225.eqiad.wmnet with reason: host reimage
[08:25:02] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1225.eqiad.wmnet with reason: host reimage
[08:25:39] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[08:29:02] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2214.codfw.wmnet with reason: Maintenance
[08:30:10] <wikibugs>	 (03CR) 10Jelto: "The point I tried to make is: when the Kubernetes cluster is upgraded to `1.31` the `kubectl` image has to be updated as well, otherwise t" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth)
[08:30:22] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.hosts.reimage for host db1216.eqiad.wmnet with OS bookworm
[08:34:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:36:14] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[08:36:32] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:36:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1156 (T384592)', diff saved to https://phabricator.wikimedia.org/P72357 and previous config saved to /var/cache/conftool/dbconfig/20250124-083638-marostegui.json
[08:36:44] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[08:42:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2022.codfw.wmnet with reason: host reimage
[08:46:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2022.codfw.wmnet with reason: host reimage
[08:47:25] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1216.eqiad.wmnet with reason: host reimage
[08:49:08] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1225.eqiad.wmnet with OS bookworm
[08:51:26] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1216.eqiad.wmnet with reason: host reimage
[09:02:37] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.11 point update - https://phabricator.wikimedia.org/T373795#10491413 (10MoritzMuehlenhoff)
[09:05:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2022.codfw.wmnet with OS bookworm
[09:05:58] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10491415 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2022.codfw.wmnet with OS bookworm completed: - ganeti202...
[09:10:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet
[09:14:07] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1216.eqiad.wmnet with OS bookworm
[09:18:14] <wikibugs>	 (03PS1) 10Vgutierrez: liberica: Add katran config settings [puppet] - 10https://gerrit.wikimedia.org/r/1113961 (https://phabricator.wikimedia.org/T380450)
[09:18:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet
[09:20:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2022.codfw.wmnet to cluster codfw and group B
[09:21:27] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2022.codfw.wmnet to cluster codfw and group B
[09:21:59] <wikibugs>	 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 07IPv6: Enable ipv6 on ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T379890#10491430 (10MoritzMuehlenhoff)
[09:24:22] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1121 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:29:33] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch ganeti2020 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1113963
[09:32:22] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1121 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:36:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T384592)', diff saved to https://phabricator.wikimedia.org/P72358 and previous config saved to /var/cache/conftool/dbconfig/20250124-093614-marostegui.json
[09:36:19] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[09:36:50] <wikibugs>	 (03PS2) 10Vgutierrez: liberica: Add katran config settings [puppet] - 10https://gerrit.wikimedia.org/r/1113961 (https://phabricator.wikimedia.org/T380450)
[09:36:56] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] "Remember to run dbctl config commit -m once puppet has been merged" [puppet] - 10https://gerrit.wikimedia.org/r/1113849 (https://phabricator.wikimedia.org/T384480) (owner: 10Federico Ceratto)
[09:43:22] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1125 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:43:39] <logmsgbot>	 !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on netflow1002.eqiad.wmnet with reason: disabling alerts as I'm running gnmic manually rather than with systemd
[09:43:48] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10491452 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=43ff15dd-e256-46b3-aea6-882240b9fe64) set by cmooney@cumin1002 for 1:00:00 on 1 host(s) and th...
[09:50:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:51:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P72359 and previous config saved to /var/cache/conftool/dbconfig/20250124-095121-marostegui.json
[09:55:42] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:57:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:01:44] <logmsgbot>	 !log mnz@deploy2002 Started deploy [airflow-dags/research@ba61f77]: (no justification provided)
[10:01:54] <logmsgbot>	 !log mnz@deploy2002 Finished deploy [airflow-dags/research@ba61f77]: (no justification provided) (duration: 00m 12s)
[10:02:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:03:33] <icinga-wm>	 RECOVERY - BGP status on cr2-magru is OK: BGP OK - up: 83, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:05:23] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1125 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[10:05:31] <jinxer-wm>	 FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:06:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P72360 and previous config saved to /var/cache/conftool/dbconfig/20250124-100628-marostegui.json
[10:09:43] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:10:05] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:10:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:11:33] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:11:57] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:12:27] <logmsgbot>	 !log mnz@deploy2002 Started deploy [airflow-dags/research@95b14c7]: (no justification provided)
[10:13:01] <logmsgbot>	 !log mnz@deploy2002 Finished deploy [airflow-dags/research@95b14c7]: (no justification provided) (duration: 00m 43s)
[10:19:31] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "thanks a lot for adding the currently unused receivers, this looks good now!" [puppet] - 10https://gerrit.wikimedia.org/r/1113594 (owner: 10Dzahn)
[10:21:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T384592)', diff saved to https://phabricator.wikimedia.org/P72361 and previous config saved to /var/cache/conftool/dbconfig/20250124-102135-marostegui.json
[10:21:40] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[10:21:51] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[10:21:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1182 (T384592)', diff saved to https://phabricator.wikimedia.org/P72362 and previous config saved to /var/cache/conftool/dbconfig/20250124-102157-marostegui.json
[10:30:18] <wikibugs>	 (03CR) 10Jelto: [C:03+2] "I'll merge this to test the change and have a bit more visibility for current Gerrit incidents. Thanks for figuring out the route and rece" [puppet] - 10https://gerrit.wikimedia.org/r/1113594 (owner: 10Dzahn)
[10:31:39] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1149 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[10:33:39] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1149 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[10:35:09] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1148 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[10:46:09] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1148 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[10:46:15] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+1] instances.yaml: Remove db2140 [puppet] - 10https://gerrit.wikimedia.org/r/1113849 (https://phabricator.wikimedia.org/T384480) (owner: 10Federico Ceratto)
[10:47:41] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] instances.yaml: Remove db2140 [puppet] - 10https://gerrit.wikimedia.org/r/1113849 (https://phabricator.wikimedia.org/T384480) (owner: 10Federico Ceratto)
[10:50:30] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Remove db2140 from dbctl T384480', diff saved to https://phabricator.wikimedia.org/P72363 and previous config saved to /var/cache/conftool/dbconfig/20250124-105029-fceratto.json
[10:50:34] <stashbot>	 T384480: decommission db2140.codfw.wmnet - https://phabricator.wikimedia.org/T384480
[11:05:40] <wikibugs>	 (03PS1) 10Federico Ceratto: site.pp, db2140.yaml: remove db2140 [puppet] - 10https://gerrit.wikimedia.org/r/1113967 (https://phabricator.wikimedia.org/T384480)
[11:06:49] <wikibugs>	 (03Abandoned) 10Federico Ceratto: site.pp remove "Future" as db2233 is already the master [puppet] - 10https://gerrit.wikimedia.org/r/1113436 (owner: 10Federico Ceratto)
[11:18:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T384592)', diff saved to https://phabricator.wikimedia.org/P72365 and previous config saved to /var/cache/conftool/dbconfig/20250124-111834-marostegui.json
[11:18:37] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:18:39] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10491747 (10MoritzMuehlenhoff)
[11:18:39] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[11:20:29] <wikibugs>	 (03PS3) 10JMeybohm: Import upstream release 1.24.2 [debs/istioctl] - 10https://gerrit.wikimedia.org/r/1113460 (https://phabricator.wikimedia.org/T341984)
[11:23:37] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:24:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10491761 (10phaultfinder)
[11:25:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet
[11:25:51] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10491762 (10ops-monitoring-bot) Draining ganeti2020.codfw.wmnet of running VMs
[11:29:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet
[11:33:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-etcd2001.codfw.wmnet to drbd
[11:33:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P72366 and previous config saved to /var/cache/conftool/dbconfig/20250124-113341-marostegui.json
[11:33:49] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10491808 (10ops-monitoring-bot) VM ml-etcd2001.codfw.wmnet switching disk type to drbd
[11:42:00] <wikibugs>	 (03Abandoned) 10Jelto: gerrit: change blackbox checks to collaboration-services/task [puppet] - 10https://gerrit.wikimedia.org/r/1113163 (owner: 10Jelto)
[11:42:56] <wikibugs>	 (03CR) 10Jelto: "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113800 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[11:43:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ml-etcd2001.codfw.wmnet to drbd
[11:43:37] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:43:37] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:44:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet
[11:44:57] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10491906 (10ops-monitoring-bot) Draining ganeti2020.codfw.wmnet of running VMs
[11:45:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet
[11:45:48] <wikibugs>	 (03PS1) 10Kamila Součková: wikikube: rename parse10[07-12] to wikikube-worker* [puppet] - 10https://gerrit.wikimedia.org/r/1113974 (https://phabricator.wikimedia.org/T365571)
[11:46:41] <icinga-wm>	 PROBLEM - BGP status on cr2-magru is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:46:56] <wikibugs>	 (03Abandoned) 10Lucas Werkmeister (WMDE): Increase nonexistent item ID for Commons constraint checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112054 (owner: 10Lucas Werkmeister (WMDE))
[11:47:41] <icinga-wm>	 RECOVERY - BGP status on cr2-magru is OK: BGP OK - up: 79, down: 4, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:48:27] <wikibugs>	 (03PS1) 10Jelto: apt: update gitlab-ce to 17.7 [puppet] - 10https://gerrit.wikimedia.org/r/1113975 (https://phabricator.wikimedia.org/T379598)
[11:48:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-etcd2001.codfw.wmnet to plain
[11:48:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P72367 and previous config saved to /var/cache/conftool/dbconfig/20250124-114848-marostegui.json
[11:48:54] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10491938 (10ops-monitoring-bot) VM ml-etcd2001.codfw.wmnet switching disk type to plain
[11:49:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ml-etcd2001.codfw.wmnet to plain
[11:51:15] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1142 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:57:45] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1134 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[12:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250124T0800)
[12:00:05] <jouncebot>	 jelto, arnoldokoth, and mutante: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250124T1200).
[12:03:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T384592)', diff saved to https://phabricator.wikimedia.org/P72368 and previous config saved to /var/cache/conftool/dbconfig/20250124-120355-marostegui.json
[12:04:00] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[12:04:11] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[12:04:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1188 (T384592)', diff saved to https://phabricator.wikimedia.org/P72369 and previous config saved to /var/cache/conftool/dbconfig/20250124-120417-marostegui.json
[12:07:17] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10491980 (10MoritzMuehlenhoff)
[12:15:15] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1142 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[12:17:45] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1134 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[12:18:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T384592)', diff saved to https://phabricator.wikimedia.org/P72370 and previous config saved to /var/cache/conftool/dbconfig/20250124-121848-marostegui.json
[12:18:54] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[12:20:31] <Lucas_WMDE>	 FTR, I got some support voices for the emergency deploy Jhs requested in the security channel, so I’m going to go ahead with it
[12:20:56] <Lucas_WMDE>	 as Citoid is fairly important, and the risk for serious breakage from a JS change should be fairly low
[12:21:06] <Lucas_WMDE>	 if anyone objects, you have ca. 10 minutes to stop me while gate-and-submit runs :)
[12:21:14] <Jhs>	 cc mvolz too, so they're aware :)
[12:21:33] <Lucas_WMDE>	 oh, apparently there’s actually a window right now ^^
[12:21:36] <Lucas_WMDE>	 jouncebot: now
[12:21:36] <jouncebot>	 For the next 19 hour(s) and 38 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250124T0800)
[12:21:36] <jouncebot>	 For the next 0 hour(s) and 8 minute(s): GitLab version upgrades (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250124T1200)
[12:21:55] <Lucas_WMDE>	 jelto, arnoldokoth, mutante: okay for me to do a mediawiki deploy during the gitlab upgrade?
[12:22:01] <Lucas_WMDE>	 (I’m gonna assume yes unless I hear otherwise)
[12:22:20] <Lucas_WMDE>	 also cc thcipriani and brennen for the emergency deploy per due process and stuff ^^
[12:22:42] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Citoid] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113948 (https://phabricator.wikimedia.org/T384661) (owner: 10Mvolz)
[12:26:42] <wikibugs>	 (03PS1) 10JMeybohm: CI: Ensure admin checks don't run unnecessary template calls [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113979
[12:26:42] <wikibugs>	 (03PS1) 10JMeybohm: CI: Fix helm errors hiding behind YAML parser [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113980
[12:28:42] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[12:30:44] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Update cert-manager to 1.16.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113801 (owner: 10JMeybohm)
[12:30:52] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Update coredns to 1.11.3 / coredns helm chart 1.37.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113454 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[12:31:27] <effie>	 Jhs: is there a dashboard/graph where it is visible that Citoid is not working well? we may consider adding an alert there
[12:31:28] <wikibugs>	 (03CR) 10JMeybohm: [V:03+2 C:03+2] Update istio to 1.24.2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113507 (https://phabricator.wikimedia.org/T373526) (owner: 10JMeybohm)
[12:31:34] <wikibugs>	 (03CR) 10JMeybohm: [V:03+2 C:03+2] Update coredns to 1.11.3 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113445 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[12:31:54] <Jhs>	 effie, not that I know of. I became aware of the issue because sjoerd mentioned it in a Discord chat
[12:32:29] <effie>	 Jhs: in that case, it sounds like we are missing some visibility there
[12:33:05] <jelto>	 Lucas_WMDE: yes, no GitLab upgrade today
[12:33:09] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Warn if 'preprint', 'dataset', or 'standard' key is missing" [extensions/Citoid] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113948 (https://phabricator.wikimedia.org/T384661) (owner: 10Mvolz)
[12:33:10] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Pin cert-manager version on all clustes to 1.10.6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113800 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[12:33:12] <Lucas_WMDE>	 ok thanks!
[12:33:39] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1113948|Revert "Warn if 'preprint', 'dataset', or 'standard' key is missing" (T384661)]]
[12:33:43] <stashbot>	 T384661: Citoid's automatic reference feature is broken in multiple wikis - https://phabricator.wikimedia.org/T384661
[12:33:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P72371 and previous config saved to /var/cache/conftool/dbconfig/20250124-123355-marostegui.json
[12:33:57] <Lucas_WMDE>	 I’m not seeing any useful-looking charts in https://grafana.wikimedia.org/d/NJkCVermz/citoid?orgId=1&refresh=5m&from=now-2d&to=now
[12:34:16] <Lucas_WMDE>	 (but I think that’s also the dashboard for Citoid the service rather than Citoid the extension? not sure)
[12:34:53] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] wikikube: rename parse10[07-12] to wikikube-worker* [puppet] - 10https://gerrit.wikimedia.org/r/1113974 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková)
[12:35:41] <Jhs>	 effie, the change that's being reverted has two effects: 1: disable the automatic tab (bad), but 2: add an mw.log.warn(). Maybe that mw.log is actually logged somewhere and can be monitored?
[12:36:04] <Jhs>	 line 52 here: https://gerrit.wikimedia.org/g/mediawiki/extensions/Citoid/+/c6c50c6f9075c23e3735f3b422f1b8ae9866d44e/modules/ve/ve.ui.Citoid.init.js
[12:37:04] <Lucas_WMDE>	 I doubt it… I think mw.log.warn is really just the same as console.warn https://doc.wikimedia.org/mediawiki-core/master/js/startup_mediawiki.js.html#line149
[12:37:38] <Lucas_WMDE>	 (it’s not like mw.logdeprecate / makeDeprecated which includes some tracking)
[12:37:46] <Lucas_WMDE>	 (*mw.log.deprecate)
[12:38:31] <Jhs>	 mhm, right
[12:38:33] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 mvolz, lucaswerkmeister-wmde: Backport for [[gerrit:1113948|Revert "Warn if 'preprint', 'dataset', or 'standard' key is missing" (T384661)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[12:38:37] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:38:45] <Lucas_WMDE>	 Jhs: can you test the revert on WikimediaDebug?
[12:39:04] <Lucas_WMDE>	 (if there’s still a wiki that didn’t fix the config – I only see the fixed wikis in the task description)
[12:40:02] <Jhs>	 Lucas_WMDE, tested, works like it should 👍 
[12:40:05] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 mvolz, lucaswerkmeister-wmde: Continuing with sync
[12:40:07] <Lucas_WMDE>	 nice \o/
[12:40:57] <Jhs>	 Tested on elwiki and jvwiki: both had the automatic tab disabled before I switched on WikimediaDebug, and it was working again afterwards. Also tested on nowiki as a control, just to check that it still works there too. So all good 👍
[12:40:59] <wikibugs>	 (Merged) jenkins-bot: Pin cert-manager version on all clustes to 1.10.6 [deployment-charts] - https://gerrit.wikimedia.org/r/1113800 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[12:41:05] <effie>	 Jhs: I will reply on the relevant task, but it sounds like it is not ideal to have severe issues such as this one going undetected until someone complains
[12:41:15] <wikibugs>	 (Merged) jenkins-bot: Update cert-manager to 1.16.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1113801 (owner: JMeybohm)
[12:41:17] <Jhs>	 effie, mhm, agree
[12:41:30] <wikibugs>	 (Merged) jenkins-bot: Update coredns to 1.11.3 / coredns helm chart 1.37.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1113454 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[12:42:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:46:43] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1113948|Revert "Warn if 'preprint', 'dataset', or 'standard' key is missing" (T384661)]] (duration: 13m 04s)
[12:46:48] <stashbot>	 T384661: Citoid's automatic reference feature is broken in multiple wikis - https://phabricator.wikimedia.org/T384661
[12:46:49] * Lucas_WMDE done deploying
[12:49:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P72372 and previous config saved to /var/cache/conftool/dbconfig/20250124-124902-marostegui.json
[12:49:13] <wikibugs>	 (PS1) Cyndywikime: Add configurable MinimumTasksPerTopic [mediawiki-config] - https://gerrit.wikimedia.org/r/1113984 (https://phabricator.wikimedia.org/T383714)
[12:49:53] <effie>	 Jhs: are we aware how long this has been broken for?
[12:50:03] <wikibugs>	 (CR) CI reject: [V:-1] Add configurable MinimumTasksPerTopic [mediawiki-config] - https://gerrit.wikimedia.org/r/1113984 (https://phabricator.wikimedia.org/T383714) (owner: Cyndywikime)
[12:50:30] <Jhs>	 effie, not very long I think. Sjoerd mentioned it in a Discord yesterday. I personally used it on Monday when holding an editing course. So at most 3 days?
[12:50:57] <Jhs>	 And not on all wikis – so enwiki was not affected, because it already had the config items that became obligatory
[12:52:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:52:33] <Jhs>	 these are the wikis that would be affected, theoretically: https://global-search.toolforge.org/?q=.&regex=1&namespaces=8&title=Citoid-template-type-map.json (But not all of them were in practice, because that json page doesn't actually have an effect on all of them – some of the ones I tested didn't have the automatic/manual/reuse tabs at all, but rather just a dropdown. I'm not sure why.)
[12:58:37] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:01:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: Check link from msw1-eqiad et-0/1/0 to msw2-eqiad et-0/1/0 - https://phabricator.wikimedia.org/T384708 (cmooney) NEW p:Triage→Low
[13:03:37] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:04:06] <wikibugs>	 (CR) AOkoth: "Oooh. I should probably change this to the `latest` tag then and probably change the pull policy to always. From my conversation with Alex" [deployment-charts] - https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[13:04:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T384592)', diff saved to https://phabricator.wikimedia.org/P72373 and previous config saved to /var/cache/conftool/dbconfig/20250124-130409-marostegui.json
[13:04:14] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[13:04:25] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[13:04:28] <wikibugs>	 (PS1) Jelto: gerrit: block Digital Ocean IP for scraping [puppet] - https://gerrit.wikimedia.org/r/1113986 (https://phabricator.wikimedia.org/T384706)
[13:04:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1197 (T384592)', diff saved to https://phabricator.wikimedia.org/P72374 and previous config saved to /var/cache/conftool/dbconfig/20250124-130431-marostegui.json
[13:05:27] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1113986 (https://phabricator.wikimedia.org/T384706) (owner: Jelto)
[13:05:49] <wikibugs>	 (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4860/co" [puppet] - https://gerrit.wikimedia.org/r/1113986 (https://phabricator.wikimedia.org/T384706) (owner: Jelto)
[13:06:14] <wikibugs>	 (CR) Jelto: [V:+1 C:+2] gerrit: block Digital Ocean IP for scraping [puppet] - https://gerrit.wikimedia.org/r/1113986 (https://phabricator.wikimedia.org/T384706) (owner: Jelto)
[13:06:42] <wikibugs>	 (PS11) AOkoth: miscweb: support os-reports deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794)
[13:09:14] <wikibugs>	 (PS12) AOkoth: miscweb: support os-reports deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794)
[13:13:13] <mvolz>	 Thank you Lucas_WMDE and Jhs :). effie: It was broken on all wikis except for enwiki for however long 1.44.0-wmf.13 was deployed. So, 3 days for group 0, 2 days for group 1, about 1 day for group 2. And I agree there is a process problem here.
[13:13:19] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host parse[1007-1012].eqiad.wmnet
[13:13:26] <wikibugs>	 (CR) Cathal Mooney: [C:+2] Fr-tech provision script to assign IPs and switch ports (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/1092895 (https://phabricator.wikimedia.org/T379553) (owner: Cathal Mooney)
[13:13:54] <wikibugs>	 (PS13) AOkoth: miscweb: support os-reports deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794)
[13:14:20] <wikibugs>	 (CR) Kamila Součková: [C:+2] wikikube: rename parse10[07-12] to wikikube-worker* [puppet] - https://gerrit.wikimedia.org/r/1113974 (https://phabricator.wikimedia.org/T365571) (owner: Kamila Součková)
[13:15:27] <wikibugs>	 (CR) CI reject: [V:-1] Fr-tech provision script to assign IPs and switch ports [software/netbox-extras] - https://gerrit.wikimedia.org/r/1092895 (https://phabricator.wikimedia.org/T379553) (owner: Cathal Mooney)
[13:16:53] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host parse[1007-1012].eqiad.wmnet
[13:17:33] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1007 to wikikube-worker1148
[13:17:54] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[13:18:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T384592)', diff saved to https://phabricator.wikimedia.org/P72375 and previous config saved to /var/cache/conftool/dbconfig/20250124-131812-marostegui.json
[13:18:17] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[13:18:37] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv
[13:18:38] <icinga-wm>	 e - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:18:38] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv
[13:18:38] <icinga-wm>	 e - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:20:22] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1113975 (https://phabricator.wikimedia.org/T379598) (owner: Jelto)
[13:20:31] <wikibugs>	 (PS10) Cathal Mooney: Fr-tech provision script to assign IPs and switch ports [software/netbox-extras] - https://gerrit.wikimedia.org/r/1092895 (https://phabricator.wikimedia.org/T379553)
[13:21:26] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1007 to wikikube-worker1148 - kamila@cumin1002"
[13:21:40] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1008 to wikikube-worker1149
[13:21:45] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1007 to wikikube-worker1148 - kamila@cumin1002"
[13:21:45] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:21:45] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1148
[13:22:00] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[13:22:47] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1113752 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[13:23:01] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1148
[13:23:40] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1007 to wikikube-worker1148
[13:24:58] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops, Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10492166 (elukey) From the megacli's perspective, the drive was `Unconfigured (bad)` and I was able to make it `Good` but then, as exp...
[13:25:44] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1008 to wikikube-worker1149 - kamila@cumin1002"
[13:25:59] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1009 to wikikube-worker1150
[13:26:04] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1008 to wikikube-worker1149 - kamila@cumin1002"
[13:26:04] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:26:04] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1149
[13:26:19] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[13:27:18] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1149
[13:27:57] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1008 to wikikube-worker1149
[13:29:48] <wikibugs>	 (CR) Cathal Mooney: [C:+2] Fr-tech provision script to assign IPs and switch ports [software/netbox-extras] - https://gerrit.wikimedia.org/r/1092895 (https://phabricator.wikimedia.org/T379553) (owner: Cathal Mooney)
[13:29:59] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1009 to wikikube-worker1150 - kamila@cumin1002"
[13:30:16] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1009 to wikikube-worker1150 - kamila@cumin1002"
[13:30:17] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:30:17] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1150
[13:30:23] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1010 to wikikube-worker1151
[13:30:43] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[13:30:43] <logmsgbot>	 !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1216.eqiad.wmnet with reason: rebuilding tables
[13:31:29] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1150
[13:31:45] <wikibugs>	 (Merged) jenkins-bot: Fr-tech provision script to assign IPs and switch ports [software/netbox-extras] - https://gerrit.wikimedia.org/r/1092895 (https://phabricator.wikimedia.org/T379553) (owner: Cathal Mooney)
[13:32:08] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1009 to wikikube-worker1150
[13:32:40] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on parse1011:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:33:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P72376 and previous config saved to /var/cache/conftool/dbconfig/20250124-133319-marostegui.json
[13:33:22] <wikibugs>	 (CR) Jelto: "I'd say usage of `latest` should be avoided in production, we just have to rebuild and bump the image version to make the image compatible" [deployment-charts] - https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[13:33:35] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[13:33:37] <logmsgbot>	 !log cmooney@cumin1002 END (ERROR) - Cookbook sre.netbox.update-extras (exit_code=97) rolling restart_daemons on A:netbox-canary
[13:34:03] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[13:34:15] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[13:34:20] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1010 to wikikube-worker1151 - kamila@cumin1002"
[13:34:33] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1011 to wikikube-worker1152
[13:34:37] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1010 to wikikube-worker1151 - kamila@cumin1002"
[13:34:38] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:34:38] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1151
[13:34:54] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[13:35:47] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1151
[13:36:26] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1010 to wikikube-worker1151
[13:37:00] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops, Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10492194 (elukey) I also tried via the BMC's webui, that interestingly shows the disk in `Unconfigured (bad)` state (and I cannot neit...
[13:38:37] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:38:40] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1011 to wikikube-worker1152 - kamila@cumin1002"
[13:38:52] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1012 to wikikube-worker1153
[13:38:56] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1011 to wikikube-worker1152 - kamila@cumin1002"
[13:38:56] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:38:57] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1152
[13:39:12] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[13:40:04] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1152
[13:40:43] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1011 to wikikube-worker1152
[13:41:01] <wikibugs>	 SRE, Data-Platform-SRE, superset.wikimedia.org: Degraded Superset functionality during a high-traffic incident - https://phabricator.wikimedia.org/T384301#10492210 (Gehel) p:Triage→Medium
[13:41:30] <wikibugs>	 (Abandoned) Nikerabbit: RecentChangesTranslationFilterHookHandler: Replace call to deprecated ChangeTags::getDisplayTableName() [extensions/Translate] (wmf/1.44.0-wmf.2) - https://gerrit.wikimedia.org/r/1088329 (https://phabricator.wikimedia.org/T379150) (owner: Jforrester)
[13:41:37] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1013.eqiad.wmnet with OS bullseye
[13:42:37] <wikibugs>	 SRE, Infrastructure-Foundations: sre.netbox.update-extras hits KeyError with logging - https://phabricator.wikimedia.org/T379072#10492221 (cmooney) >>! In T379072#10295326, @Volans wrote: > In our netbox config we have for the logging formatters: > ` > 'django.server': { >         '()': 'django.utils.log...
[13:42:46] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1012 to wikikube-worker1153 - kamila@cumin1002"
[13:43:03] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1012 to wikikube-worker1153 - kamila@cumin1002"
[13:43:03] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:43:03] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1153
[13:43:37] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:43:53] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[13:44:10] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1153
[13:44:24] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[13:44:49] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1012 to wikikube-worker1153
[13:45:04] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1148.eqiad.wmnet wikikube-worker1149.eqiad.wmnet wikikube-worker1150.eqiad.wmnet wikikube-worker1151.eqiad.wmnet wikikube-worker1152.eqiad.wmnet wikikube-worker1153.eqiad.wmnet on all recursors
[13:45:07] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1148.eqiad.wmnet wikikube-worker1149.eqiad.wmnet wikikube-worker1150.eqiad.wmnet wikikube-worker1151.eqiad.wmnet wikikube-worker1152.eqiad.wmnet wikikube-worker1153.eqiad.wmnet on all recursors
[13:48:05] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1148.eqiad.wmnet with OS bookworm
[13:48:09] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1149.eqiad.wmnet with OS bookworm
[13:48:10] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1148
[13:48:11] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1148
[13:48:12] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1149
[13:48:12] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1149
[13:48:18] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1150.eqiad.wmnet with OS bookworm
[13:48:21] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1150
[13:48:21] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1150
[13:48:25] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1151.eqiad.wmnet with OS bookworm
[13:48:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P72378 and previous config saved to /var/cache/conftool/dbconfig/20250124-134826-marostegui.json
[13:48:28] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1151
[13:48:29] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1151
[13:48:31] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1152.eqiad.wmnet with OS bookworm
[13:48:36] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1153.eqiad.wmnet with OS bookworm
[13:48:39] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1152
[13:48:39] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1152
[13:48:40] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1153
[13:48:40] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1153
[13:55:31] <wikibugs>	 (PS1) Kamila Součková: wikikube: rename parse101[3-7] to wikikube-worker* [puppet] - https://gerrit.wikimedia.org/r/1113989 (https://phabricator.wikimedia.org/T365571)
[13:56:16] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [debs/istioctl] - https://gerrit.wikimedia.org/r/1113460 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[13:57:51] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1113989 (https://phabricator.wikimedia.org/T365571) (owner: Kamila Součková)
[14:02:56] <wikibugs>	 (PS2) JMeybohm: CI: Fix helm errors hiding behind YAML parser [deployment-charts] - https://gerrit.wikimedia.org/r/1113980
[14:02:58] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [deployment-charts] - https://gerrit.wikimedia.org/r/1113474 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[14:03:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T384592)', diff saved to https://phabricator.wikimedia.org/P72379 and previous config saved to /var/cache/conftool/dbconfig/20250124-140333-marostegui.json
[14:03:36] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1148.eqiad.wmnet with reason: host reimage
[14:03:38] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[14:03:49] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1222.eqiad.wmnet with reason: Maintenance
[14:04:04] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1239.eqiad.wmnet with reason: Maintenance
[14:04:11] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1149.eqiad.wmnet with reason: host reimage
[14:04:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1222 (T384592)', diff saved to https://phabricator.wikimedia.org/P72380 and previous config saved to /var/cache/conftool/dbconfig/20250124-140410-marostegui.json
[14:04:17] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1153.eqiad.wmnet with reason: host reimage
[14:04:21] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1150.eqiad.wmnet with reason: host reimage
[14:04:27] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1152.eqiad.wmnet with reason: host reimage
[14:04:31] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1151.eqiad.wmnet with reason: host reimage
[14:05:12] <logmsgbot>	 !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1013.eqiad.wmnet with OS bullseye
[14:05:28] <wikibugs>	 (PS3) JMeybohm: CI: Fix helm errors hiding behind YAML parser [deployment-charts] - https://gerrit.wikimedia.org/r/1113980
[14:05:37] <wikibugs>	 (CR) JMeybohm: [V:+2 C:+2] Update cert-manager to 1.16.3 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1113752 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[14:05:51] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1013.eqiad.wmnet with OS bullseye
[14:07:08] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1148.eqiad.wmnet with reason: host reimage
[14:07:56] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet
[14:10:16] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1149.eqiad.wmnet with reason: host reimage
[14:11:54] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet
[14:13:42] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1153.eqiad.wmnet with reason: host reimage
[14:17:38] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1150.eqiad.wmnet with reason: host reimage
[14:18:36] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+1] "+1 on the configuration part. As for the $REASONS, I think this predated the introduction of discovery.wmnet DNS RRs to begin with. IIRC, " [puppet] - https://gerrit.wikimedia.org/r/1113562 (https://phabricator.wikimedia.org/T383946) (owner: Dzahn)
[14:22:13] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1152.eqiad.wmnet with reason: host reimage
[14:24:05] <wikibugs>	 (PS1) Muehlenhoff: Remove access for sstefanova [puppet] - https://gerrit.wikimedia.org/r/1113993
[14:24:40] <wikibugs>	 (CR) CI reject: [V:-1] Remove access for sstefanova [puppet] - https://gerrit.wikimedia.org/r/1113993 (owner: Muehlenhoff)
[14:25:32] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1148.eqiad.wmnet with OS bookworm
[14:25:35] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1151.eqiad.wmnet with reason: host reimage
[14:26:51] <wikibugs>	 (PS2) Muehlenhoff: Remove access for sstefanova [puppet] - https://gerrit.wikimedia.org/r/1113993
[14:28:15] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove access for sstefanova [puppet] - https://gerrit.wikimedia.org/r/1113993 (owner: Muehlenhoff)
[14:29:27] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1149.eqiad.wmnet with OS bookworm
[14:29:43] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10492343 (phaultfinder)
[14:31:42] <wikibugs>	 ops-esams, ops-magru, SRE, DC-Ops, Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10492346 (RobH) >>! In T373993#10490884, @BCornwall wrote: > <cut for brevity> > The first dip on all the hosts was unrelated to anything I did - not sure what happened t...
[14:33:12] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1153.eqiad.wmnet with OS bookworm
[14:36:03] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install fransc1001 - https://phabricator.wikimedia.org/T367814#10492369 (cmooney) @VRiley-WMF ok so after a bit more back-and-forth I think we can finally trial this new script and see how it works:  https://netbox-next.wikimedia.org/ext...
[14:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:36:55] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1150.eqiad.wmnet with OS bookworm
[14:40:00] <logmsgbot>	 !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1013.eqiad.wmnet with OS bullseye
[14:43:25] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1152.eqiad.wmnet with OS bookworm
[14:45:17] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1151.eqiad.wmnet with OS bookworm
[14:45:47] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service ml-serve-ctrl2001:6443 has failed probes (http_ml_serve_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ml-serve-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:45:54] <wikibugs>	 (PS1) Tsevener: Add ios.article_link_interaction stream to config [mediawiki-config] - https://gerrit.wikimedia.org/r/1113996 (https://phabricator.wikimedia.org/T382031)
[14:46:07] <Emperor>	 !incidents
[14:46:08] <sirenbot>	 5627 (UNACKED)  [2x] ProbeDown sre (ml-serve-ctrl2001:6443 probes/custom codfw)
[14:46:12] <Emperor>	 !ack 5627
[14:46:13] <sirenbot>	 5627 (ACKED)  [2x] ProbeDown sre (ml-serve-ctrl2001:6443 probes/custom codfw)
[14:46:35] <Emperor>	 well, another null runbook link :-/
[14:46:47] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet']
[14:47:31] <wikibugs>	 (CR) Marostegui: [C:+1] "To be run AFTER the decommissioning script has successfully finished." [puppet] - https://gerrit.wikimedia.org/r/1113967 (https://phabricator.wikimedia.org/T384480) (owner: Federico Ceratto)
[14:50:47] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service ml-serve-ctrl2001:6443 has failed probes (http_ml_serve_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ml-serve-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:51:12] <Emperor>	 OK, I wasn't getting very far with that (kube_env ml-serv codfw didn't work), but it seems to have resolved itself...
[14:52:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T384592)', diff saved to https://phabricator.wikimedia.org/P72381 and previous config saved to /var/cache/conftool/dbconfig/20250124-145222-marostegui.json
[14:52:27] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[14:53:47] <wikibugs>	 (PS1) Sergio Gimeno: beta: increase growth tasks lookahead size [mediawiki-config] - https://gerrit.wikimedia.org/r/1113997 (https://phabricator.wikimedia.org/T325990)
[14:54:11] <logmsgbot>	 !log andrew@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet']
[14:55:13] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1013.eqiad.wmnet with OS bullseye
[14:55:31] <wikibugs>	 (PS33) Arnaudb: gitlab_runner: migrate ferm rules to nftables [puppet] - https://gerrit.wikimedia.org/r/1109726 (https://phabricator.wikimedia.org/T370677)
[14:55:31] <wikibugs>	 (CR) Arnaudb: "This patch is supposed to be idempotent with the current state of firewall on runners. It adds things in `modules/nftables`, `modules/fire" [puppet] - https://gerrit.wikimedia.org/r/1109726 (https://phabricator.wikimedia.org/T370677) (owner: Arnaudb)
[14:56:16] <wikibugs>	 (CR) Jelto: "lgtm" [deployment-charts] - https://gerrit.wikimedia.org/r/1113979 (owner: JMeybohm)
[15:00:09] <wikibugs>	 (PS2) Sergio Gimeno: beta: increase growth tasks lookahead size [mediawiki-config] - https://gerrit.wikimedia.org/r/1113997 (https://phabricator.wikimedia.org/T325990)
[15:01:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:02:19] <logmsgbot>	 !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1013.eqiad.wmnet with OS bullseye
[15:02:27] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet']
[15:03:29] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1113996 (https://phabricator.wikimedia.org/T382031) (owner: Tsevener)
[15:05:47] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service ml-serve-ctrl2002:6443 has failed probes (http_ml_serve_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ml-serve-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:06:07] <Emperor>	 OK, it's paging again, anyone know about this service?
[15:06:28] <sobanski>	 Perhaps btullis? 
[15:06:35] <Emperor>	 it's supposedly a kubernetes thing, but if I check srv/deployment-charts/helmfile.d/services on deploy2002 there's not anything called ml-serv there
[15:06:41] <sobanski>	 Or klausman 
[15:06:57] <Emperor>	 so I can't even run kube_env to get to the point where I might list pods or anything
[15:07:02] <sukhe>	 those would be my two guesses as well.
[15:07:22] <klausman>	 taking a look
[15:07:23] <Emperor>	 The only docs I've found are https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters#ml-serve which aren't very enlightening
[15:07:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P72382 and previous config saved to /var/cache/conftool/dbconfig/20250124-150729-marostegui.json
[15:07:44] <Emperor>	 https://wikitech.wikimedia.org/wiki/Runbook#ml-serve-ctrl2002:6443 doesn't exist
[15:08:38] <akosiaris>	 Emperor: it's the control-plane for the ml-serve cluster
[15:08:44] <klausman>	 the kubelet on that machine restarted
[15:08:46] <klausman>	      Active: active (running) since Fri 2025-01-24 15:05:45 UTC; 2min 27s ago
[15:08:48] <jynus>	 I was about to disconnect, but an ml host had its disk full 
[15:08:48] <akosiaris>	 and you don't want services but ml-services
[15:08:57] <jynus>	 not sure if related
[15:09:17] <klausman>	 probably lab related (I suspect ml-lab1001?)
[15:09:23] <jynus>	 ml-lab1001
[15:09:28] <jynus>	 DISK CRITICAL - free space: /srv 14564MiB (3% inode=94%):
[15:09:31] <klausman>	 yeah, that won't affect prod
[15:09:34] <jynus>	 ok
[15:09:37] <klausman>	 ty!
[15:09:38] <akosiaris>	 srv/deployment-charts/helmfile.d/ml-services 
[15:09:40] <akosiaris>	 that is 
[15:09:57] <Emperor>	 Oh.
[15:10:16] <akosiaris>	 I am not sure why we are paging for a single host of the control-plane though tbh
[15:10:33] <akosiaris>	 it has 3 fwiw
[15:10:47] <akosiaris>	 or 2 perhaps, I think ml is with 2
[15:10:47] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service ml-serve-ctrl2002:6443 has failed probes (http_ml_serve_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ml-serve-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:11:16] <klausman>	 Huh...
[15:11:19] <klausman>	 Error while processing event ("/sys/fs/cgroup/system.slice/clean-confd-rundir.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/system.slice/clean-confd-rundir.service: no such file or directory
[15:11:21] <jynus>	 is it under control or is it flapping?
[15:11:32] <Emperor>	 it's been down-then-up twice recently
[15:11:34] <Emperor>	 !incidents
[15:11:34] <sirenbot>	 5628 (RESOLVED)  [2x] ProbeDown sre (ml-serve-ctrl2002:6443 probes/custom codfw)
[15:11:34] <sirenbot>	 5627 (RESOLVED)  [2x] ProbeDown sre (ml-serve-ctrl2001:6443 probes/custom codfw)
[15:11:38] <jynus>	 :-(
[15:11:53] <klausman>	 I'm keeping an eye on its logs
[15:12:19] <klausman>	 so far no restarts beyond the one at 15:05:45 UTC
[15:12:40] <wikibugs>	 (CR) Kamila Součková: [C:+2] wikikube: rename parse101[3-7] to wikikube-worker* [puppet] - https://gerrit.wikimedia.org/r/1113989 (https://phabricator.wikimedia.org/T365571) (owner: Kamila Součková)
[15:12:41] <Emperor>	 klausman: previous page was at 14:45
[15:12:44] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host parse[1013-1017].eqiad.wmnet
[15:13:33] <klausman>	 Yeah, that was similar (errors about stuff like "/sys/fs/cgroup/system.slice/clean-confd-rundir.service" not existing, and then giving up because it couldn't update the lease)
[15:14:42] <klausman>	 disregard, the cgroup errors are too time-distant to be relevant
[15:15:17] <klausman>	 The relevant errors are about being unable to talk to the service (10.2.1.39:6443), getting an ECONN
[15:16:12] <akosiaris>	 it's from both hosts apparently. both -ctrl2001 and -ctrl2002 complained. And at around the same time
[15:17:29] <klausman>	 Not clear what would cause it. Maybe a network blip? They're VMs. Moritz made one of the associated etcd machines (a VM) be slightly higher latency due to DRBD during a VM move, but that was hours before
[15:17:37] <akosiaris>	 https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&var-datasource=thanos&var-site=codfw&var-cluster=k8s-mlserve&from=now-1h&to=now
[15:17:49] <akosiaris>	 it's visible in the graphs here
[15:18:09] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host parse[1013-1017].eqiad.wmnet
[15:18:15] <akosiaris>	 work latencies and api latencies jumped to way higher levels
[15:18:17] <wikibugs>	 (PS1) JMeybohm: k8s.pool-depool-node: Add support to downtime/remove downtime [cookbooks] - https://gerrit.wikimedia.org/r/1114000
[15:18:50] <inflatador>	 is there a way to track VM migrations in ganeti (or otherwise)?
[15:19:03] <akosiaris>	 the autoregister controller spiked to 10s
[15:19:40] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet']
[15:19:51] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1013 to wikikube-worker1154
[15:20:11] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[15:20:25] <akosiaris>	 inflatador: https://wikitech.wikimedia.org/wiki/Ganeti#View_the_job_queue
[15:20:35] <akosiaris>	 first command to find the job that is the migration you care for
[15:20:42] <akosiaris>	 second to get an overview of how it is going
[15:20:45] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv
[15:20:45] <icinga-wm>	 e - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:20:45] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv
[15:20:45] <icinga-wm>	 e - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:21:05] <inflatador>	 akosiaris thanks...getting to relive my former life as virt engineer ;)
[15:21:33] <akosiaris>	 :-). I assume I don't need to tell you then about the qemu monitor socket
[15:21:57] <akosiaris>	 but an info migrate command there should give you all the nitty gritty details
[15:22:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P72383 and previous config saved to /var/cache/conftool/dbconfig/20250124-152236-marostegui.json
[15:22:41] <akosiaris>	 klausman: for the first page, etcd request latencies skyrocketed to 15s, see https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&var-datasource=thanos&var-site=codfw&var-cluster=k8s-mlserve&from=now-1h&to=now&viewPanel=28
[15:23:18] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1013.eqiad.wmnet with OS bullseye
[15:23:51] <klausman>	 akosiaris: but why would it then cause a flipflop hours later?
[15:24:08] <akosiaris>	 hours? I count 20 minutes
[15:24:18] <wikibugs>	 (CR) CI reject: [V:-1] k8s.pool-depool-node: Add support to downtime/remove downtime [cookbooks] - https://gerrit.wikimedia.org/r/1114000 (owner: JMeybohm)
[15:24:18] <klausman>	 gah, UTC and DST and oh my
[15:24:36] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1013 to wikikube-worker1154 - kamila@cumin1002"
[15:24:40] <akosiaris>	 lol, you can say that again. 
[15:24:51] <klausman>	 So according to my IRC logs, Moritz did the move around 1300 my time, and it's now 1600
[15:25:01] <klausman>	 well, 1624
[15:25:05] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1014 to wikikube-worker1155
[15:25:09] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1013 to wikikube-worker1154 - kamila@cumin1002"
[15:25:09] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:25:10] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1154
[15:25:15] <klausman>	 but these alerts are more recent, no?
[15:25:26] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[15:25:28] <arnoldokoth>	 Should I open an IC doc?
[15:25:39] <akosiaris>	 First one ~45 minutes old
[15:25:50] <akosiaris>	 second is ~20 minutes old
[15:25:59] <klausman>	 that'd still be over two hours from VM move to first flop
[15:26:23] <akosiaris>	 the control plane VMs aren't collocated with the etcd VMs from what I remember
[15:26:27] <wikibugs>	 (PS2) JMeybohm: k8s.pool-depool-node: Add support to downtime/remove downtime [cookbooks] - https://gerrit.wikimedia.org/r/1114000
[15:26:35] <akosiaris>	 but the etcd VMs also can't be migrated around anyway
[15:26:40] <inflatador>	 akosiaris actually, I don't think I've ever used that directly. Probably a crappy xenserver command that interfaces with it instead ;)
[15:26:45] <akosiaris>	 they just get rebooted when needed
[15:27:14] <akosiaris>	 which the etcd protocol should account for as well
[15:27:38] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2227,2229-2230].codfw.wmnet
[15:27:45] <wikibugs>	 ops-codfw, SRE, collaboration-services, Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10492502 (ops-monitoring-bot) depool host wikikube-worker[2227,2229-2230].codfw.wmnet by jayme@cumin1002 with...
[15:27:54] <logmsgbot>	 !log jayme@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker[2227,2229-2230].codfw.wmnet with reason: Depooled via sre.k8s.pool-depool-node
[15:27:59] <klausman>	 yeah, and there's three, once they have quorum, two can handle things
[15:28:39] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1154
[15:29:03] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1014 to wikikube-worker1155 - kamila@cumin1002"
[15:29:18] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1013 to wikikube-worker1154
[15:29:32] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1014 to wikikube-worker1155 - kamila@cumin1002"
[15:29:32] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:29:32] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1155
[15:29:32] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1015 to wikikube-worker1156
[15:29:40] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2227,2229-2230].codfw.wmnet
[15:29:52] <wikibugs>	 ops-codfw, SRE, collaboration-services, Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10492513 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin1002 depool fo...
[15:29:53] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[15:30:00] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install fransc1001 - https://phabricator.wikimedia.org/T367814#10492525 (cmooney)
[15:30:40] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on parse1016:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[15:32:03] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1155
[15:32:18] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install fransc1001 - https://phabricator.wikimedia.org/T367814#10492529 (cmooney)
[15:32:39] <wikibugs>	 (CR) CI reject: [V:-1] k8s.pool-depool-node: Add support to downtime/remove downtime [cookbooks] - https://gerrit.wikimedia.org/r/1114000 (owner: JMeybohm)
[15:32:42] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1014 to wikikube-worker1155
[15:32:48] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install fransc1001 - https://phabricator.wikimedia.org/T367814#10492531 (cmooney)
[15:33:07] <icinga-wm>	 PROBLEM - BGP status on lsw1-d3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:33:33] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1015 to wikikube-worker1156 - kamila@cumin1002"
[15:33:37] <icinga-wm>	 PROBLEM - BGP status on lsw1-c6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:34:23] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install fransc1001 - https://phabricator.wikimedia.org/T367814#10492540 (cmooney)
[15:35:29] <jayme>	 kubernetes-codfw BGP errors expected, https://phabricator.wikimedia.org/T383709
[15:36:14] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1016 to wikikube-worker1157
[15:36:14] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1015 to wikikube-worker1156 - kamila@cumin1002"
[15:36:15] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:36:15] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1156
[15:36:39] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[15:36:46] <wikibugs>	 (PS2) Cyndywikime: Add configurable MinimumTasksPerTopic [mediawiki-config] - https://gerrit.wikimedia.org/r/1113984 (https://phabricator.wikimedia.org/T383714)
[15:37:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T384592)', diff saved to https://phabricator.wikimedia.org/P72384 and previous config saved to /var/cache/conftool/dbconfig/20250124-153743-marostegui.json
[15:37:48] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[15:37:56] <wikibugs>	 (PS1) Urbanecm: [testwiki] Babel: Enable CommunityConfiguration integration [mediawiki-config] - https://gerrit.wikimedia.org/r/1114002 (https://phabricator.wikimedia.org/T374348)
[15:37:58] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1229.eqiad.wmnet with reason: Maintenance
[15:38:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1229 (T384592)', diff saved to https://phabricator.wikimedia.org/P72385 and previous config saved to /var/cache/conftool/dbconfig/20250124-153805-marostegui.json
[15:38:16] <wikibugs>	 (CR) Urbanecm: [C:-2] "not yet" [mediawiki-config] - https://gerrit.wikimedia.org/r/1114002 (https://phabricator.wikimedia.org/T374348) (owner: Urbanecm)
[15:38:59] <logmsgbot>	 !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1013.eqiad.wmnet with OS bullseye
[15:39:03] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet']
[15:39:37] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1156
[15:40:15] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1015 to wikikube-worker1156
[15:40:33] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1017 to wikikube-worker1158
[15:40:48] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1016 to wikikube-worker1157 - kamila@cumin1002"
[15:41:11] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1016 to wikikube-worker1157 - kamila@cumin1002"
[15:41:11] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:41:12] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1157
[15:41:13] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[15:42:12] <logmsgbot>	 !log root@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Slavina Stefanova out of all services on: 1010 hosts
[15:42:20] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1157
[15:42:59] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1016 to wikikube-worker1157
[15:43:12] <logmsgbot>	 !log root@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Slavina Stefanova out of all services on: 1221 hosts
[15:43:37] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:43:37] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:44:45] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1017 to wikikube-worker1158 - kamila@cumin1002"
[15:44:50] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1017 to wikikube-worker1158 - kamila@cumin1002"
[15:44:50] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:44:50] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1158
[15:45:12] <klausman>	 So the two flops were almost exactly 20m apart, and we've been 40m since the second one. Plus no errors in the kubelet logs of the two ctrl nodes. I am calling this tentatively fixed, but will keep an eye on things.
[15:45:20] <wikibugs>	 (CR) Hashar: [C:-1] "See my previous comment requesting to verify the RSA-2048 certs are no more in use and that it is indeed fine to move to ECDSA. Given that" [puppet] - https://gerrit.wikimedia.org/r/1075614 (https://phabricator.wikimedia.org/T375569) (owner: BCornwall)
[15:46:13] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1158
[15:46:52] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1017 to wikikube-worker1158
[15:46:56] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1154.eqiad.wmnet wikikube-worker1155.eqiad.wmnet wikikube-worker1156.eqiad.wmnet wikikube-worker1157.eqiad.wmnet wikikube-worker1158.eqiad.wmnet on all recursors
[15:46:59] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1154.eqiad.wmnet wikikube-worker1155.eqiad.wmnet wikikube-worker1156.eqiad.wmnet wikikube-worker1157.eqiad.wmnet wikikube-worker1158.eqiad.wmnet on all recursors
[15:47:42] <logmsgbot>	 !log andrew@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet']
[15:49:34] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1154.eqiad.wmnet with OS bookworm
[15:49:37] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1154
[15:49:37] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1154
[15:49:44] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1155.eqiad.wmnet with OS bookworm
[15:49:48] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1155
[15:49:48] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1155
[15:50:16] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1156.eqiad.wmnet with OS bookworm
[15:50:19] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1156
[15:50:19] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1156
[15:51:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T384592)', diff saved to https://phabricator.wikimedia.org/P72386 and previous config saved to /var/cache/conftool/dbconfig/20250124-155130-marostegui.json
[15:51:35] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[15:52:16] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1157.eqiad.wmnet with OS bookworm
[15:52:19] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1157
[15:52:19] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1157
[15:52:35] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1158.eqiad.wmnet with OS bookworm
[15:52:38] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1158
[15:52:39] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1158
[15:55:15] <wikibugs>	 (CR) Vgutierrez: "our CDN has removed RSA certificates already and gerrit ciphersuites configuration enforcing >=TLSv1.2 already enforces the usage or a rea" [puppet] - https://gerrit.wikimedia.org/r/1075614 (https://phabricator.wikimedia.org/T375569) (owner: BCornwall)
[15:55:40] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet']
[15:56:04] <logmsgbot>	 !log andrew@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet']
[15:56:30] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet']
[15:56:39] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet']
[15:57:19] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet']
[15:57:38] <logmsgbot>	 !log andrew@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet']
[15:57:51] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet']
[15:57:58] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet']
[15:58:32] <wikibugs>	 (PS1) Raymond Ndibe: [toolforge::harbor] use latest thirdparty/docker [puppet] - https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720)
[15:58:58] <wikibugs>	 (CR) CI reject: [V:-1] [toolforge::harbor] use latest thirdparty/docker [puppet] - https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720) (owner: Raymond Ndibe)
[15:59:58] <logmsgbot>	 !log pt1979@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet']
[16:00:20] <logmsgbot>	 !log pt1979@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet']
[16:00:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:01:15] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T383620#10492667 (kamila)
[16:02:54] <wikibugs>	 (PS2) Raymond Ndibe: [toolforge::harbor] use latest thirdparty/docker [puppet] - https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720)
[16:04:25] <logmsgbot>	 !log pt1979@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1013']
[16:04:43] <logmsgbot>	 !log pt1979@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1013']
[16:04:57] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:05:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:05:50] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1155.eqiad.wmnet with reason: host reimage
[16:06:23] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1156.eqiad.wmnet with reason: host reimage
[16:06:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:06:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P72387 and previous config saved to /var/cache/conftool/dbconfig/20250124-160637-marostegui.json
[16:07:45] <wikibugs>	 (PS2) Raymond Ndibe: [toolforge::harbor] upgrade harbor v2.10.1 ---> v2.12.2 [puppet] - https://gerrit.wikimedia.org/r/1113871 (https://phabricator.wikimedia.org/T358225)
[16:09:52] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1155.eqiad.wmnet with reason: host reimage
[16:11:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:13:33] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1156.eqiad.wmnet with reason: host reimage
[16:16:16] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2229
[16:16:24] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2229
[16:18:03] <wikibugs>	 (CR) JMeybohm: [V:+2 C:+2] Import upstream release 1.24.2 [debs/istioctl] - https://gerrit.wikimedia.org/r/1113460 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[16:19:57] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:20:28] <wikibugs>	 SRE-tools, Infrastructure-Foundations: sre.hardware.upgrade-firmware needs work - https://phabricator.wikimedia.org/T384722#10492747 (Andrew)
[16:20:39] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet']
[16:20:48] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet']
[16:21:06] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet']
[16:21:14] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet']
[16:21:40] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1158.eqiad.wmnet with reason: host reimage
[16:21:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P72388 and previous config saved to /var/cache/conftool/dbconfig/20250124-162144-marostegui.json
[16:21:47] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1157.eqiad.wmnet with reason: host reimage
[16:21:53] <wikibugs>	 SRE-tools, Infrastructure-Foundations: sre.hardware.upgrade-firmware needs work - https://phabricator.wikimedia.org/T384722#10492754 (Andrew) ` andrew@cumin1002:~$ sudo cookbook sre.hardware.upgrade-firmware --new --c nic 'cloudcephosd1013.eqiad.wmnet'  Acquired lock for key /spicerack/locks/cookbooks/sr...
[16:22:10] <wikibugs>	 (PS1) JMeybohm: Add bash-completion to Build-Depends [debs/istioctl] - https://gerrit.wikimedia.org/r/1114008 (https://phabricator.wikimedia.org/T341984)
[16:22:39] <wikibugs>	 (CR) JMeybohm: [V:+2 C:+2] Add bash-completion to Build-Depends [debs/istioctl] - https://gerrit.wikimedia.org/r/1114008 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[16:23:37] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:23:56] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2230
[16:24:04] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2230
[16:24:31] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1158.eqiad.wmnet with reason: host reimage
[16:26:27] <jayme>	 !log imported istioctl 1.24.2-1 to bullseye/bookworm-wikimedia T341984
[16:26:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:26:32] <stashbot>	 T341984: Update Kubernetes clusters to 1.31 - https://phabricator.wikimedia.org/T341984
[16:27:43] <wikibugs>	 (PS1) Elukey: kserve: add support for PSS [deployment-charts] - https://gerrit.wikimedia.org/r/1114012 (https://phabricator.wikimedia.org/T369493)
[16:28:27] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1157.eqiad.wmnet with reason: host reimage
[16:28:37] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[16:32:11] <logmsgbot>	 !log pt1979@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1013']
[16:33:22] <logmsgbot>	 !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1013']
[16:33:29] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1156.eqiad.wmnet with OS bookworm
[16:34:08] <wikibugs>	 SRE-tools, Infrastructure-Foundations: sre.hardware.upgrade-firmware needs work - https://phabricator.wikimedia.org/T384722#10492840 (Andrew) Open→Invalid papaul just tried and it worked for him, so maybe I was doing something silly?  The usage statement still needs work but I can probably fi...
[16:35:10] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1013.eqiad.wmnet with OS bullseye
[16:35:28] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2227
[16:35:36] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2227
[16:38:21] <icinga-wm>	 RECOVERY - BGP status on lsw1-d3-codfw.mgmt is OK: BGP OK - up: 32, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:38:49] <icinga-wm>	 RECOVERY - BGP status on lsw1-c6-codfw.mgmt is OK: BGP OK - up: 36, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:43:04] <wikibugs>	 (PS1) Vgutierrez: site,swift: Use profile::lvs::realserver [puppet] - https://gerrit.wikimedia.org/r/1114015 (https://phabricator.wikimedia.org/T373020)
[16:43:16] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2227,2229-2230].codfw.wmnet
[16:43:17] <wikibugs>	 ops-codfw, SRE, collaboration-services, Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10492935 (Jhancock.wm)
[16:43:19] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.remove-downtime for wikikube-worker[2227,2229-2230].codfw.wmnet
[16:43:20] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-worker[2227,2229-2230].codfw.wmnet
[16:43:21] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2227,2229-2230].codfw.wmnet
[16:43:24] <wikibugs>	 ops-codfw, SRE, collaboration-services, Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10492939 (ops-monitoring-bot) pool host wikikube-worker[2227,2229-2230].codfw.wmnet by jayme@cumin1002 with re...
[16:43:28] <wikibugs>	 ops-codfw, SRE, collaboration-services, Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10492940 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin1002 pool for...
[16:43:56] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1158.eqiad.wmnet with OS bookworm
[16:46:04] <wikibugs>	 (PS1) Elukey: knative-serving: add support for PSS [deployment-charts] - https://gerrit.wikimedia.org/r/1114016 (https://phabricator.wikimedia.org/T369493)
[16:47:12] <logmsgbot>	 !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1154.eqiad.wmnet with OS bookworm
[16:47:17] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1157.eqiad.wmnet with OS bookworm
[16:51:03] <wikibugs>	 (PS2) Vgutierrez: site,swift: Use profile::lvs::realserver [puppet] - https://gerrit.wikimedia.org/r/1114015 (https://phabricator.wikimedia.org/T373020)
[16:52:11] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1114015 (https://phabricator.wikimedia.org/T373020) (owner: Vgutierrez)
[16:54:44] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1154.eqiad.wmnet with OS bookworm
[16:54:47] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1154
[16:54:47] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1154
[16:57:35] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1155.eqiad.wmnet with OS bookworm
[17:01:04] <wikibugs>	 SRE, Infrastructure-Foundations: sre.netbox.update-extras hits KeyError with logging - https://phabricator.wikimedia.org/T379072#10493041 (elukey) The cookbook was modified in August 2024, when we moved to Netbox 4: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1056989  And we have used it regu...
[17:09:09] <wikibugs>	 (PS1) Scott French: mw-(api-ext|web): scale next to 10% of main [deployment-charts] - https://gerrit.wikimedia.org/r/1114005 (https://phabricator.wikimedia.org/T383845)
[17:10:03] <wikibugs>	 SRE, Infrastructure-Foundations: sre.netbox.update-extras hits KeyError with logging - https://phabricator.wikimedia.org/T379072#10493058 (elukey)
[17:10:21] <wikibugs>	 (PS1) Scott French: mw-on-k8s: aggregate remaining alerts by release name [alerts] - https://gerrit.wikimedia.org/r/1114018 (https://phabricator.wikimedia.org/T384532)
[17:10:25] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1154.eqiad.wmnet with reason: host reimage
[17:12:17] <wikibugs>	 SRE, Infrastructure-Foundations: sre.netbox.update-extras hits KeyError with logging - https://phabricator.wikimedia.org/T379072#10493063 (cmooney) >>! In T379072#10493041, @elukey wrote: > And we have used it regularly: https://sal.toolforge.org/production?p=0&q=sre.netbox.update-extras&d=  Yep I've use...
[17:12:39] <wikibugs>	 (CR) Scott French: "Thanks in advance for the review!" [alerts] - https://gerrit.wikimedia.org/r/1114018 (https://phabricator.wikimedia.org/T384532) (owner: Scott French)
[17:12:48] <wikibugs>	 (CR) Vgutierrez: "@mvernon@wikimedia.org let me know what you think, this is preparatory work to  migrate to IPIP inbound traffic in ms-fe instances" [puppet] - https://gerrit.wikimedia.org/r/1114015 (https://phabricator.wikimedia.org/T373020) (owner: Vgutierrez)
[17:13:56] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1154.eqiad.wmnet with reason: host reimage
[17:32:25] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1154.eqiad.wmnet with OS bookworm
[17:38:32] <wikibugs>	 ops-codfw, SRE, DC-Ops, Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T383862#10493159 (Jhancock.wm) Open→Resolved
[17:47:57] <logmsgbot>	 !log mforns@deploy2002 Started deploy [airflow-dags/analytics@ebb3680]: bump up mediawiki reduced as part of temp accounts deployment
[17:48:36] <logmsgbot>	 !log mforns@deploy2002 Finished deploy [airflow-dags/analytics@ebb3680]: bump up mediawiki reduced as part of temp accounts deployment (duration: 01m 00s)
[17:56:03] <wikibugs>	 SRE, Infrastructure-Foundations, netops, observability: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731 (cmooney) NEW p:Triage→Low
[17:57:37] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2136.codfw.wmnet
[17:59:19] <marostegui>	 !log Removing db2136 from zarcillo T384479
[17:59:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:59:23] <stashbot>	 T384479: decommission db2136.codfw.wmnet - https://phabricator.wikimedia.org/T384479
[18:03:28] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[18:04:48] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[18:05:34] <wikibugs>	 (CR) Dzahn: [C:+2] trafficserver: point spiderpig.wikimedia.org to deployment.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/1113562 (https://phabricator.wikimedia.org/T383946) (owner: Dzahn)
[18:05:38] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1154-1158].eqiad.wmnet
[18:05:40] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1154-1158].eqiad.wmnet
[18:05:44] <logmsgbot>	 !log marostegui@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[18:05:45] <logmsgbot>	 !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts db2136.codfw.wmnet
[18:08:03] <wikibugs>	 ops-codfw, DBA, DC-Ops, decommission-hardware: decommission db2136.codfw.wmnet - https://phabricator.wikimedia.org/T384479#10493236 (Marostegui) a:FCeratto-WMF→None
[18:08:38] <wikibugs>	 ops-codfw, DBA, DC-Ops, decommission-hardware: decommission db2136.codfw.wmnet - https://phabricator.wikimedia.org/T384479#10493242 (Marostegui) This is ready for #dc-ops
[18:17:54] <wikibugs>	 (CR) Dzahn: [C:+2] "Thank you, Alexandros! I have the same memory, at some point there was some technical reason for that (because I had asked before) but I c" [puppet] - https://gerrit.wikimedia.org/r/1113562 (https://phabricator.wikimedia.org/T383946) (owner: Dzahn)
[18:24:43] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10493296 (phaultfinder)
[18:33:59] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmde for Neslihan Turan - WMDE - https://phabricator.wikimedia.org/T384017#10493308 (KFrancis) Hi all, the NDA is complete.  Thanks!
[18:48:01] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1233.eqiad.wmnet with reason: Maintenance
[18:48:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1233 (T384592)', diff saved to https://phabricator.wikimedia.org/P72390 and previous config saved to /var/cache/conftool/dbconfig/20250124-184807-marostegui.json
[18:48:12] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[18:53:50] <wikibugs>	 (CR) Dzahn: [C:+2] apt: update gitlab-ce to 17.7 [puppet] - https://gerrit.wikimedia.org/r/1113975 (https://phabricator.wikimedia.org/T379598) (owner: Jelto)
[19:03:37] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:04:09] <wikibugs>	 (CR) Dzahn: [C:+2] "deployed, ran puppet on apt1002 and ran the reprepro checkupdate/update commands." [puppet] - https://gerrit.wikimedia.org/r/1113975 (https://phabricator.wikimedia.org/T379598) (owner: Jelto)
[19:06:26] <icinga-wm>	 PROBLEM - SSH on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:06:28] <icinga-wm>	 PROBLEM - rt.wikimedia.org tls expiry on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[19:08:08] <icinga-wm>	 PROBLEM - rt.wikimedia.org requires authentication on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[19:08:37] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:09:20] <icinga-wm>	 RECOVERY - rt.wikimedia.org tls expiry on moscovium is OK: OK - Certificate rt.discovery.wmnet will expire on Tue 18 Feb 2025 07:56:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[19:09:24] <icinga-wm>	 RECOVERY - SSH on moscovium is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:09:32] <wikibugs>	 (PS7) Dzahn: gerrit: fix todo from 2022, remove nist key setting [puppet] - https://gerrit.wikimedia.org/r/1064413 (https://phabricator.wikimedia.org/T315942)
[19:10:37] <wikibugs>	 (CR) Dzahn: [C:+1] "@Antoine, did you have any concerns for this one? I think we have sometimes mentally mixed this one up with the other key, the RSA key in " [puppet] - https://gerrit.wikimedia.org/r/1064413 (https://phabricator.wikimedia.org/T315942) (owner: Dzahn)
[19:12:26] <icinga-wm>	 PROBLEM - SSH on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:12:28] <icinga-wm>	 PROBLEM - rt.wikimedia.org tls expiry on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[19:12:49] <mutante>	 I am going to remove that monitoring.^
[19:13:12] <mutante>	 the service is going away soon enough to ignore that.
[19:13:18] <sukhe>	 ah!
[19:13:21] <sukhe>	 thanks
[19:13:23] <jinxer-wm>	 FIRING: ProbeDown: Service moscovium:443 has failed probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:13:35] <wikibugs>	 (PS1) Dzahn: requesttracker: remove blackbox monitoring [puppet] - https://gerrit.wikimedia.org/r/1114038
[19:15:01] <icinga-wm>	 RECOVERY - rt.wikimedia.org requires authentication on moscovium is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 536 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[19:15:16] <icinga-wm>	 RECOVERY - SSH on moscovium is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:15:18] <icinga-wm>	 RECOVERY - rt.wikimedia.org tls expiry on moscovium is OK: OK - Certificate rt.discovery.wmnet will expire on Tue 18 Feb 2025 07:56:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[19:18:23] <jinxer-wm>	 RESOLVED: ProbeDown: Service moscovium:443 has failed probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:18:46] <mutante>	 not sure I got the right monitoring check yet.. sending to LONG downtime
[19:19:01] <logmsgbot>	 !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 90 days, 0:00:00 on moscovium.eqiad.wmnet with reason: to be decomed
[19:22:47] <wikibugs>	 (PS2) Dzahn: requesttracker: remove blackbox monitoring [puppet] - https://gerrit.wikimedia.org/r/1114038 (https://phabricator.wikimedia.org/T384721)
[19:27:28] <wikibugs>	 (CR) Dzahn: [C:+2] requesttracker: remove blackbox monitoring [puppet] - https://gerrit.wikimedia.org/r/1114038 (https://phabricator.wikimedia.org/T384721) (owner: Dzahn)
[19:27:42] <logmsgbot>	 !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 90 days, 0:00:00 on moscovium.eqiad.wmnet with reason: to be decomed
[19:34:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T384592)', diff saved to https://phabricator.wikimedia.org/P72391 and previous config saved to /var/cache/conftool/dbconfig/20250124-193404-marostegui.json
[19:34:09] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[19:43:37] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:46:30] <jinxer-wm>	 FIRING: [3x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-eqiad.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[19:49:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P72392 and previous config saved to /var/cache/conftool/dbconfig/20250124-194911-marostegui.json
[19:51:30] <jinxer-wm>	 FIRING: [5x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[20:04:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P72393 and previous config saved to /var/cache/conftool/dbconfig/20250124-200419-marostegui.json
[20:04:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10493696 (phaultfinder)
[20:19:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T384592)', diff saved to https://phabricator.wikimedia.org/P72394 and previous config saved to /var/cache/conftool/dbconfig/20250124-201926-marostegui.json
[20:19:41] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1246.eqiad.wmnet with reason: Maintenance
[20:19:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1246 (T384592)', diff saved to https://phabricator.wikimedia.org/P72395 and previous config saved to /var/cache/conftool/dbconfig/20250124-201947-marostegui.json
[20:28:37] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[21:04:21] <wikibugs>	 (CR) Jforrester: "Yeah, the bespoke legal situation for Wikifunctions hasn't changed." [puppet] - https://gerrit.wikimedia.org/r/1072268 (owner: Jforrester)
[21:05:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T384592)', diff saved to https://phabricator.wikimedia.org/P72396 and previous config saved to /var/cache/conftool/dbconfig/20250124-210515-marostegui.json
[21:05:21] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[21:08:56] <wikibugs>	 (PS1) Zabe: Increase revision-slots cache expiry back to default for most wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1114060 (https://phabricator.wikimedia.org/T183490)
[21:12:40] <logmsgbot>	 !log amastilovic@deploy2002 Started deploy [airflow-dags/platform_eng@3907ed7]: (no justification provided)
[21:13:14] <logmsgbot>	 !log amastilovic@deploy2002 Finished deploy [airflow-dags/platform_eng@3907ed7]: (no justification provided) (duration: 00m 35s)
[21:15:48] <logmsgbot>	 !log amastilovic@deploy2002 Started deploy [airflow-dags/platform_eng@3907ed7]: (no justification provided)
[21:15:57] <logmsgbot>	 !log amastilovic@deploy2002 Finished deploy [airflow-dags/platform_eng@3907ed7]: (no justification provided) (duration: 00m 10s)
[21:20:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P72397 and previous config saved to /var/cache/conftool/dbconfig/20250124-212023-marostegui.json
[21:24:13] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:33:43] <jinxer-wm>	 FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[21:35:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P72398 and previous config saved to /var/cache/conftool/dbconfig/20250124-213530-marostegui.json
[21:38:37] <jinxer-wm>	 RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[21:38:49] <jinxer-wm>	 RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[21:42:30] <logmsgbot>	 !log amastilovic@deploy2002 Started deploy [airflow-dags/platform_eng@ebb3680]: (no justification provided)
[21:42:42] <wikibugs>	 (PS1) Gergő Tisza: Update CentralAuth multi-DC rules for SUL3 [puppet] - https://gerrit.wikimedia.org/r/1114070 (https://phabricator.wikimedia.org/T363695)
[21:43:00] <logmsgbot>	 !log amastilovic@deploy2002 Finished deploy [airflow-dags/platform_eng@ebb3680]: (no justification provided) (duration: 00m 31s)
[21:47:07] <brett>	 !log Testing thermal settings on cp7004 (T373993)
[21:47:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:47:11] <stashbot>	 T373993: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993
[21:49:54] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp7004.magru.wmnet,service=cdn
[21:50:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T384592)', diff saved to https://phabricator.wikimedia.org/P72399 and previous config saved to /var/cache/conftool/dbconfig/20250124-215037-marostegui.json
[21:50:42] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[21:50:53] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[21:51:30] <logmsgbot>	 !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7004.magru.wmnet with reason: Thermal settings testing (T373993)
[22:02:09] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[22:05:49] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  cloudgw1003 - vriley@cumin1002"
[22:07:15] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  cloudgw1003 - vriley@cumin1002"
[22:07:15] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:08:16] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudgw1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:08:28] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudgw1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:10:24] <logmsgbot>	 !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet,service=(cdn|ats-be)
[22:10:35] <logmsgbot>	 !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7003.magru.wmnet,service=(cdn|ats-be)
[22:10:41] <logmsgbot>	 !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7008.magru.wmnet,service=(cdn|ats-be)
[22:10:55] <logmsgbot>	 !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7006.magru.wmnet,service=(cdn|ats-be)
[22:11:15] <logmsgbot>	 !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp700[2-4].magru.wmnet,service=(cdn|ats-be)
[22:11:24] <logmsgbot>	 !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7010.magru.wmnet,service=(cdn|ats-be)
[22:11:27] <logmsgbot>	 !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7015.magru.wmnet,service=(cdn|ats-be)
[22:11:43] <sukhe>	 !log pool bunch of cp7x in magru for ats-be that were depooled
[22:11:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:18:09] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp7004.magru.wmnet,service=cdn
[22:18:32] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp7004.magru.wmnet
[22:19:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10493957 (phaultfinder)
[22:42:57] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db2148.codfw.wmnet with reason: Maintenance
[22:43:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2148 (T384592)', diff saved to https://phabricator.wikimedia.org/P72401 and previous config saved to /var/cache/conftool/dbconfig/20250124-224303-marostegui.json
[22:43:08] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[22:43:25] <wikibugs>	 (CR) Bartosz Dziewoński: [C:-1] "Typo – should be `wg`, not `wmg`." [mediawiki-config] - https://gerrit.wikimedia.org/r/1113141 (https://phabricator.wikimedia.org/T378402) (owner: Pmiazga)
[22:52:49] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudgw1003
[22:54:51] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudgw1003
[22:55:07] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudgw1004
[22:56:18] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudgw1004
[23:04:18] <wikibugs>	 (PS1) BCornwall: magru: Remove ats-be services from conftool [puppet] - https://gerrit.wikimedia.org/r/1114074
[23:06:00] <wikibugs>	 (CR) BCornwall: [V:+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4861/console" [puppet] - https://gerrit.wikimedia.org/r/1114074 (owner: BCornwall)
[23:08:37] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:18:28] <wikibugs>	 SRE, MW-on-K8s: mwgrep cannot be used from a deployment host - https://phabricator.wikimedia.org/T384764 (Urbanecm_WMF) NEW
[23:24:35] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10494101 (phaultfinder)
[23:26:54] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudgw1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:30:29] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudgw1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:34:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T384592)', diff saved to https://phabricator.wikimedia.org/P72402 and previous config saved to /var/cache/conftool/dbconfig/20250124-233407-marostegui.json
[23:34:12] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[23:36:54] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7004.magru.wmnet
[23:39:16] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudgw1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:39:34] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for cp7004.magru.wmnet
[23:39:35] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp7004.magru.wmnet
[23:46:34] <wikibugs>	 ops-esams, ops-magru, SRE, DC-Ops, Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10494148 (BCornwall) I did some more testing:  (Rounded/eyeballed averages) | Profile | Offset | Fan RPS | CPU Temp (Celsius) | Default | None | 4k | 80 | Maximum Perform...
[23:49:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P72403 and previous config saved to /var/cache/conftool/dbconfig/20250124-234914-marostegui.json
[23:51:30] <jinxer-wm>	 FIRING: [5x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer