[00:02:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P72318 and previous config saved to /var/cache/conftool/dbconfig/20250124-000200-marostegui.json [00:05:31] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:17:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P72319 and previous config saved to /var/cache/conftool/dbconfig/20250124-001708-marostegui.json [00:31:42] RECOVERY - Disk space on arclamp1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=arclamp1001&var-datasource=eqiad+prometheus/ops [00:31:48] RECOVERY - Disk space on arclamp2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=arclamp2001&var-datasource=codfw+prometheus/ops [00:32:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T384592)', diff saved to https://phabricator.wikimedia.org/P72320 and previous config saved to /var/cache/conftool/dbconfig/20250124-003215-marostegui.json [00:32:19] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [00:32:31] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2180.codfw.wmnet with reason: Maintenance [00:32:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T384592)', diff saved to https://phabricator.wikimedia.org/P72321 and previous config saved to 
/var/cache/conftool/dbconfig/20250124-003237-marostegui.json [00:34:55] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:38:40] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1113893 [00:38:40] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1113893 (owner: 10TrainBranchBot) [00:41:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T384592)', diff saved to https://phabricator.wikimedia.org/P72322 and previous config saved to /var/cache/conftool/dbconfig/20250124-004102-marostegui.json [00:41:07] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [00:52:55] (03CR) 10Ladsgroup: "Is this needed? I think we can abandon this. The production has moved on for ten weekly releases." [extensions/Translate] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1088329 (https://phabricator.wikimedia.org/T379150) (owner: 10Jforrester) [00:54:21] (03CR) 10Ladsgroup: "A lot of time has passed, I think you still need this, but want to double check before deploying." 
[puppet] - https://gerrit.wikimedia.org/r/1072268 (owner: Jforrester) [00:56:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P72323 and previous config saved to /var/cache/conftool/dbconfig/20250124-005609-marostegui.json [00:59:08] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1113893 (owner: TrainBranchBot) [01:01:58] (PS3) Pppery: Update translations [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1104740 [01:08:55] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1113895 [01:08:55] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1113895 (owner: TrainBranchBot) [01:11:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P72324 and previous config saved to /var/cache/conftool/dbconfig/20250124-011116-marostegui.json [01:21:40] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10491086 (phaultfinder) [01:26:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T384592)', diff saved to https://phabricator.wikimedia.org/P72325 and previous config saved to /var/cache/conftool/dbconfig/20250124-012623-marostegui.json [01:26:28] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [01:26:39] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2193.codfw.wmnet with reason: Maintenance [01:26:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T384592)', diff saved to https://phabricator.wikimedia.org/P72326 and previous config saved to 
/var/cache/conftool/dbconfig/20250124-012645-marostegui.json [01:28:46] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1113895 (owner: 10TrainBranchBot) [01:33:46] PROBLEM - rt.wikimedia.org tls expiry on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [01:34:36] RECOVERY - rt.wikimedia.org tls expiry on moscovium is OK: OK - Certificate rt.discovery.wmnet will expire on Tue 18 Feb 2025 07:56:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [01:34:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T384592)', diff saved to https://phabricator.wikimedia.org/P72327 and previous config saved to /var/cache/conftool/dbconfig/20250124-013444-marostegui.json [01:34:49] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [01:49:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P72328 and previous config saved to /var/cache/conftool/dbconfig/20250124-014951-marostegui.json [02:04:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P72329 and previous config saved to /var/cache/conftool/dbconfig/20250124-020458-marostegui.json [02:17:42] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:17:58] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:18:40] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 8.788 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:18:56] RECOVERY - mailman 
archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53369 bytes in 8.261 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:20:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T384592)', diff saved to https://phabricator.wikimedia.org/P72330 and previous config saved to /var/cache/conftool/dbconfig/20250124-022005-marostegui.json [02:20:09] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [02:20:21] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2197.codfw.wmnet with reason: Maintenance [02:31:58] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:32:48] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:34:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10491141 (10phaultfinder) [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:48:45] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2217.codfw.wmnet with reason: Maintenance [02:48:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2217 (T384592)', diff saved to https://phabricator.wikimedia.org/P72331 and previous config saved to /var/cache/conftool/dbconfig/20250124-024851-marostegui.json [02:48:56] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [03:01:42] RESOLVED: JobUnavailable: Reduced availability for 
job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:16:26] FIRING: [2x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:16:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T384592)', diff saved to https://phabricator.wikimedia.org/P72332 and previous config saved to /var/cache/conftool/dbconfig/20250124-031655-marostegui.json [03:17:00] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [03:32:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P72333 and previous config saved to /var/cache/conftool/dbconfig/20250124-033202-marostegui.json [03:34:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10491170 (10phaultfinder) [03:47:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P72334 and previous config saved to /var/cache/conftool/dbconfig/20250124-034709-marostegui.json [04:02:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T384592)', diff saved to https://phabricator.wikimedia.org/P72335 and previous config saved to /var/cache/conftool/dbconfig/20250124-040216-marostegui.json [04:02:23] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [04:02:33] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2224.codfw.wmnet with reason: 
Maintenance [04:02:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2224 (T384592)', diff saved to https://phabricator.wikimedia.org/P72336 and previous config saved to /var/cache/conftool/dbconfig/20250124-040239-marostegui.json [04:05:31] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:23:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [04:29:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T384592)', diff saved to https://phabricator.wikimedia.org/P72337 and previous config saved to /var/cache/conftool/dbconfig/20250124-042942-marostegui.json [04:29:47] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [04:34:55] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:44:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P72338 and previous config saved to /var/cache/conftool/dbconfig/20250124-044449-marostegui.json [04:59:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to 
https://phabricator.wikimedia.org/P72339 and previous config saved to /var/cache/conftool/dbconfig/20250124-045955-marostegui.json [05:15:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T384592)', diff saved to https://phabricator.wikimedia.org/P72340 and previous config saved to /var/cache/conftool/dbconfig/20250124-051503-marostegui.json [05:15:07] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [05:15:19] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2229.codfw.wmnet with reason: Maintenance [05:15:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2229 (T384592)', diff saved to https://phabricator.wikimedia.org/P72341 and previous config saved to /var/cache/conftool/dbconfig/20250124-051525-marostegui.json [05:16:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:16:58] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:17:48] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53368 bytes in 0.204 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:19:42] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 7.499 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:20:58] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:21:58] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53368 bytes in 9.858 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:22:07] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin 
failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [05:22:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:27:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [05:27:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:28:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [05:35:05] (03PS1) 10Marostegui: Revert "db1222: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1113911 [05:35:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72342 and previous config saved to /var/cache/conftool/dbconfig/20250124-053535-root.json [05:35:52] 
(CR) Marostegui: [C:+2] Revert "db1222: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1113911 (owner: Marostegui) [05:42:14] (PS1) Marostegui: Revert "db2189: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1113912 [05:42:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72343 and previous config saved to /var/cache/conftool/dbconfig/20250124-054227-root.json [05:42:38] (CR) Marostegui: [C:+2] Revert "db2189: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1113912 (owner: Marostegui) [05:43:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T384592)', diff saved to https://phabricator.wikimedia.org/P72344 and previous config saved to /var/cache/conftool/dbconfig/20250124-054335-marostegui.json [05:43:39] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [05:45:07] (Abandoned) Marostegui: mariadb: productionize db2226 [puppet] - https://gerrit.wikimedia.org/r/1087902 (https://phabricator.wikimedia.org/T373579) (owner: Arnaudb) [05:47:37] (PS1) Marostegui: backup1002.cnf.erb: Replace es1022 with es1043 [puppet] - https://gerrit.wikimedia.org/r/1113913 (https://phabricator.wikimedia.org/T382569) [05:49:07] (CR) Marostegui: "This is a NOOP. In any case, es1043 was cloned from es1022, so the data is the same and are the users. dump user exists on es1043." 
[puppet] - 10https://gerrit.wikimedia.org/r/1113913 (https://phabricator.wikimedia.org/T382569) (owner: 10Marostegui) [05:49:38] (03CR) 10Marostegui: [C:03+2] backup1002.cnf.erb: Replace es1022 with es1043 [puppet] - 10https://gerrit.wikimedia.org/r/1113913 (https://phabricator.wikimedia.org/T382569) (owner: 10Marostegui) [05:50:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72345 and previous config saved to /var/cache/conftool/dbconfig/20250124-055042-root.json [05:57:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72346 and previous config saved to /var/cache/conftool/dbconfig/20250124-055733-root.json [05:58:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2229', diff saved to https://phabricator.wikimedia.org/P72347 and previous config saved to /var/cache/conftool/dbconfig/20250124-055842-marostegui.json [06:04:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:05:23] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.196 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:05:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72348 and previous config saved to /var/cache/conftool/dbconfig/20250124-060547-root.json [06:06:13] (03CR) 10Kevin Bazira: EventStreamConfig: Add mediawiki.article_country_prediction_change stream (031 comment) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1112451 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira) [06:07:02] (03CR) 10Kevin Bazira: "as discussed in yesterday's meeting, we will begin by producing the weighted tags stream: https://phabricator.wikimedia.org/T382295#104898" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112451 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira) [06:12:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72349 and previous config saved to /var/cache/conftool/dbconfig/20250124-061238-root.json [06:13:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2229', diff saved to https://phabricator.wikimedia.org/P72350 and previous config saved to /var/cache/conftool/dbconfig/20250124-061348-marostegui.json [06:20:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72351 and previous config saved to /var/cache/conftool/dbconfig/20250124-062052-root.json [06:27:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72352 and previous config saved to /var/cache/conftool/dbconfig/20250124-062744-root.json [06:28:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T384592)', diff saved to https://phabricator.wikimedia.org/P72353 and previous config saved to /var/cache/conftool/dbconfig/20250124-062855-marostegui.json [06:29:01] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [06:35:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72354 and previous config saved to /var/cache/conftool/dbconfig/20250124-063557-root.json [06:42:50] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72355 and previous config saved to /var/cache/conftool/dbconfig/20250124-064249-root.json [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250124T0700) [07:09:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [07:16:26] FIRING: [2x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:24:42] Hi operations folks! Automatic citations in VE are broken on most wikis. Is that worth an emergency deploy on a Friday? 
https://phabricator.wikimedia.org/T384661 [07:34:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10491304 (phaultfinder) [07:35:19] PROBLEM - Disk space on grafana2001 is CRITICAL: DISK CRITICAL - free space: / 543MiB (3% inode=38%): /tmp 543MiB (3% inode=38%): /var/tmp 543MiB (3% inode=38%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [07:37:54] (PS1) Mvolz: Revert "Warn if 'preprint', 'dataset', or 'standard' key is missing" [extensions/Citoid] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1113948 (https://phabricator.wikimedia.org/T384661) [07:39:11] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/Citoid] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1113948 (https://phabricator.wikimedia.org/T384661) (owner: Mvolz) [07:42:11] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [07:43:17] PROBLEM - Hadoop NodeManager on an-worker1152 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [07:44:35] PROBLEM - Hadoop NodeManager on an-worker1175 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [07:47:00] (PS1) Marostegui: mariadb: Remove es1022 [puppet] - https://gerrit.wikimedia.org/r/1113949 (https://phabricator.wikimedia.org/T384566) [07:47:13] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 244.11 ms [07:47:39] 
thcipriani: brennen: help! I'd like to do an emergency deploy for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Citoid/+/1113948/ -- context is T384661 [07:47:40] T384661: Citoid's automatic reference feature is broken in multiple wikis - https://phabricator.wikimedia.org/T384661 [07:48:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts es1022.eqiad.wmnet [07:48:47] (following the template in Deployments/Emergencies. Not sure if it's UBN level, but it is a pretty prominent feature that many rely on, that would stay broken for most languages over the whole weekend if it isn't deployed now.) [07:49:33] (CR) Marostegui: [C:+2] mariadb: Remove es1022 [puppet] - https://gerrit.wikimedia.org/r/1113949 (https://phabricator.wikimedia.org/T384566) (owner: Marostegui) [07:51:50] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1225.eqiad.wmnet with reason: os upgrade [07:53:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2022.codfw.wmnet with OS bookworm [07:53:50] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10491320 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2022.codfw.wmnet with OS bookworm [07:54:15] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox [07:54:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10491322 (phaultfinder) [07:55:19] RECOVERY - Disk space on grafana2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [07:55:33] (CR) Jcrespo: [C:+1] backup1002.cnf.erb: Replace es1022 with es1043 [puppet] - https://gerrit.wikimedia.org/r/1113913 (https://phabricator.wikimedia.org/T382569) 
(owner: Marostegui) [07:57:53] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1022.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [07:58:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1022.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [07:58:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:58:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es1022.eqiad.wmnet [07:58:56] ops-eqiad, DBA, DC-Ops, decommission-hardware: decommission es1022.eqiad.wmnet - https://phabricator.wikimedia.org/T384566#10491323 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1002 for hosts: `es1022.eqiad.wmnet` - es1022.eqiad.wmnet (**PASS**) - Downtimed... [07:58:58] ops-eqiad, DBA, DC-Ops, decommission-hardware: decommission es1022.eqiad.wmnet - https://phabricator.wikimedia.org/T384566#10491324 (Marostegui) a:Marostegui→None [07:59:07] ops-eqiad, DBA, DC-Ops, decommission-hardware: decommission es1022.eqiad.wmnet - https://phabricator.wikimedia.org/T384566#10491330 (Marostegui) This is ready for #dc-ops [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250124T0800) [08:03:17] RECOVERY - Hadoop NodeManager on an-worker1152 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:04:47] !log root@cumin1002 START - Cookbook sre.hosts.reimage for host db1225.eqiad.wmnet with OS bookworm [08:04:52] (PS1) Marostegui: instances.yaml: Remove es1023 [puppet] - https://gerrit.wikimedia.org/r/1113950 (https://phabricator.wikimedia.org/T384679) [08:05:31] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:06:08] (CR) Marostegui: [C:+2] instances.yaml: Remove es1023 [puppet] - https://gerrit.wikimedia.org/r/1113950 (https://phabricator.wikimedia.org/T384679) (owner: Marostegui) [08:07:37] RECOVERY - Hadoop NodeManager on an-worker1175 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:08:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es1044 to es5 master', diff saved to https://phabricator.wikimedia.org/P72356 and previous config saved to /var/cache/conftool/dbconfig/20250124-080804-root.json [08:08:43] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [08:08:43] !log Remove es1023 from es5 eqiad dbmaint T384679 [08:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:47] T384679: decommission es1023.eqiad.wmnet - https://phabricator.wikimedia.org/T384679 [08:10:24] (PS1) Marostegui: es1023: Disable 
notifications [puppet] - https://gerrit.wikimedia.org/r/1113951 (https://phabricator.wikimedia.org/T384679) [08:10:51] (CR) Marostegui: [C:+2] es1023: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1113951 (https://phabricator.wikimedia.org/T384679) (owner: Marostegui) [08:11:03] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1216.eqiad.wmnet with reason: os upgrade [08:18:41] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1201.eqiad.wmnet with reason: Maintenance [08:18:57] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 228.76 ms [08:21:34] !log root@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1225.eqiad.wmnet with reason: host reimage [08:25:02] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1225.eqiad.wmnet with reason: host reimage [08:25:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [08:29:02] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2214.codfw.wmnet with reason: Maintenance [08:30:10] (CR) Jelto: "The point I tried to make is: when the Kubernetes cluster is upgraded to `1.31` the `kubectl` image has to be updated as well, otherwise t" [deployment-charts] - https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth) [08:30:22] !log root@cumin1002 START - Cookbook sre.hosts.reimage for host db1216.eqiad.wmnet with OS bookworm [08:34:55] FIRING: SystemdUnitFailed: 
wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:36:14] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1156.eqiad.wmnet with reason: Maintenance [08:36:32] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:36:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1156 (T384592)', diff saved to https://phabricator.wikimedia.org/P72357 and previous config saved to /var/cache/conftool/dbconfig/20250124-083638-marostegui.json [08:36:44] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [08:42:47] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2022.codfw.wmnet with reason: host reimage [08:46:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2022.codfw.wmnet with reason: host reimage [08:47:25] !log root@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1216.eqiad.wmnet with reason: host reimage [08:49:08] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1225.eqiad.wmnet with OS bookworm [08:51:26] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1216.eqiad.wmnet with reason: host reimage [09:02:37] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.11 point update - https://phabricator.wikimedia.org/T373795#10491413 (10MoritzMuehlenhoff) [09:05:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2022.codfw.wmnet with OS bookworm [09:05:58] 06SRE, 
10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10491415 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2022.codfw.wmnet with OS bookworm completed: - ganeti202... [09:10:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet [09:14:07] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1216.eqiad.wmnet with OS bookworm [09:18:14] (03PS1) 10Vgutierrez: liberica: Add katran config settings [puppet] - 10https://gerrit.wikimedia.org/r/1113961 (https://phabricator.wikimedia.org/T380450) [09:18:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet [09:20:32] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2022.codfw.wmnet to cluster codfw and group B [09:21:27] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2022.codfw.wmnet to cluster codfw and group B [09:21:59] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 07IPv6: Enable ipv6 on ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T379890#10491430 (10MoritzMuehlenhoff) [09:24:22] PROBLEM - Hadoop NodeManager on an-worker1121 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:29:33] (03PS1) 10Muehlenhoff: Switch ganeti2020 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1113963 [09:32:22] RECOVERY - Hadoop NodeManager on an-worker1121 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:36:15] !log marostegui@cumin1002 
dbctl commit (dc=all): 'Repooling after maintenance db1156 (T384592)', diff saved to https://phabricator.wikimedia.org/P72358 and previous config saved to /var/cache/conftool/dbconfig/20250124-093614-marostegui.json [09:36:19] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [09:36:50] (03PS2) 10Vgutierrez: liberica: Add katran config settings [puppet] - 10https://gerrit.wikimedia.org/r/1113961 (https://phabricator.wikimedia.org/T380450) [09:36:56] (03CR) 10Marostegui: [C:03+1] "Remember to run dbctl config commit -m once puppet has been merged" [puppet] - 10https://gerrit.wikimedia.org/r/1113849 (https://phabricator.wikimedia.org/T384480) (owner: 10Federico Ceratto) [09:43:22] PROBLEM - Hadoop NodeManager on an-worker1125 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:43:39] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on netflow1002.eqiad.wmnet with reason: disabling alerts as I'm running gnmic manually rather than with systemd [09:43:48] 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10491452 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=43ff15dd-e256-46b3-aea6-882240b9fe64) set by cmooney@cumin1002 for 1:00:00 on 1 host(s) and th... 
[09:50:42] FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:51:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P72359 and previous config saved to /var/cache/conftool/dbconfig/20250124-095121-marostegui.json [09:55:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:57:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:01:44] !log mnz@deploy2002 Started deploy [airflow-dags/research@ba61f77]: (no justification provided) [10:01:54] !log mnz@deploy2002 Finished deploy [airflow-dags/research@ba61f77]: (no justification provided) (duration: 00m 12s) [10:02:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:03:33] RECOVERY - BGP status on cr2-magru is OK: BGP OK - up: 83, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:05:23] RECOVERY - Hadoop NodeManager on an-worker1125 is OK: PROCS OK: 1 process with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:05:31] FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:06:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P72360 and previous config saved to /var/cache/conftool/dbconfig/20250124-100628-marostegui.json [10:09:43] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:10:05] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:10:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:11:33] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:11:57] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:12:27] !log mnz@deploy2002 Started deploy [airflow-dags/research@95b14c7]: (no justification provided) [10:13:01] !log mnz@deploy2002 Finished deploy [airflow-dags/research@95b14c7]: (no justification provided) (duration: 00m 43s) [10:19:31] (03CR) 10Jelto: [C:03+1] 
"thanks a lot for adding the currently unused receivers, this looks good now!" [puppet] - 10https://gerrit.wikimedia.org/r/1113594 (owner: 10Dzahn) [10:21:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T384592)', diff saved to https://phabricator.wikimedia.org/P72361 and previous config saved to /var/cache/conftool/dbconfig/20250124-102135-marostegui.json [10:21:40] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [10:21:51] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1182.eqiad.wmnet with reason: Maintenance [10:21:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1182 (T384592)', diff saved to https://phabricator.wikimedia.org/P72362 and previous config saved to /var/cache/conftool/dbconfig/20250124-102157-marostegui.json [10:30:18] (03CR) 10Jelto: [C:03+2] "I'll merge this to test the change and have a bit more visibility for current Gerrit incidents. 
Thanks for figuring out the route and rece" [puppet] - 10https://gerrit.wikimedia.org/r/1113594 (owner: 10Dzahn) [10:31:39] PROBLEM - Hadoop NodeManager on an-worker1149 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:33:39] RECOVERY - Hadoop NodeManager on an-worker1149 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:35:09] PROBLEM - Hadoop NodeManager on an-worker1148 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:46:09] RECOVERY - Hadoop NodeManager on an-worker1148 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:46:15] (03CR) 10Federico Ceratto: [C:03+1] instances.yaml: Remove db2140 [puppet] - 10https://gerrit.wikimedia.org/r/1113849 (https://phabricator.wikimedia.org/T384480) (owner: 10Federico Ceratto) [10:47:41] (03CR) 10Federico Ceratto: [C:03+2] instances.yaml: Remove db2140 [puppet] - 10https://gerrit.wikimedia.org/r/1113849 (https://phabricator.wikimedia.org/T384480) (owner: 10Federico Ceratto) [10:50:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Remove db2140 from dbctl T384480', diff saved to https://phabricator.wikimedia.org/P72363 and previous config saved to /var/cache/conftool/dbconfig/20250124-105029-fceratto.json [10:50:34] T384480: decommission db2140.codfw.wmnet - https://phabricator.wikimedia.org/T384480 [11:05:40] (03PS1) 10Federico Ceratto: 
site.pp, db2140.yaml: remove db2140 [puppet] - 10https://gerrit.wikimedia.org/r/1113967 (https://phabricator.wikimedia.org/T384480) [11:06:49] (03Abandoned) 10Federico Ceratto: site.pp remove "Future" as db2233 is already the master [puppet] - 10https://gerrit.wikimedia.org/r/1113436 (owner: 10Federico Ceratto) [11:18:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T384592)', diff saved to https://phabricator.wikimedia.org/P72365 and previous config saved to /var/cache/conftool/dbconfig/20250124-111834-marostegui.json [11:18:37] FIRING: [2x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:18:39] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10491747 (10MoritzMuehlenhoff) [11:18:39] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [11:20:29] (03PS3) 10JMeybohm: Import upstream release 1.24.2 [debs/istioctl] - 10https://gerrit.wikimedia.org/r/1113460 (https://phabricator.wikimedia.org/T341984) [11:23:37] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:24:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10491761 (10phaultfinder) [11:25:33] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet [11:25:51] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - 
https://phabricator.wikimedia.org/T382508#10491762 (10ops-monitoring-bot) Draining ganeti2020.codfw.wmnet of running VMs [11:29:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet [11:33:26] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-etcd2001.codfw.wmnet to drbd [11:33:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P72366 and previous config saved to /var/cache/conftool/dbconfig/20250124-113341-marostegui.json [11:33:49] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10491808 (10ops-monitoring-bot) VM ml-etcd2001.codfw.wmnet switching disk type to drbd [11:42:00] (03Abandoned) 10Jelto: gerrit: change blackbox checks to collaboration-services/task [puppet] - 10https://gerrit.wikimedia.org/r/1113163 (owner: 10Jelto) [11:42:56] (03CR) 10Jelto: "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113800 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [11:43:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ml-etcd2001.codfw.wmnet to drbd [11:43:37] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:43:37] FIRING: [3x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:44:48] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for 
draining ganeti node ganeti2020.codfw.wmnet [11:44:57] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10491906 (10ops-monitoring-bot) Draining ganeti2020.codfw.wmnet of running VMs [11:45:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet [11:45:48] (03PS1) 10Kamila Součková: wikikube: rename parse10[07-12] to wikikube-worker* [puppet] - 10https://gerrit.wikimedia.org/r/1113974 (https://phabricator.wikimedia.org/T365571) [11:46:41] PROBLEM - BGP status on cr2-magru is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:46:56] (03Abandoned) 10Lucas Werkmeister (WMDE): Increase nonexistent item ID for Commons constraint checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112054 (owner: 10Lucas Werkmeister (WMDE)) [11:47:41] RECOVERY - BGP status on cr2-magru is OK: BGP OK - up: 79, down: 4, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:48:27] (03PS1) 10Jelto: apt: update gitlab-ce to 17.7 [puppet] - 10https://gerrit.wikimedia.org/r/1113975 (https://phabricator.wikimedia.org/T379598) [11:48:31] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-etcd2001.codfw.wmnet to plain [11:48:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P72367 and previous config saved to /var/cache/conftool/dbconfig/20250124-114848-marostegui.json [11:48:54] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10491938 (10ops-monitoring-bot) VM ml-etcd2001.codfw.wmnet switching disk type to plain [11:49:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for 
changing disk type of ml-etcd2001.codfw.wmnet to plain [11:51:15] PROBLEM - Hadoop NodeManager on an-worker1142 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:57:45] PROBLEM - Hadoop NodeManager on an-worker1134 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250124T0800) [12:00:05] jelto, arnoldokoth, and mutante: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for GitLab version upgrades . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250124T1200). 
[12:03:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T384592)', diff saved to https://phabricator.wikimedia.org/P72368 and previous config saved to /var/cache/conftool/dbconfig/20250124-120355-marostegui.json [12:04:00] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [12:04:11] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1188.eqiad.wmnet with reason: Maintenance [12:04:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1188 (T384592)', diff saved to https://phabricator.wikimedia.org/P72369 and previous config saved to /var/cache/conftool/dbconfig/20250124-120417-marostegui.json [12:07:17] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10491980 (10MoritzMuehlenhoff) [12:15:15] RECOVERY - Hadoop NodeManager on an-worker1142 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:17:45] RECOVERY - Hadoop NodeManager on an-worker1134 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:18:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T384592)', diff saved to https://phabricator.wikimedia.org/P72370 and previous config saved to /var/cache/conftool/dbconfig/20250124-121848-marostegui.json [12:18:54] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [12:20:31] FTR, I got some support voices for the emergency deploy Jhs requested in the security channel, so I’m going to go ahead with it [12:20:56] as Citoid is fairly 
important, and the risk for serious breakage from a JS change should be fairly low [12:21:06] if anyone objects, you have ca. 10 minutes to stop me while gate-and-submit runs :) [12:21:14] cc mvolz too, so they're aware :) [12:21:33] oh, apparently there’s actually a window right now ^^ [12:21:36] jouncebot: now [12:21:36] For the next 19 hour(s) and 38 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250124T0800) [12:21:36] For the next 0 hour(s) and 8 minute(s): GitLab version upgrades (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250124T1200) [12:21:55] jelto, arnoldokoth, mutante: okay for me to do a mediawiki deploy during the gitlab upgrade? [12:22:01] (I’m gonna assume yes unless I hear otherwise) [12:22:20] also cc thcipriani and brennen for the emergency deploy per due process and stuff ^^ [12:22:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Citoid] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113948 (https://phabricator.wikimedia.org/T384661) (owner: 10Mvolz) [12:26:42] (03PS1) 10JMeybohm: CI: Ensure admin checks don't run unnecessary template calls [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113979 [12:26:42] (03PS1) 10JMeybohm: CI: Fix helm errors hiding behind YAML parser [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113980 [12:28:42] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:30:44] (03CR) 10JMeybohm: [C:03+2] Update cert-manager to 1.16.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113801 
(owner: 10JMeybohm) [12:30:52] (03CR) 10JMeybohm: [C:03+2] Update coredns to 1.11.3 / coredns helm chart 1.37.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113454 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [12:31:27] Jhs: is there a dashboard/graph where it is visible that citoid is not working well? we may consider adding an alert there [12:31:28] (03CR) 10JMeybohm: [V:03+2 C:03+2] Update istio to 1.24.2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113507 (https://phabricator.wikimedia.org/T373526) (owner: 10JMeybohm) [12:31:34] (03CR) 10JMeybohm: [V:03+2 C:03+2] Update coredns to 1.11.3 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113445 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [12:31:54] effie, not that I know of. I became aware of the issue because sjoerd mentioned it in a Discord chat [12:32:29] Jhs: in that case, it sounds like we are missing some visibility there [12:33:05] Lucas_WMDE: yes, no GitLab upgrade today [12:33:09] (03Merged) 10jenkins-bot: Revert "Warn if 'preprint', 'dataset', or 'standard' key is missing" [extensions/Citoid] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113948 (https://phabricator.wikimedia.org/T384661) (owner: 10Mvolz) [12:33:10] (03CR) 10JMeybohm: [C:03+2] Pin cert-manager version on all clustes to 1.10.6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113800 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [12:33:12] ok thanks! 
[12:33:39] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1113948|Revert "Warn if 'preprint', 'dataset', or 'standard' key is missing" (T384661)]] [12:33:43] T384661: Citoid's automatic reference feature is broken in multiple wikis - https://phabricator.wikimedia.org/T384661 [12:33:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P72371 and previous config saved to /var/cache/conftool/dbconfig/20250124-123355-marostegui.json [12:33:57] I’m not seeing any useful-looking charts in https://grafana.wikimedia.org/d/NJkCVermz/citoid?orgId=1&refresh=5m&from=now-2d&to=now [12:34:16] (but I think that’s also the dashboard for Citoid the service rather than Citoid the extension? not sure) [12:34:53] (03CR) 10JMeybohm: [C:03+1] wikikube: rename parse10[07-12] to wikikube-worker* [puppet] - 10https://gerrit.wikimedia.org/r/1113974 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková) [12:35:41] effie, the change that's being reverted has two effects: 1: disable the automatic tab (bad), but 2: add an mw.log.warn(). Maybe that mw.log is actually logged somewhere and can be monitored? 
[12:36:04] line 52 here: https://gerrit.wikimedia.org/g/mediawiki/extensions/Citoid/+/c6c50c6f9075c23e3735f3b422f1b8ae9866d44e/modules/ve/ve.ui.Citoid.init.js [12:37:04] I doubt it… I think mw.log.warn is really just the same as console.warn https://doc.wikimedia.org/mediawiki-core/master/js/startup_mediawiki.js.html#line149 [12:37:38] (it’s not like mw.logdeprecate / makeDeprecated which includes some tracking) [12:37:46] (*mw.log.deprecate) [12:38:31] mhm, right [12:38:33] !log lucaswerkmeister-wmde@deploy2002 mvolz, lucaswerkmeister-wmde: Backport for [[gerrit:1113948|Revert "Warn if 'preprint', 'dataset', or 'standard' key is missing" (T384661)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:38:37] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:38:45] Jhs: can you test the revert on WikimediaDebug? [12:39:04] (if there’s still a wiki that didn’t fix the config – I only see the fixed wikis in the task description) [12:40:02] Lucas_WMDE, tested, works like it should 👍 [12:40:05] !log lucaswerkmeister-wmde@deploy2002 mvolz, lucaswerkmeister-wmde: Continuing with sync [12:40:07] nice \o/ [12:40:57] Tested in elwiki and jvwiki, which had a disabled automatic tab before I was on WikimediaDebug, and it was working again after. Also tested on nowiki for control just to check that it still works there too. 
So all good 👍 [12:40:59] (03Merged) 10jenkins-bot: Pin cert-manager version on all clustes to 1.10.6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113800 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [12:41:05] Jhs: I will reply on the relevant task, but it sounds like it is not ideal for severe issues such as this one to go undetected until someone complains [12:41:15] (03Merged) 10jenkins-bot: Update cert-manager to 1.16.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113801 (owner: 10JMeybohm) [12:41:17] effie, mhm, agree [12:41:30] (03Merged) 10jenkins-bot: Update coredns to 1.11.3 / coredns helm chart 1.37.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113454 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [12:42:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:46:43] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1113948|Revert "Warn if 'preprint', 'dataset', or 'standard' key is missing" (T384661)]] (duration: 13m 04s) [12:46:48] T384661: Citoid's automatic reference feature is broken in multiple wikis - https://phabricator.wikimedia.org/T384661 [12:46:49] * Lucas_WMDE done deploying [12:49:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P72372 and previous config saved to /var/cache/conftool/dbconfig/20250124-124902-marostegui.json [12:49:13] (03PS1) 10Cyndywikime: Add configurable MinimumTasksPerTopic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113984 (https://phabricator.wikimedia.org/T383714) [12:49:53] Jhs: are we aware how long this has been broken for? 
[12:50:03] (03CR) 10CI reject: [V:04-1] Add configurable MinimumTasksPerTopic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113984 (https://phabricator.wikimedia.org/T383714) (owner: 10Cyndywikime) [12:50:30] effie, not very long I think. Sjoerd mentioned it in a Discord yesterday. I personally used it on Monday when holding an editing course. So at most 3 days? [12:50:57] And not on all wikis – so enwiki was not affected, because it already had the config items that became obligatory [12:52:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:52:33] these are the wikis that would be affected, theoretically: https://global-search.toolforge.org/?q=.&regex=1&namespaces=8&title=Citoid-template-type-map.json (But not all of them were in practice, because that json page doesn't actually have an effect on all of them – some of the ones I tested didn't have the automatic/manual/reuse tabs at all, but rather just a dropdown. I'm not sure why.) 
[12:58:37] FIRING: [2x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:01:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Check link from msw1-eqiad et-0/1/0 to msw2-eqiad et-0/1/0 - https://phabricator.wikimedia.org/T384708 (10cmooney) 03NEW p:05Triage→03Low [13:03:37] RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:04:06] (03CR) 10AOkoth: "Oooh. I should probably change this to the `latest` tag then and probably change the pull policy to always. 
From my conversation with Alex" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [13:04:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T384592)', diff saved to https://phabricator.wikimedia.org/P72373 and previous config saved to /var/cache/conftool/dbconfig/20250124-130409-marostegui.json [13:04:14] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [13:04:25] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1197.eqiad.wmnet with reason: Maintenance [13:04:28] (03PS1) 10Jelto: gerrit: block Digital Ocean IP for scraping [puppet] - 10https://gerrit.wikimedia.org/r/1113986 (https://phabricator.wikimedia.org/T384706) [13:04:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1197 (T384592)', diff saved to https://phabricator.wikimedia.org/P72374 and previous config saved to /var/cache/conftool/dbconfig/20250124-130431-marostegui.json [13:05:27] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1113986 (https://phabricator.wikimedia.org/T384706) (owner: 10Jelto) [13:05:49] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4860/co" [puppet] - 10https://gerrit.wikimedia.org/r/1113986 (https://phabricator.wikimedia.org/T384706) (owner: 10Jelto) [13:06:14] (03CR) 10Jelto: [V:03+1 C:03+2] gerrit: block Digital Ocean IP for scraping [puppet] - 10https://gerrit.wikimedia.org/r/1113986 (https://phabricator.wikimedia.org/T384706) (owner: 10Jelto) [13:06:42] (03PS11) 10AOkoth: miscweb: support os-reports deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) [13:09:14] (03PS12) 10AOkoth: miscweb: support os-reports deployment [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) [13:13:13] Thank you Lucas_WMDE and Jhs :). effie: It was broken on all wikis except for en wiki for however long 1.44.0-wmf.13 was deployed. So, 3 days for group 0, 1, 2 for group 1, about 1 day for group 2. And I agree there is a process problem here. [13:13:19] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host parse[1007-1012].eqiad.wmnet [13:13:26] (03CR) 10Cathal Mooney: [C:03+2] Fr-tech provision script to assign IPs and switch ports (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1092895 (https://phabricator.wikimedia.org/T379553) (owner: 10Cathal Mooney) [13:13:54] (03PS13) 10AOkoth: miscweb: support os-reports deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) [13:14:20] (03CR) 10Kamila Součková: [C:03+2] wikikube: rename parse10[07-12] to wikikube-worker* [puppet] - 10https://gerrit.wikimedia.org/r/1113974 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková) [13:15:27] (03CR) 10CI reject: [V:04-1] Fr-tech provision script to assign IPs and switch ports [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1092895 (https://phabricator.wikimedia.org/T379553) (owner: 10Cathal Mooney) [13:16:53] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host parse[1007-1012].eqiad.wmnet [13:17:33] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1007 to wikikube-worker1148 [13:17:54] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [13:18:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T384592)', diff saved to https://phabricator.wikimedia.org/P72375 and previous config saved to /var/cache/conftool/dbconfig/20250124-131812-marostegui.json [13:18:17] T384592: Add normalization columns to categorylinks table - 
https://phabricator.wikimedia.org/T384592 [13:18:37] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv [13:18:38] e - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:18:38] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv [13:18:38] e - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:20:22] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1113975 (https://phabricator.wikimedia.org/T379598) (owner: 10Jelto) [13:20:31] (03PS10) 10Cathal Mooney: Fr-tech provision script to assign IPs and switch ports [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1092895 (https://phabricator.wikimedia.org/T379553) [13:21:26] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1007 to wikikube-worker1148 - kamila@cumin1002" [13:21:40] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1008 to wikikube-worker1149 
[13:21:45] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1007 to wikikube-worker1148 - kamila@cumin1002" [13:21:45] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:21:45] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1148 [13:22:00] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [13:22:47] (03CR) 10Jelto: [C:03+1] "lgtm" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113752 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [13:23:01] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1148 [13:23:40] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1007 to wikikube-worker1148 [13:24:58] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10492166 (10elukey) From the megacli's perspective, the drive was `Unconfigured (bad)` and I was able to make it `Good` but then, as exp... 
[13:25:44] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1008 to wikikube-worker1149 - kamila@cumin1002" [13:25:59] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1009 to wikikube-worker1150 [13:26:04] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1008 to wikikube-worker1149 - kamila@cumin1002" [13:26:04] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:26:04] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1149 [13:26:19] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [13:27:18] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1149 [13:27:57] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1008 to wikikube-worker1149 [13:29:48] (03CR) 10Cathal Mooney: [C:03+2] Fr-tech provision script to assign IPs and switch ports [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1092895 (https://phabricator.wikimedia.org/T379553) (owner: 10Cathal Mooney) [13:29:59] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1009 to wikikube-worker1150 - kamila@cumin1002" [13:30:16] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1009 to wikikube-worker1150 - kamila@cumin1002" [13:30:17] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:30:17] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1150 [13:30:23] !log kamila@cumin1002 START - Cookbook 
sre.hosts.rename from parse1010 to wikikube-worker1151 [13:30:43] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [13:30:43] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1216.eqiad.wmnet with reason: rebuilding tables [13:31:29] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1150 [13:31:45] (03Merged) 10jenkins-bot: Fr-tech provision script to assign IPs and switch ports [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1092895 (https://phabricator.wikimedia.org/T379553) (owner: 10Cathal Mooney) [13:32:08] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1009 to wikikube-worker1150 [13:32:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on parse1011:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:33:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P72376 and previous config saved to /var/cache/conftool/dbconfig/20250124-133319-marostegui.json [13:33:22] (03CR) 10Jelto: "I'd say usage of `latest` should be avoided in production, we just have to rebuild and bump the image version to make the image compatible" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [13:33:35] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [13:33:37] !log cmooney@cumin1002 END (ERROR) - Cookbook sre.netbox.update-extras (exit_code=97) rolling restart_daemons on A:netbox-canary [13:34:03] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [13:34:15] !log cmooney@cumin1002 END (PASS) - Cookbook 
sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [13:34:20] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1010 to wikikube-worker1151 - kamila@cumin1002" [13:34:33] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1011 to wikikube-worker1152 [13:34:37] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1010 to wikikube-worker1151 - kamila@cumin1002" [13:34:38] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:34:38] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1151 [13:34:54] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [13:35:47] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1151 [13:36:26] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1010 to wikikube-worker1151 [13:37:00] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10492194 (10elukey) I also tried via the BMC's webui, that interestingly shows the disk in `Unconfigured (bad)` state (and I cannot neit... 
[13:38:37] FIRING: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:38:40] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1011 to wikikube-worker1152 - kamila@cumin1002" [13:38:52] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1012 to wikikube-worker1153 [13:38:56] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1011 to wikikube-worker1152 - kamila@cumin1002" [13:38:56] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:38:57] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1152 [13:39:12] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [13:40:04] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1152 [13:40:43] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1011 to wikikube-worker1152 [13:41:01] 06SRE, 06Data-Platform-SRE, 10superset.wikimedia.org: Degraded Superset functionality during a high-traffic incident - https://phabricator.wikimedia.org/T384301#10492210 (10Gehel) p:05Triage→03Medium [13:41:30] (03Abandoned) 10Nikerabbit: RecentChangesTranslationFilterHookHandler: Replace call to deprecated ChangeTags::getDisplayTableName() [extensions/Translate] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1088329 (https://phabricator.wikimedia.org/T379150) (owner: 10Jforrester) [13:41:37] !log andrew@cumin1002 START - Cookbook 
sre.hosts.reimage for host cloudcephosd1013.eqiad.wmnet with OS bullseye [13:42:37] 06SRE, 06Infrastructure-Foundations: sre.netbox.update-extras hits KeyError with logging - https://phabricator.wikimedia.org/T379072#10492221 (10cmooney) >>! In T379072#10295326, @Volans wrote: > In our netbox config we have for the logging formatters: > ` > 'django.server': { > '()': 'django.utils.log... [13:42:46] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1012 to wikikube-worker1153 - kamila@cumin1002" [13:43:03] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1012 to wikikube-worker1153 - kamila@cumin1002" [13:43:03] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:43:03] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1153 [13:43:37] RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:43:53] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [13:44:10] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1153 [13:44:24] !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [13:44:49] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1012 to wikikube-worker1153 [13:45:04] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1148.eqiad.wmnet 
wikikube-worker1149.eqiad.wmnet wikikube-worker1150.eqiad.wmnet wikikube-worker1151.eqiad.wmnet wikikube-worker1152.eqiad.wmnet wikikube-worker1153.eqiad.wmnet on all recursors [13:45:07] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1148.eqiad.wmnet wikikube-worker1149.eqiad.wmnet wikikube-worker1150.eqiad.wmnet wikikube-worker1151.eqiad.wmnet wikikube-worker1152.eqiad.wmnet wikikube-worker1153.eqiad.wmnet on all recursors [13:48:05] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1148.eqiad.wmnet with OS bookworm [13:48:09] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1149.eqiad.wmnet with OS bookworm [13:48:10] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1148 [13:48:11] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1148 [13:48:12] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1149 [13:48:12] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1149 [13:48:18] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1150.eqiad.wmnet with OS bookworm [13:48:21] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1150 [13:48:21] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1150 [13:48:25] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1151.eqiad.wmnet with OS bookworm [13:48:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P72378 and previous config saved to /var/cache/conftool/dbconfig/20250124-134826-marostegui.json [13:48:28] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1151 [13:48:29] !log 
kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1151 [13:48:31] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1152.eqiad.wmnet with OS bookworm [13:48:36] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1153.eqiad.wmnet with OS bookworm [13:48:39] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1152 [13:48:39] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1152 [13:48:40] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1153 [13:48:40] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1153 [13:55:31] (03PS1) 10Kamila Součková: wikikube: rename parse101[3-7] to wikikube-worker* [puppet] - 10https://gerrit.wikimedia.org/r/1113989 (https://phabricator.wikimedia.org/T365571) [13:56:16] (03CR) 10Jelto: [C:03+1] "lgtm" [debs/istioctl] - 10https://gerrit.wikimedia.org/r/1113460 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [13:57:51] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1113989 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková) [14:02:56] (03PS2) 10JMeybohm: CI: Fix helm errors hiding behind YAML parser [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113980 [14:02:58] (03CR) 10Jelto: [C:03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113474 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [14:03:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T384592)', diff saved to https://phabricator.wikimedia.org/P72379 and previous config saved to /var/cache/conftool/dbconfig/20250124-140333-marostegui.json [14:03:36] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1148.eqiad.wmnet with reason: 
host reimage [14:03:38] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [14:03:49] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1222.eqiad.wmnet with reason: Maintenance [14:04:04] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1239.eqiad.wmnet with reason: Maintenance [14:04:11] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1149.eqiad.wmnet with reason: host reimage [14:04:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1222 (T384592)', diff saved to https://phabricator.wikimedia.org/P72380 and previous config saved to /var/cache/conftool/dbconfig/20250124-140410-marostegui.json [14:04:17] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1153.eqiad.wmnet with reason: host reimage [14:04:21] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1150.eqiad.wmnet with reason: host reimage [14:04:27] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1152.eqiad.wmnet with reason: host reimage [14:04:31] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1151.eqiad.wmnet with reason: host reimage [14:05:12] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1013.eqiad.wmnet with OS bullseye [14:05:28] (03PS3) 10JMeybohm: CI: Fix helm errors hiding behind YAML parser [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113980 [14:05:37] (03CR) 10JMeybohm: [V:03+2 C:03+2] Update cert-manager to 1.16.3 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113752 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [14:05:51] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1013.eqiad.wmnet with OS bullseye 
[14:07:08] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1148.eqiad.wmnet with reason: host reimage [14:07:56] !log cmooney@cumin1002 START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet [14:10:16] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1149.eqiad.wmnet with reason: host reimage [14:11:54] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet [14:13:42] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1153.eqiad.wmnet with reason: host reimage [14:17:38] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1150.eqiad.wmnet with reason: host reimage [14:18:36] (03CR) 10Alexandros Kosiaris: [C:03+1] "+1 on the configuration part. As for the $REASONS, I think this predated the introduction of discovery.wmnet DNS RRs to begin with. 
IIRC, " [puppet] - 10https://gerrit.wikimedia.org/r/1113562 (https://phabricator.wikimedia.org/T383946) (owner: 10Dzahn) [14:22:13] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1152.eqiad.wmnet with reason: host reimage [14:24:05] (03PS1) 10Muehlenhoff: Remove access for sstefanova [puppet] - 10https://gerrit.wikimedia.org/r/1113993 [14:24:40] (03CR) 10CI reject: [V:04-1] Remove access for sstefanova [puppet] - 10https://gerrit.wikimedia.org/r/1113993 (owner: 10Muehlenhoff) [14:25:32] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1148.eqiad.wmnet with OS bookworm [14:25:35] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1151.eqiad.wmnet with reason: host reimage [14:26:51] (03PS2) 10Muehlenhoff: Remove access for sstefanova [puppet] - 10https://gerrit.wikimedia.org/r/1113993 [14:28:15] (03CR) 10Muehlenhoff: [C:03+2] Remove access for sstefanova [puppet] - 10https://gerrit.wikimedia.org/r/1113993 (owner: 10Muehlenhoff) [14:29:27] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1149.eqiad.wmnet with OS bookworm [14:29:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10492343 (10phaultfinder) [14:31:42] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops, 06Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10492346 (10RobH) >>! In T373993#10490884, @BCornwall wrote: > > The first dip on all the hosts was unrelated to anything I did - not sure what happened t... 
[14:33:12] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1153.eqiad.wmnet with OS bookworm [14:36:03] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fransc1001 - https://phabricator.wikimedia.org/T367814#10492369 (10cmooney) @VRiley-WMF ok so after a bit more back-and-forth I think we can finally trial this new script and see how it works: https://netbox-next.wikimedia.org/ext... [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:55] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1150.eqiad.wmnet with OS bookworm [14:40:00] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1013.eqiad.wmnet with OS bullseye [14:43:25] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1152.eqiad.wmnet with OS bookworm [14:45:17] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1151.eqiad.wmnet with OS bookworm [14:45:47] FIRING: [2x] ProbeDown: Service ml-serve-ctrl2001:6443 has failed probes (http_ml_serve_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ml-serve-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:45:54] (03PS1) 10Tsevener: Add ios.article_link_interaction stream to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113996 (https://phabricator.wikimedia.org/T382031) [14:46:07] !incidents [14:46:08] 5627 (UNACKED) [2x] ProbeDown sre (ml-serve-ctrl2001:6443 probes/custom codfw) 
[14:46:12] !ack 5627 [14:46:13] 5627 (ACKED) [2x] ProbeDown sre (ml-serve-ctrl2001:6443 probes/custom codfw) [14:46:35] well, another null runbook link :-/ [14:46:47] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet'] [14:47:31] (03CR) 10Marostegui: [C:03+1] "To be run AFTER the decommissioning script has successfully finished." [puppet] - 10https://gerrit.wikimedia.org/r/1113967 (https://phabricator.wikimedia.org/T384480) (owner: 10Federico Ceratto) [14:50:47] RESOLVED: [2x] ProbeDown: Service ml-serve-ctrl2001:6443 has failed probes (http_ml_serve_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ml-serve-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:51:12] OK, I wasn't getting very far with that (kube_env ml-serv codfw didn't work), but it seems to have resolved itself... 
[14:52:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T384592)', diff saved to https://phabricator.wikimedia.org/P72381 and previous config saved to /var/cache/conftool/dbconfig/20250124-145222-marostegui.json [14:52:27] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [14:53:47] (03PS1) 10Sergio Gimeno: beta: increase growth tasks lookahead size [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113997 (https://phabricator.wikimedia.org/T325990) [14:54:11] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet'] [14:55:13] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1013.eqiad.wmnet with OS bullseye [14:55:31] (03PS33) 10Arnaudb: gitlab_runner: migrate ferm rules to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1109726 (https://phabricator.wikimedia.org/T370677) [14:55:31] (03CR) 10Arnaudb: "This patch is supposed to be idempotent with the current state of firewall on runners. 
It adds things in `modules/nftables`, `modules/fire" [puppet] - 10https://gerrit.wikimedia.org/r/1109726 (https://phabricator.wikimedia.org/T370677) (owner: 10Arnaudb) [14:56:16] (03CR) 10Jelto: "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113979 (owner: 10JMeybohm) [15:00:09] (03PS2) 10Sergio Gimeno: beta: increase growth tasks lookahead size [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113997 (https://phabricator.wikimedia.org/T325990) [15:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:19] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1013.eqiad.wmnet with OS bullseye [15:02:27] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet'] [15:03:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113996 (https://phabricator.wikimedia.org/T382031) (owner: 10Tsevener) [15:05:47] FIRING: [2x] ProbeDown: Service ml-serve-ctrl2002:6443 has failed probes (http_ml_serve_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ml-serve-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:06:07] OK, it's paging again, anyone know about this service? [15:06:28] Perhaps btullis? 
[15:06:35] it's supposedly a Kubernetes thing, but if I check srv/deployment-charts/helmfile.d/services on deploy2002 there's not anything called ml-serv there [15:06:41] Or klausman [15:06:57] so I can't even run kube_env to get to the point where I might list pods or anything [15:07:02] those would be my two guesses as well. [15:07:22] taking a look [15:07:23] The only docs I've found are https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters#ml-serve which aren't very enlightening [15:07:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P72382 and previous config saved to /var/cache/conftool/dbconfig/20250124-150729-marostegui.json [15:07:44] https://wikitech.wikimedia.org/wiki/Runbook#ml-serve-ctrl2002:6443 doesn't exist [15:08:38] Emperor: it's the control-plane for the ml-serve cluster [15:08:44] the kubelet on that machine restarted [15:08:46] Active: active (running) since Fri 2025-01-24 15:05:45 UTC; 2min 27s ago [15:08:48] I was about to disconnect, but an ml host had its disk full [15:08:48] and you don't want service but ml-services [15:08:57] not sure if related [15:09:17] probably lab related (I suspect ml-lab1001?) [15:09:23] ml-lab1001 [15:09:28] DISK CRITICAL - free space: /srv 14564MiB (3% inode=94%): [15:09:31] yeah, that won't affect prod [15:09:34] ok [15:09:37] ty! [15:09:38] srv/deployment-charts/helmfile.d/ml-services [15:09:40] that is [15:09:57] Oh. 
[15:10:16] I am not sure why we are paging for a single host of the control-plane though tbh [15:10:33] it has 3 fwiw [15:10:47] or 2 perhaps, I think ml is with 2 [15:10:47] RESOLVED: [2x] ProbeDown: Service ml-serve-ctrl2002:6443 has failed probes (http_ml_serve_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ml-serve-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:11:16] Huh... [15:11:19] Error while processing event ("/sys/fs/cgroup/system.slice/clean-confd-rundir.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/system.slice/clean-confd-rundir.service: no such file or directory [15:11:21] is it under control or is it flapping? [15:11:32] it's been down-then-up twice recently [15:11:34] !incidents [15:11:34] 5628 (RESOLVED) [2x] ProbeDown sre (ml-serve-ctrl2002:6443 probes/custom codfw) [15:11:34] 5627 (RESOLVED) [2x] ProbeDown sre (ml-serve-ctrl2001:6443 probes/custom codfw) [15:11:38] :-( [15:11:53] I'm keeping an eye on its logs [15:12:19] so far no restarts beyond the one at 15:05:45 UTC [15:12:40] (03CR) 10Kamila Součková: [C:03+2] wikikube: rename parse101[3-7] to wikikube-worker* [puppet] - 10https://gerrit.wikimedia.org/r/1113989 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková) [15:12:41] klausman: previous page was at 14:45 [15:12:44] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host parse[1013-1017].eqiad.wmnet [15:13:33] Yeah, that was similar (errors about stuff like "/sys/fs/cgroup/system.slice/clean-confd-rundir.service" not existing, and then giving up because it couldn't update the lease) [15:14:42] disregard, the cgroup errors are too time-distant to be relevant [15:15:17] The relevant errors are about being unable to talk to the service (10.2.1.39:6443), getting an ECONN [15:16:12] it's from both hosts 
apparently. both -ctrl2001 and -ctrl2002 complained. And at around the same time [15:17:29] Not clear what would cause it. Maybe a network blip? They're VMs. Moritz made one of the associated etcd machines (a VM) be slightly higher latency due to DRBD during a VM move, but that was hours before [15:17:37] https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&var-datasource=thanos&var-site=codfw&var-cluster=k8s-mlserve&from=now-1h&to=now [15:17:49] it's visible in the graphs here [15:18:09] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host parse[1013-1017].eqiad.wmnet [15:18:15] work latencies and api latencies jumped to way higher levels [15:18:17] (03PS1) 10JMeybohm: k8s.pool-depool-node: Add support to downtime/remove downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1114000 [15:18:50] is there a way to track VM migrations in ganeti (or otherwise)? [15:19:03] the autoregister controller spiked to 10s [15:19:40] !log andrew@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet'] [15:19:51] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1013 to wikikube-worker1154 [15:20:11] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [15:20:25] inflatador: https://wikitech.wikimedia.org/wiki/Ganeti#View_the_job_queue [15:20:35] first command to find the job that is the migration you care for [15:20:42] second to get an overview of how it is going [15:20:45] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv [15:20:45] e - 
kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:20:45] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv [15:20:45] e - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:21:05] akosiaris thanks...getting to relive my former life as virt engineer ;) [15:21:33] :-). I assume I don't need to tell you then about the qemu monitor socket [15:21:57] but an info migrate command there should give you all the nitty gritty details [15:22:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P72383 and previous config saved to /var/cache/conftool/dbconfig/20250124-152236-marostegui.json [15:22:41] klausman: for the first page, etcd request latencies skyrocketed to 15s, see https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&var-datasource=thanos&var-site=codfw&var-cluster=k8s-mlserve&from=now-1h&to=now&viewPanel=28 [15:23:18] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1013.eqiad.wmnet with OS bullseye [15:23:51] akosiaris: but why would it then cause a flipflop hours later? [15:24:08] hours? 
I count 20 minutes [15:24:18] (03CR) 10CI reject: [V:04-1] k8s.pool-depool-node: Add support to downtime/remove downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1114000 (owner: 10JMeybohm) [15:24:18] gah, UTC and DST and oh my [15:24:36] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1013 to wikikube-worker1154 - kamila@cumin1002" [15:24:40] lol, you can say that again. [15:24:51] So according to my IRC logs, Moritz did the move around 1300 my time, and it's now 1600 [15:25:01] well, 1624 [15:25:05] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1014 to wikikube-worker1155 [15:25:09] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1013 to wikikube-worker1154 - kamila@cumin1002" [15:25:09] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:25:10] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1154 [15:25:15] but these alerts are more recent, no? [15:25:26] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [15:25:28] Should I open an IC doc? [15:25:39] First one ~45minutes old [15:25:50] second is ~20 minutes old [15:25:59] that'd still be over two hours from VM move to first flop [15:26:23] the control plane VMs aren't collocated with the etcd VMs from what I remember [15:26:27] (03PS2) 10JMeybohm: k8s.pool-depool-node: Add support to downtime/remove downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1114000 [15:26:35] but the etcd VMs also can't be migrated around anyway [15:26:40] akosiaris actually, I don't think I've ever used that directly. 
Probably a crappy xenserver command that interfaces with it instead ;) [15:26:45] they just get rebooted when needed [15:27:14] which the etcd protocol should account for as well [15:27:38] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2227,2229-2230].codfw.wmnet [15:27:45] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10492502 (10ops-monitoring-bot) depool host wikikube-worker[2227,2229-2230].codfw.wmnet by jayme@cumin1002 with... [15:27:54] !log jayme@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker[2227,2229-2230].codfw.wmnet with reason: Depooled via sre.k8s.pool-depool-node [15:27:59] yeah, and there's three, once they have quorum, two can handle things [15:28:39] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1154 [15:29:03] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1014 to wikikube-worker1155 - kamila@cumin1002" [15:29:18] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1013 to wikikube-worker1154 [15:29:32] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1014 to wikikube-worker1155 - kamila@cumin1002" [15:29:32] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:29:32] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1155 [15:29:32] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1015 to wikikube-worker1156 [15:29:40] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node 
(exit_code=0) depool for host wikikube-worker[2227,2229-2230].codfw.wmnet [15:29:52] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10492513 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin1002 depool fo... [15:29:53] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [15:30:00] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fransc1001 - https://phabricator.wikimedia.org/T367814#10492525 (10cmooney) [15:30:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on parse1016:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:32:03] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1155 [15:32:18] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fransc1001 - https://phabricator.wikimedia.org/T367814#10492529 (10cmooney) [15:32:39] (03CR) 10CI reject: [V:04-1] k8s.pool-depool-node: Add support to downtime/remove downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1114000 (owner: 10JMeybohm) [15:32:42] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1014 to wikikube-worker1155 [15:32:48] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fransc1001 - https://phabricator.wikimedia.org/T367814#10492531 (10cmooney) [15:33:07] PROBLEM - BGP status on lsw1-d3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:33:33] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Renaming parse1015 to wikikube-worker1156 - kamila@cumin1002" [15:33:37] PROBLEM - BGP status on lsw1-c6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:34:23] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fransc1001 - https://phabricator.wikimedia.org/T367814#10492540 (10cmooney) [15:35:29] kubernetes-codfw BGP errors expected, https://phabricator.wikimedia.org/T383709 [15:36:14] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1016 to wikikube-worker1157 [15:36:14] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1015 to wikikube-worker1156 - kamila@cumin1002" [15:36:15] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:36:15] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1156 [15:36:39] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [15:36:46] (03PS2) 10Cyndywikime: Add configurable MinimumTasksPerTopic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113984 (https://phabricator.wikimedia.org/T383714) [15:37:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T384592)', diff saved to https://phabricator.wikimedia.org/P72384 and previous config saved to /var/cache/conftool/dbconfig/20250124-153743-marostegui.json [15:37:48] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [15:37:56] (03PS1) 10Urbanecm: [testwiki] Babel: Enable CommunityConfiguration integration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114002 
(https://phabricator.wikimedia.org/T374348) [15:37:58] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1229.eqiad.wmnet with reason: Maintenance [15:38:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1229 (T384592)', diff saved to https://phabricator.wikimedia.org/P72385 and previous config saved to /var/cache/conftool/dbconfig/20250124-153805-marostegui.json [15:38:16] (03CR) 10Urbanecm: [C:04-2] "not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114002 (https://phabricator.wikimedia.org/T374348) (owner: 10Urbanecm) [15:38:59] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1013.eqiad.wmnet with OS bullseye [15:39:03] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet'] [15:39:37] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1156 [15:40:15] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1015 to wikikube-worker1156 [15:40:33] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1017 to wikikube-worker1158 [15:40:48] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1016 to wikikube-worker1157 - kamila@cumin1002" [15:41:11] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1016 to wikikube-worker1157 - kamila@cumin1002" [15:41:11] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:41:12] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1157 [15:41:13] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [15:42:12] !log root@cumin2002 DONE 
(PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Slavina Stefanova out of all services on: 1010 hosts [15:42:20] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1157 [15:42:59] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1016 to wikikube-worker1157 [15:43:12] !log root@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Slavina Stefanova out of all services on: 1221 hosts [15:43:37] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:43:37] FIRING: [3x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:44:45] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1017 to wikikube-worker1158 - kamila@cumin1002" [15:44:50] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1017 to wikikube-worker1158 - kamila@cumin1002" [15:44:50] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:44:50] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1158 [15:45:12] So the two flops were almost exactly 20m apart, and we've been 40m since the second one. Plus no errors in the kubelet logs of the two ctrl nodes. I am calling this tentatively fixed, but will keep an eye on things. 
[15:45:20] (03CR) 10Hashar: [C:04-1] "See my previous comment requesting to verify the RSA-2048 certs are no more in use and that it is indeed fine to move to ECDSA. Given that" [puppet] - 10https://gerrit.wikimedia.org/r/1075614 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [15:46:13] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1158 [15:46:52] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1017 to wikikube-worker1158 [15:46:56] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1154.eqiad.wmnet wikikube-worker1155.eqiad.wmnet wikikube-worker1156.eqiad.wmnet wikikube-worker1157.eqiad.wmnet wikikube-worker1158.eqiad.wmnet on all recursors [15:46:59] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1154.eqiad.wmnet wikikube-worker1155.eqiad.wmnet wikikube-worker1156.eqiad.wmnet wikikube-worker1157.eqiad.wmnet wikikube-worker1158.eqiad.wmnet on all recursors [15:47:42] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet'] [15:49:34] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1154.eqiad.wmnet with OS bookworm [15:49:37] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1154 [15:49:37] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1154 [15:49:44] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1155.eqiad.wmnet with OS bookworm [15:49:48] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1155 [15:49:48] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1155 [15:50:16] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host 
wikikube-worker1156.eqiad.wmnet with OS bookworm [15:50:19] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1156 [15:50:19] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1156 [15:51:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T384592)', diff saved to https://phabricator.wikimedia.org/P72386 and previous config saved to /var/cache/conftool/dbconfig/20250124-155130-marostegui.json [15:51:35] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [15:52:16] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1157.eqiad.wmnet with OS bookworm [15:52:19] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1157 [15:52:19] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1157 [15:52:35] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1158.eqiad.wmnet with OS bookworm [15:52:38] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1158 [15:52:39] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1158 [15:55:15] (03CR) 10Vgutierrez: "our CDN has removed RSA certificates already and gerrit ciphersuites configuration enforcing >=TLSv1.2 already enforces the usage or a rea" [puppet] - 10https://gerrit.wikimedia.org/r/1075614 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [15:55:40] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet'] [15:56:04] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet'] [15:56:30] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade 
firmware for hosts ['cloudcephosd1013.eqiad.wmnet'] [15:56:39] !log andrew@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet'] [15:57:19] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet'] [15:57:38] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet'] [15:57:51] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet'] [15:57:58] !log andrew@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet'] [15:58:32] (03PS1) 10Raymond Ndibe: [toolforge::harbor] use latest thirdparty/docker [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720) [15:58:58] (03CR) 10CI reject: [V:04-1] [toolforge::harbor] use latest thirdparty/docker [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720) (owner: 10Raymond Ndibe) [15:59:58] !log pt1979@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet'] [16:00:20] !log pt1979@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet'] [16:00:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:01:15] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T383620#10492667 (10kamila) [16:02:54] 
(03PS2) 10Raymond Ndibe: [toolforge::harbor] use latest thirdparty/docker [puppet] - 10https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720) [16:04:25] !log pt1979@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1013'] [16:04:43] !log pt1979@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1013'] [16:04:57] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:05:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:05:50] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1155.eqiad.wmnet with reason: host reimage [16:06:23] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1156.eqiad.wmnet with reason: host reimage [16:06:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:06:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P72387 and previous config saved to /var/cache/conftool/dbconfig/20250124-160637-marostegui.json [16:07:45] (03PS2) 10Raymond Ndibe: 
[toolforge::harbor] upgrade harbor v2.10.1 ---> v2.12.2 [puppet] - 10https://gerrit.wikimedia.org/r/1113871 (https://phabricator.wikimedia.org/T358225) [16:09:52] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1155.eqiad.wmnet with reason: host reimage [16:11:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:13:33] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1156.eqiad.wmnet with reason: host reimage [16:16:16] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2229 [16:16:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2229 [16:18:03] (03CR) 10JMeybohm: [V:03+2 C:03+2] Import upstream release 1.24.2 [debs/istioctl] - 10https://gerrit.wikimedia.org/r/1113460 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [16:19:57] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:20:28] 10SRE-tools, 06Infrastructure-Foundations: sre.hardware.upgrade-firmware needs work - https://phabricator.wikimedia.org/T384722#10492747 (10Andrew) [16:20:39] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet'] [16:20:48] !log andrew@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade 
firmware for hosts ['cloudcephosd1013.eqiad.wmnet'] [16:21:06] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet'] [16:21:14] !log andrew@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1013.eqiad.wmnet'] [16:21:40] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1158.eqiad.wmnet with reason: host reimage [16:21:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P72388 and previous config saved to /var/cache/conftool/dbconfig/20250124-162144-marostegui.json [16:21:47] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1157.eqiad.wmnet with reason: host reimage [16:21:53] 10SRE-tools, 06Infrastructure-Foundations: sre.hardware.upgrade-firmware needs work - https://phabricator.wikimedia.org/T384722#10492754 (10Andrew) ` andrew@cumin1002:~$ sudo cookbook sre.hardware.upgrade-firmware --new --c nic 'cloudcephosd1013.eqiad.wmnet' Acquired lock for key /spicerack/locks/cookbooks/sr... 
[16:22:10] (03PS1) 10JMeybohm: Add bash-completion to Build-Depends [debs/istioctl] - 10https://gerrit.wikimedia.org/r/1114008 (https://phabricator.wikimedia.org/T341984) [16:22:39] (03CR) 10JMeybohm: [V:03+2 C:03+2] Add bash-completion to Build-Depends [debs/istioctl] - 10https://gerrit.wikimedia.org/r/1114008 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [16:23:37] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:23:56] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2230 [16:24:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2230 [16:24:31] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1158.eqiad.wmnet with reason: host reimage [16:26:27] !log imported istioctl 1.24.2-1 to bullseye/bookworm-wikimedia T341984 [16:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:32] T341984: Update Kubernetes clusters to 1.31 - https://phabricator.wikimedia.org/T341984 [16:27:43] (03PS1) 10Elukey: kserve: add support for PSS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114012 (https://phabricator.wikimedia.org/T369493) [16:28:27] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1157.eqiad.wmnet with reason: host reimage [16:28:37] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - 
https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:32:11] !log pt1979@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1013'] [16:33:22] !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1013'] [16:33:29] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1156.eqiad.wmnet with OS bookworm [16:34:08] 10SRE-tools, 06Infrastructure-Foundations: sre.hardware.upgrade-firmware needs work - https://phabricator.wikimedia.org/T384722#10492840 (10Andrew) 05Open→03Invalid papaul just tried and it worked for him, so maybe I was doing something silly? The usage statement still needs work but I can probably fi... [16:35:10] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1013.eqiad.wmnet with OS bullseye [16:35:28] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2227 [16:35:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2227 [16:38:21] RECOVERY - BGP status on lsw1-d3-codfw.mgmt is OK: BGP OK - up: 32, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:38:49] RECOVERY - BGP status on lsw1-c6-codfw.mgmt is OK: BGP OK - up: 36, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:43:04] (03PS1) 10Vgutierrez: site,swift: Use profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/1114015 (https://phabricator.wikimedia.org/T373020) [16:43:16] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2227,2229-2230].codfw.wmnet [16:43:17] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking 
List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10492935 (10Jhancock.wm) [16:43:19] !log jayme@cumin1002 START - Cookbook sre.hosts.remove-downtime for wikikube-worker[2227,2229-2230].codfw.wmnet [16:43:20] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-worker[2227,2229-2230].codfw.wmnet [16:43:21] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2227,2229-2230].codfw.wmnet [16:43:24] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10492939 (10ops-monitoring-bot) pool host wikikube-worker[2227,2229-2230].codfw.wmnet by jayme@cumin1002 with re... [16:43:28] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10492940 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin1002 pool for... 
[16:43:56] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1158.eqiad.wmnet with OS bookworm [16:46:04] (03PS1) 10Elukey: knative-serving: add support for PSS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114016 (https://phabricator.wikimedia.org/T369493) [16:47:12] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1154.eqiad.wmnet with OS bookworm [16:47:17] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1157.eqiad.wmnet with OS bookworm [16:51:03] (03PS2) 10Vgutierrez: site,swift: Use profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/1114015 (https://phabricator.wikimedia.org/T373020) [16:52:11] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1114015 (https://phabricator.wikimedia.org/T373020) (owner: 10Vgutierrez) [16:54:44] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1154.eqiad.wmnet with OS bookworm [16:54:47] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1154 [16:54:47] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1154 [16:57:35] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1155.eqiad.wmnet with OS bookworm [17:01:04] 06SRE, 06Infrastructure-Foundations: sre.netbox.update-extras hits KeyError with logging - https://phabricator.wikimedia.org/T379072#10493041 (10elukey) The cookbook was modified in August 2024, when we moved to Netbox 4: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1056989 And we have used it regu... 
[17:09:09] (03PS1) 10Scott French: mw-(api-ext|web): scale next to 10% of main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114005 (https://phabricator.wikimedia.org/T383845) [17:10:03] 06SRE, 06Infrastructure-Foundations: sre.netbox.update-extras hits KeyError with logging - https://phabricator.wikimedia.org/T379072#10493058 (10elukey) [17:10:21] (03PS1) 10Scott French: mw-on-k8s: aggregate remaining alerts by release name [alerts] - 10https://gerrit.wikimedia.org/r/1114018 (https://phabricator.wikimedia.org/T384532) [17:10:25] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1154.eqiad.wmnet with reason: host reimage [17:12:17] 06SRE, 06Infrastructure-Foundations: sre.netbox.update-extras hits KeyError with logging - https://phabricator.wikimedia.org/T379072#10493063 (10cmooney) >>! In T379072#10493041, @elukey wrote: > And we have used it regularly: https://sal.toolforge.org/production?p=0&q=sre.netbox.update-extras&d= Yep I've use... [17:12:39] (03CR) 10Scott French: "Thanks in advance for the review!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1114018 (https://phabricator.wikimedia.org/T384532) (owner: 10Scott French) [17:12:48] (03CR) 10Vgutierrez: "@mvernon@wikimedia.org let me know what you think, this is preparatory work to migrate to IPIP inbound traffic in ms-fe instances" [puppet] - 10https://gerrit.wikimedia.org/r/1114015 (https://phabricator.wikimedia.org/T373020) (owner: 10Vgutierrez) [17:13:56] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1154.eqiad.wmnet with reason: host reimage [17:32:25] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1154.eqiad.wmnet with OS bookworm [17:38:32] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T383862#10493159 (10Jhancock.wm) 05Open→03Resolved [17:47:57] !log mforns@deploy2002 Started deploy [airflow-dags/analytics@ebb3680]: bump up mediawiki reduced as part of temp accounts deployment [17:48:36] !log mforns@deploy2002 Finished deploy [airflow-dags/analytics@ebb3680]: bump up mediawiki reduced as part of temp accounts deployment (duration: 01m 00s) [17:56:03] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731 (10cmooney) 03NEW p:05Triage→03Low [17:57:37] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2136.codfw.wmnet [17:59:19] !log Removing db2136 from zarcillo T384479 [17:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:23] T384479: decommission db2136.codfw.wmnet - https://phabricator.wikimedia.org/T384479 [18:03:28] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox [18:04:48] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1225.eqiad.wmnet with reason: Maintenance 
[18:05:34] (03CR) 10Dzahn: [C:03+2] trafficserver: point spiderpig.wikimedia.org to deployment.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1113562 (https://phabricator.wikimedia.org/T383946) (owner: 10Dzahn) [18:05:38] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1154-1158].eqiad.wmnet [18:05:40] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1154-1158].eqiad.wmnet [18:05:44] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:05:45] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts db2136.codfw.wmnet [18:08:03] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2136.codfw.wmnet - https://phabricator.wikimedia.org/T384479#10493236 (10Marostegui) a:05FCeratto-WMF→03None [18:08:38] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2136.codfw.wmnet - https://phabricator.wikimedia.org/T384479#10493242 (10Marostegui) This is ready for #dc-ops [18:17:54] (03CR) 10Dzahn: [C:03+2] "Thank you, Alexandros! I have the same memory, at some point there was some technical reason for that (because I had asked before) but I c" [puppet] - 10https://gerrit.wikimedia.org/r/1113562 (https://phabricator.wikimedia.org/T383946) (owner: 10Dzahn) [18:24:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10493296 (10phaultfinder) [18:33:59] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Neslihan Turan - WMDE - https://phabricator.wikimedia.org/T384017#10493308 (10KFrancis) Hi all, the NDA is complete. Thanks! 
[18:48:01] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1233.eqiad.wmnet with reason: Maintenance [18:48:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1233 (T384592)', diff saved to https://phabricator.wikimedia.org/P72390 and previous config saved to /var/cache/conftool/dbconfig/20250124-184807-marostegui.json [18:48:12] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [18:53:50] (03CR) 10Dzahn: [C:03+2] apt: update gitlab-ce to 17.7 [puppet] - 10https://gerrit.wikimedia.org/r/1113975 (https://phabricator.wikimedia.org/T379598) (owner: 10Jelto) [19:03:37] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:04:09] (03CR) 10Dzahn: [C:03+2] "deployed, ran puppet on apt1002 and ran the reprepro checkupdate/update commands." 
[puppet] - 10https://gerrit.wikimedia.org/r/1113975 (https://phabricator.wikimedia.org/T379598) (owner: 10Jelto) [19:06:26] PROBLEM - SSH on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:06:28] PROBLEM - rt.wikimedia.org tls expiry on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [19:08:08] PROBLEM - rt.wikimedia.org requires authentication on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [19:08:37] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:09:20] RECOVERY - rt.wikimedia.org tls expiry on moscovium is OK: OK - Certificate rt.discovery.wmnet will expire on Tue 18 Feb 2025 07:56:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [19:09:24] RECOVERY - SSH on moscovium is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:09:32] (03PS7) 10Dzahn: gerrit: fix todo from 2022, remove nist key setting [puppet] - 10https://gerrit.wikimedia.org/r/1064413 (https://phabricator.wikimedia.org/T315942) [19:10:37] (03CR) 10Dzahn: [C:03+1] "@Antoine, did you have any concerns for this one? 
I think we have sometimes mentally mixed this one up with the other key, the RSA key in " [puppet] - 10https://gerrit.wikimedia.org/r/1064413 (https://phabricator.wikimedia.org/T315942) (owner: 10Dzahn) [19:12:26] PROBLEM - SSH on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:12:28] PROBLEM - rt.wikimedia.org tls expiry on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [19:12:49] I am going to remove that monitoring.^ [19:13:12] the service is going away soon enough to ignore that. [19:13:18] ah! [19:13:21] thanks [19:13:23] FIRING: ProbeDown: Service moscovium:443 has failed probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:13:35] (03PS1) 10Dzahn: requesttracker: remove blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1114038 [19:15:01] RECOVERY - rt.wikimedia.org requires authentication on moscovium is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 536 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [19:15:16] RECOVERY - SSH on moscovium is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:15:18] RECOVERY - rt.wikimedia.org tls expiry on moscovium is OK: OK - Certificate rt.discovery.wmnet will expire on Tue 18 Feb 2025 07:56:00 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [19:18:23] RESOLVED: ProbeDown: Service moscovium:443 has failed probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:18:46] not sure I got the right monitoring check yet.. sending to LONG downtime [19:19:01] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 90 days, 0:00:00 on moscovium.eqiad.wmnet with reason: to be decomed [19:22:47] (03PS2) 10Dzahn: requesttracker: remove blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1114038 (https://phabricator.wikimedia.org/T384721) [19:27:28] (03CR) 10Dzahn: [C:03+2] requesttracker: remove blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1114038 (https://phabricator.wikimedia.org/T384721) (owner: 10Dzahn) [19:27:42] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 90 days, 0:00:00 on moscovium.eqiad.wmnet with reason: to be decomed [19:34:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T384592)', diff saved to https://phabricator.wikimedia.org/P72391 and previous config saved to /var/cache/conftool/dbconfig/20250124-193404-marostegui.json [19:34:09] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [19:43:37] FIRING: [3x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:46:30] FIRING: [3x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-eqiad.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - 
https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [19:49:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P72392 and previous config saved to /var/cache/conftool/dbconfig/20250124-194911-marostegui.json [19:51:30] FIRING: [5x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [20:04:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P72393 and previous config saved to /var/cache/conftool/dbconfig/20250124-200419-marostegui.json [20:04:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10493696 (10phaultfinder) [20:19:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T384592)', diff saved to https://phabricator.wikimedia.org/P72394 and previous config saved to /var/cache/conftool/dbconfig/20250124-201926-marostegui.json [20:19:41] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1246.eqiad.wmnet with reason: Maintenance [20:19:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1246 (T384592)', diff saved to https://phabricator.wikimedia.org/P72395 and previous config saved to /var/cache/conftool/dbconfig/20250124-201947-marostegui.json [20:28:37] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:04:21] (03CR) 10Jforrester: "Yeah, the bespoke legal situation for Wikifunctions hasn't changed." [puppet] - 10https://gerrit.wikimedia.org/r/1072268 (owner: 10Jforrester) [21:05:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T384592)', diff saved to https://phabricator.wikimedia.org/P72396 and previous config saved to /var/cache/conftool/dbconfig/20250124-210515-marostegui.json [21:05:21] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [21:08:56] (03PS1) 10Zabe: Increase revision-slots cache expiry back to default for most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114060 (https://phabricator.wikimedia.org/T183490) [21:12:40] !log amastilovic@deploy2002 Started deploy [airflow-dags/platform_eng@3907ed7]: (no justification provided) [21:13:14] !log amastilovic@deploy2002 Finished deploy [airflow-dags/platform_eng@3907ed7]: (no justification provided) (duration: 00m 35s) [21:15:48] !log amastilovic@deploy2002 Started deploy [airflow-dags/platform_eng@3907ed7]: (no justification provided) [21:15:57] !log amastilovic@deploy2002 Finished deploy [airflow-dags/platform_eng@3907ed7]: (no justification provided) (duration: 00m 10s) [21:20:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P72397 and previous config saved to /var/cache/conftool/dbconfig/20250124-212023-marostegui.json [21:24:13] FIRING: [3x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:33:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high 
rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [21:35:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P72398 and previous config saved to /var/cache/conftool/dbconfig/20250124-213530-marostegui.json [21:38:37] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:38:49] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [21:42:30] !log amastilovic@deploy2002 Started deploy [airflow-dags/platform_eng@ebb3680]: (no justification provided) [21:42:42] (03PS1) 10Gergő Tisza: Update CentralAuth multi-DC rules for SUL3 [puppet] - 10https://gerrit.wikimedia.org/r/1114070 (https://phabricator.wikimedia.org/T363695) [21:43:00] !log amastilovic@deploy2002 Finished deploy [airflow-dags/platform_eng@ebb3680]: (no justification provided) (duration: 00m 31s) [21:47:07] !log Testing thermal settings on cp7004 (T373993) [21:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:11] T373993: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993 [21:49:54] !log 
brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp7004.magru.wmnet,service=cdn [21:50:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T384592)', diff saved to https://phabricator.wikimedia.org/P72399 and previous config saved to /var/cache/conftool/dbconfig/20250124-215037-marostegui.json [21:50:42] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [21:50:53] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [21:51:30] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7004.magru.wmnet with reason: Thermal settings testing (T373993) [22:02:09] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [22:05:49] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cloudgw1003 - vriley@cumin1002" [22:07:15] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cloudgw1003 - vriley@cumin1002" [22:07:15] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:08:16] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudgw1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:08:28] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudgw1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:10:24] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet,service=(cdn|ats-be) [22:10:35] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7003.magru.wmnet,service=(cdn|ats-be) 
[22:10:41] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7008.magru.wmnet,service=(cdn|ats-be) [22:10:55] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7006.magru.wmnet,service=(cdn|ats-be) [22:11:15] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp700[2-4].magru.wmnet,service=(cdn|ats-be) [22:11:24] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7010.magru.wmnet,service=(cdn|ats-be) [22:11:27] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7015.magru.wmnet,service=(cdn|ats-be) [22:11:43] !log pool bunch of cp7x in magru for ats-be that were depooled [22:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:09] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp7004.magru.wmnet,service=cdn [22:18:32] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp7004.magru.wmnet [22:19:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10493957 (10phaultfinder) [22:42:57] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db2148.codfw.wmnet with reason: Maintenance [22:43:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2148 (T384592)', diff saved to https://phabricator.wikimedia.org/P72401 and previous config saved to /var/cache/conftool/dbconfig/20250124-224303-marostegui.json [22:43:08] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [22:43:25] (03CR) 10Bartosz Dziewoński: [C:04-1] "Typo – should be `wg`, not `wmg`." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113141 (https://phabricator.wikimedia.org/T378402) (owner: 10Pmiazga) [22:52:49] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudgw1003 [22:54:51] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudgw1003 [22:55:07] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudgw1004 [22:56:18] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudgw1004 [23:04:18] (03PS1) 10BCornwall: magru: Remove ats-be services from conftool [puppet] - 10https://gerrit.wikimedia.org/r/1114074 [23:06:00] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4861/console" [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [23:08:37] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:18:28] 06SRE, 10MW-on-K8s: mwgrep cannot be used from a deployment host - https://phabricator.wikimedia.org/T384764 (10Urbanecm_WMF) 03NEW [23:24:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10494101 (10phaultfinder) [23:26:54] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudgw1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:30:29] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudgw1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:34:07] !log marostegui@cumin1002 dbctl 
commit (dc=all): 'Repooling after maintenance db2148 (T384592)', diff saved to https://phabricator.wikimedia.org/P72402 and previous config saved to /var/cache/conftool/dbconfig/20250124-233407-marostegui.json [23:34:12] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [23:36:54] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7004.magru.wmnet [23:39:16] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudgw1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:39:34] !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for cp7004.magru.wmnet [23:39:35] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp7004.magru.wmnet [23:46:34] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops, 06Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10494148 (10BCornwall) I did some more testing: (Rounded/eyeballed averages) | Profile | Offset | Fan RPS | CPU Temp (Celsius) | Default | None | 4k | 80 | Maximum Perform... [23:49:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P72403 and previous config saved to /var/cache/conftool/dbconfig/20250124-234914-marostegui.json [23:51:30] FIRING: [5x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer