[00:17:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [00:20:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:41:02] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1227970 [00:41:02] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1227970 (owner: TrainBranchBot) [00:50:01] (PS1) Zabe: Start writing to il_target_id on large s6 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1227972 (https://phabricator.wikimedia.org/T413526) [00:53:00] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1227970 (owner: TrainBranchBot) [01:00:46] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:11:25] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1227976 [01:11:25] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1227976 (owner: TrainBranchBot) [01:13:51] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 05s) [01:22:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1247.eqiad.wmnet with reason: Maintenance [01:23:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling 
db1247 (T413525)', diff saved to https://phabricator.wikimedia.org/P87660 and previous config saved to /var/cache/conftool/dbconfig/20260117-012303-marostegui.json [01:23:09] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [01:24:12] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:32:33] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1227976 (owner: TrainBranchBot) [03:34:12] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:18:06] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:20:54] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:06:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T413525)', diff saved to https://phabricator.wikimedia.org/P87661 and previous config saved to /var/cache/conftool/dbconfig/20260117-050635-marostegui.json [05:06:40] T413525: Add il_target_id to imagelinks table in wmf production - 
https://phabricator.wikimedia.org/T413525 [05:09:12] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:16:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P87662 and previous config saved to /var/cache/conftool/dbconfig/20260117-051643-marostegui.json [05:24:12] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:26:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P87663 and previous config saved to /var/cache/conftool/dbconfig/20260117-052652-marostegui.json [05:34:12] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:37:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T413525)', diff saved to https://phabricator.wikimedia.org/P87664 and previous config saved to /var/cache/conftool/dbconfig/20260117-053700-marostegui.json [05:37:05] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [05:37:17] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1248.eqiad.wmnet with reason: Maintenance [05:37:26] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1248 (T413525)', diff saved to https://phabricator.wikimedia.org/P87665 and previous config saved to /var/cache/conftool/dbconfig/20260117-053725-marostegui.json [06:52:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [06:57:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [08:18:06] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:20:54] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:31:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87666 and previous config saved to /var/cache/conftool/dbconfig/20260117-083103-marostegui.json [08:31:09] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [08:31:09] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [08:41:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling 
after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P87667 and previous config saved to /var/cache/conftool/dbconfig/20260117-084111-marostegui.json [08:51:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P87668 and previous config saved to /var/cache/conftool/dbconfig/20260117-085119-marostegui.json [09:01:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87669 and previous config saved to /var/cache/conftool/dbconfig/20260117-090127-marostegui.json [09:01:34] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [09:01:34] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [09:01:44] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance [09:01:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1186 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87670 and previous config saved to /var/cache/conftool/dbconfig/20260117-090151-marostegui.json [09:15:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T413525)', diff saved to https://phabricator.wikimedia.org/P87671 and previous config saved to /var/cache/conftool/dbconfig/20260117-091547-marostegui.json [09:15:52] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [09:24:12] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:25:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P87672 and previous config saved to /var/cache/conftool/dbconfig/20260117-092555-marostegui.json [09:34:12] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:36:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P87673 and previous config saved to /var/cache/conftool/dbconfig/20260117-093604-marostegui.json [09:46:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T413525)', diff saved to https://phabricator.wikimedia.org/P87674 and previous config saved to /var/cache/conftool/dbconfig/20260117-094612-marostegui.json [09:46:18] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [09:46:30] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1249.eqiad.wmnet with reason: Maintenance [09:46:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1249 (T413525)', diff saved to https://phabricator.wikimedia.org/P87675 and previous config saved to /var/cache/conftool/dbconfig/20260117-094637-marostegui.json [11:49:07] PROBLEM - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1204 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 2 UGood : virtual_disk: 1 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [11:49:09] ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro 
Broadcom RAID Controller on an-worker1204 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 2 UGood : virtual_disk: 1 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T414861 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [11:49:19] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1204 - https://phabricator.wikimedia.org/T414861 (ops-monitoring-bot) NEW [12:03:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87676 and previous config saved to /var/cache/conftool/dbconfig/20260117-120324-marostegui.json [12:03:31] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [12:03:32] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [12:13:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P87677 and previous config saved to /var/cache/conftool/dbconfig/20260117-121333-marostegui.json [12:18:06] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:20:54] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:23:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P87678 and 
previous config saved to /var/cache/conftool/dbconfig/20260117-122342-marostegui.json [12:33:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87679 and previous config saved to /var/cache/conftool/dbconfig/20260117-123350-marostegui.json [12:33:57] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [12:33:58] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [12:34:07] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [12:34:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2170 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87680 and previous config saved to /var/cache/conftool/dbconfig/20260117-123414-marostegui.json [13:24:12] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:34:12] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:39:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T413525)', diff saved to https://phabricator.wikimedia.org/P87681 and previous config saved to /var/cache/conftool/dbconfig/20260117-133901-marostegui.json [13:39:06] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [13:49:10] 
!log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P87682 and previous config saved to /var/cache/conftool/dbconfig/20260117-134909-marostegui.json [13:59:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P87683 and previous config saved to /var/cache/conftool/dbconfig/20260117-135917-marostegui.json [14:09:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T413525)', diff saved to https://phabricator.wikimedia.org/P87684 and previous config saved to /var/cache/conftool/dbconfig/20260117-140926-marostegui.json [14:09:30] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [14:09:43] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1252.eqiad.wmnet with reason: Maintenance [14:09:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1252 (T413525)', diff saved to https://phabricator.wikimedia.org/P87685 and previous config saved to /var/cache/conftool/dbconfig/20260117-140950-marostegui.json [14:48:13] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:57:17] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:57:17] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:02:03] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:02:07] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55565 bytes in 0.080 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:02:07] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:09:12] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:12] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:18:06] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:20:54] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:18:43] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145503 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:19:41] PROBLEM - Check unit status of 
httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:23:44] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145503 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:24:12] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:33:43] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145503 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:11:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T413525)', diff saved to https://phabricator.wikimedia.org/P87686 and previous config saved to /var/cache/conftool/dbconfig/20260117-181113-marostegui.json [18:11:19] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [18:19:41] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:21:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to https://phabricator.wikimedia.org/P87687 and 
previous config saved to /var/cache/conftool/dbconfig/20260117-182122-marostegui.json [18:31:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to https://phabricator.wikimedia.org/P87688 and previous config saved to /var/cache/conftool/dbconfig/20260117-183130-marostegui.json [18:41:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T413525)', diff saved to https://phabricator.wikimedia.org/P87689 and previous config saved to /var/cache/conftool/dbconfig/20260117-184138-marostegui.json [18:41:44] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [18:41:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1260.eqiad.wmnet with reason: Maintenance [18:42:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1260 (T413525)', diff saved to https://phabricator.wikimedia.org/P87690 and previous config saved to /var/cache/conftool/dbconfig/20260117-184203-marostegui.json [19:34:12] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:18:06] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:20:54] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [20:25:39] FIRING: [4x] 
CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [20:27:51] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:07:51] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:10:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-eqord and cr3-ulsfo (198.35.26.128) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:14:39] FIRING: CoreBGPDown: Core BGP session down between cr3-ulsfo and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Confed_eqord&var-bgp_neighbor=cr2-eqord - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:15:54] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqord and cr3-ulsfo (198.35.26.128) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:17:51] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:24:12] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [21:27:51] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:29:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-eqord and cr3-ulsfo (198.35.26.128) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:51:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:56:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:25:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87691 and previous config saved to 
/var/cache/conftool/dbconfig/20260117-222511-marostegui.json [22:25:18] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [22:25:18] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [22:35:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P87692 and previous config saved to /var/cache/conftool/dbconfig/20260117-223519-marostegui.json [22:43:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260 (T413525)', diff saved to https://phabricator.wikimedia.org/P87693 and previous config saved to /var/cache/conftool/dbconfig/20260117-224351-marostegui.json [22:43:56] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [22:45:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P87694 and previous config saved to /var/cache/conftool/dbconfig/20260117-224528-marostegui.json [22:54:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260', diff saved to https://phabricator.wikimedia.org/P87695 and previous config saved to /var/cache/conftool/dbconfig/20260117-225359-marostegui.json [22:55:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87696 and previous config saved to /var/cache/conftool/dbconfig/20260117-225536-marostegui.json [22:55:43] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [22:55:43] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [22:55:52] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on 
db1195.eqiad.wmnet with reason: Maintenance [22:56:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1195 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87697 and previous config saved to /var/cache/conftool/dbconfig/20260117-225600-marostegui.json [23:04:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260', diff saved to https://phabricator.wikimedia.org/P87698 and previous config saved to /var/cache/conftool/dbconfig/20260117-230407-marostegui.json [23:07:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [23:14:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260 (T413525)', diff saved to https://phabricator.wikimedia.org/P87699 and previous config saved to /var/cache/conftool/dbconfig/20260117-231416-marostegui.json [23:14:21] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [23:14:33] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1261.eqiad.wmnet with reason: Maintenance [23:14:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1261 (T413525)', diff saved to https://phabricator.wikimedia.org/P87700 and previous config saved to /var/cache/conftool/dbconfig/20260117-231441-marostegui.json [23:34:12] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:40:38] (CR) Ori: [C:+1] "Tested manually: 
https://wikitech.wikimedia.org/w/index.php?title=Release_Engineering%2FSAL&diff=2375568&oldid=2375301. LMK if you want th" [puppet] - https://gerrit.wikimedia.org/r/1219190 (https://phabricator.wikimedia.org/T99740) (owner: Krinkle)