[00:00:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P75572 and previous config saved to /var/cache/conftool/dbconfig/20250429-000049-fceratto.json [00:03:42] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:08:25] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:09:52] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1139576 [00:09:52] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1139576 (owner: 10TrainBranchBot) [00:12:08] (03PS1) 10Zabe: enwiki and commons: Increase revision-slots cache expiry again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139577 (https://phabricator.wikimedia.org/T183490) [00:15:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P75573 and previous config saved to /var/cache/conftool/dbconfig/20250429-001557-fceratto.json [00:18:42] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [00:31:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T392806)', diff saved to 
https://phabricator.wikimedia.org/P75574 and previous config saved to /var/cache/conftool/dbconfig/20250429-003104-fceratto.json [00:31:24] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance [00:31:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1226 (T392806)', diff saved to https://phabricator.wikimedia.org/P75575 and previous config saved to /var/cache/conftool/dbconfig/20250429-003131-fceratto.json [00:38:27] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1139576 (owner: 10TrainBranchBot) [00:39:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T392806)', diff saved to https://phabricator.wikimedia.org/P75576 and previous config saved to /var/cache/conftool/dbconfig/20250429-003948-fceratto.json [00:47:21] RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops [00:50:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [00:54:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P75577 and previous config saved to /var/cache/conftool/dbconfig/20250429-005455-fceratto.json [01:03:18] 06SRE, 06serviceops-radar, 13Patch-For-Review: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10774987 (10Legoktm) >>! 
In T392834#10773349, @elukey wrote: > ` > elukey@mwmaint1002:/home$ sudo du -hs /home/* | sort -h | tail > ... > 11G /home/legoktm > ` > > The home dirs may be... [01:09:38] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.44.0-wmf.27 [core] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1139578 (https://phabricator.wikimedia.org/T386222) [01:09:40] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.44.0-wmf.27 [core] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1139578 (https://phabricator.wikimedia.org/T386222) (owner: 10TrainBranchBot) [01:10:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P75578 and previous config saved to /var/cache/conftool/dbconfig/20250429-011002-fceratto.json [01:21:45] 06SRE, 06serviceops-radar, 13Patch-For-Review: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10774996 (10Dzahn) >>! In T392834#10774930, @bd808 wrote: > Dropping priority to High as it seems @Dzahn's cleanup work has taken care of the immediate problem. I'll leave it to him and... 
[01:22:44] (03Merged) 10jenkins-bot: Branch commit for wmf/1.44.0-wmf.27 [core] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1139578 (https://phabricator.wikimedia.org/T386222) (owner: 10TrainBranchBot) [01:25:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T392806)', diff saved to https://phabricator.wikimedia.org/P75579 and previous config saved to /var/cache/conftool/dbconfig/20250429-012509-fceratto.json [01:25:28] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1255.eqiad.wmnet with reason: Maintenance [01:25:45] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db[1256-1257].eqiad.wmnet with reason: Maintenance [01:35:28] FIRING: KeyholderUnarmed: 1 unarmed Keyholder key(s) on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [01:50:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T0200) [02:23:25] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:23:55] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:33:45] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 10MediaWiki-Uploading, 07Wikimedia-production-error: Wikimedia\RequestTimeout\RequestTimeoutException: The maximum execution time of {limit} seconds was exceeded (via Special:UploadStash) - https://phabricator.wikimedia.org/T381109#10775079 (... [02:33:55] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:35:27] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:38:55] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:54:42] 06SRE, 06[Archived]Wikidata Dev Team, 10Prod-Kubernetes, 06Traffic, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10775088 (10Kirilloparma) >>! In T374230#10771849, @Silvan_WMDE wrote: > @Kirillopa... 
[03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T0300) [03:01:46] (03PS1) 10TrainBranchBot: testwikis to 1.44.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139583 (https://phabricator.wikimedia.org/T386222) [03:01:48] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.44.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139583 (https://phabricator.wikimedia.org/T386222) (owner: 10TrainBranchBot) [03:02:37] (03Merged) 10jenkins-bot: testwikis to 1.44.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139583 (https://phabricator.wikimedia.org/T386222) (owner: 10TrainBranchBot) [03:03:00] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.44.0-wmf.27 refs T386222 [03:03:05] T386222: 1.44.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T386222 [03:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [03:13:42] FIRING: [7x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:19:18] 06SRE, 06[Archived]Wikidata Dev Team, 10Prod-Kubernetes, 06Traffic, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10775097 (10Jakob_WMDE) >>! In T374230#10775088, @Kirilloparma wrote: > > @Silvan_WMD... 
[03:23:55] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:24:55] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:53:42] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:53:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [04:00:04] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T0400) [04:03:42] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:04:24] !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.44.0-wmf.27 refs T386222 (duration: 61m 23s) [04:04:28] T386222: 1.44.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T386222 [04:06:25] PROBLEM - Hadoop NodeManager on analytics1071 is CRITICAL: PROCS CRITICAL: 
0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:11:09] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:11:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [04:12:07] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:16:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [04:17:47] (03CR) 10Pppery: "Ptwikibooks isn't ready yet, it has its own separate ugly set of special cases:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139517 (https://phabricator.wikimedia.org/T380909) (owner: 10Zoe) [04:18:42] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [04:21:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [04:40:11] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:41:07] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:41:25] RECOVERY - Hadoop NodeManager on analytics1071 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:59:55] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:00:55] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:07:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:08:07] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:08:27] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 207, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:08:27] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 111, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:09:05] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 46, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:09:07] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 129, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:09:39] FIRING: TransitBGPDown: Transit BGP session down between cr2-codfw and Lumen (2001:1900:2100::4b41) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-codfw:9804&var-bgp_group=Transit6&var-bgp_neighbor=Lumen - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [05:09:51] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown 
[05:10:27] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 111, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:11:39] FIRING: [8x] CoreBGPDown: Core BGP session down between cr1-codfw and cr2-eqsin (103.102.166.130) - group Confed_eqsin - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:14:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Lumen (2001:1900:2100::4b41) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [05:14:51] FIRING: [7x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:14:55] (03CR) 10Arnaudb: "kudos for the ascii arts 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1137840 (https://phabricator.wikimedia.org/T392212) (owner: 10Dzahn) [05:15:21] (03PS2) 10Dzahn: gerrit: replace legacy fact with modern fact [puppet] - 10https://gerrit.wikimedia.org/r/1137842 [05:15:32] (03CR) 10Arnaudb: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1137842 (owner: 10Dzahn) [05:20:17] (03CR) 10Arnaudb: "looks good!" 
[dns] - 10https://gerrit.wikimedia.org/r/1139546 (https://phabricator.wikimedia.org/T392128) (owner: 10AOkoth) [05:20:25] (03CR) 10Arnaudb: [C:03+1] wmnet: change active aphlict host [dns] - 10https://gerrit.wikimedia.org/r/1139546 (https://phabricator.wikimedia.org/T392128) (owner: 10AOkoth) [05:35:28] FIRING: KeyholderUnarmed: 1 unarmed Keyholder key(s) on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [05:48:37] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10775197 (10Marostegui) Thank you @VRiley-WMF - I will reimage the host. [05:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:57:01] "Error: 503, Backend fetch failed at Tue, 29 Apr 2025 05:56:47 GMT" [05:57:01] :O [05:57:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T0600) [06:00:05] marostegui, Amir1, and federico3: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T0600). 
[06:00:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.ipmi-password-reset [06:00:25] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.ipmi-password-reset (exit_code=99) [06:00:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.ipmi-password-reset [06:00:59] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.ipmi-password-reset (exit_code=99) [06:03:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1246.eqiad.wmnet with OS bookworm [06:04:04] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10775225 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host db1246.eqiad.wmnet with OS bookworm [06:04:09] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:05:09] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:20:23] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7002.magru.wmnet [06:21:50] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1246.eqiad.wmnet with reason: host reimage [06:22:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet [06:25:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1246.eqiad.wmnet with reason: host reimage [06:26:37] (03CR) 10Marostegui: [C:03+2] instance.schema: Add x3 [puppet] - 10https://gerrit.wikimedia.org/r/1139350 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [06:28:08] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 130, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:28:08] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:29:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Lumen (2001:1900:2100::4b41) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [06:29:51] FIRING: [7x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:30:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet [06:30:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7002.magru.wmnet [06:31:39] FIRING: [8x] CoreBGPDown: Core BGP session down between cr1-codfw and cr2-eqsin (103.102.166.130) - group Confed_eqsin - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [06:31:59] (03PS1) 10Brouberol: airflow: separate postgresql and airflow helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139657 (https://phabricator.wikimedia.org/T391348) [06:32:07] (03PS1) 10Brouberol: deployment_server: provision dedicated kubeconfigs for airflow PGs [puppet] - 10https://gerrit.wikimedia.org/r/1139659 (https://phabricator.wikimedia.org/T391348) [06:32:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1033 es2033 T391921', diff saved to https://phabricator.wikimedia.org/P75580 and previous config saved to 
/var/cache/conftool/dbconfig/20250429-063219-marostegui.json [06:32:24] T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921 [06:32:43] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2033.codfw.wmnet,es1033.eqiad.wmnet with reason: Maintenance [06:33:00] (03PS1) 10Marostegui: es1033: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1139672 (https://phabricator.wikimedia.org/T391921) [06:33:24] PROBLEM - ganeti-wconfd running on ganeti7004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [06:33:40] (03CR) 10Marostegui: [C:03+2] es1033: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1139672 (https://phabricator.wikimedia.org/T391921) (owner: 10Marostegui) [06:33:42] FIRING: [7x] ProbeDown: Service ganeti7002:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:34:24] PROBLEM - ganeti-wconfd running on ganeti7001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [06:35:27] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:35:41] (03PS1) 10Marostegui: es2033: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1139703 (https://phabricator.wikimedia.org/T391921) [06:37:47] (03CR) 10Marostegui: [C:03+2] es2033: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1139703 (https://phabricator.wikimedia.org/T391921) (owner: 10Marostegui) 
[06:38:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1033 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P75581 and previous config saved to /var/cache/conftool/dbconfig/20250429-063811-root.json [06:40:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P75582 and previous config saved to /var/cache/conftool/dbconfig/20250429-064032-root.json [06:46:04] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1246.eqiad.wmnet with OS bookworm [06:46:13] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10775271 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host db1246.eqiad.wmnet with OS bookworm completed: - db1246 (**WARN**) - Removed from Puppet... [06:46:34] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10775272 (10Marostegui) I've reimaged the host, I had to reset the idrac password. 
[06:51:21] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db1188.eqiad.wmnet onto db1246.eqiad.wmnet [06:51:24] !log marostegui@cumin1002 START - Cookbook sre.mysql.depool db1188 - Depool db1188.eqiad.wmnet to then clone it to db1246.eqiad.wmnet - marostegui@cumin1002 [06:51:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1188 - Depool db1188.eqiad.wmnet to then clone it to db1246.eqiad.wmnet - marostegui@cumin1002 [06:53:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1033 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P75584 and previous config saved to /var/cache/conftool/dbconfig/20250429-065317-root.json [06:55:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P75585 and previous config saved to /var/cache/conftool/dbconfig/20250429-065537-root.json [06:58:10] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:58:10] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:59:06] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:59:06] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:00:04] Amir1, Urbanecm, and awight: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T0700). [07:00:04] No Gerrit patches in the queue for this window AFAICS. 
[07:01:01] (03CR) 10Majavah: [C:03+2] P:wmcs::metricsinfra: Add instance FQDN template [puppet] - 10https://gerrit.wikimedia.org/r/1139511 (https://phabricator.wikimedia.org/T392570) (owner: 10Majavah) [07:02:00] 07Puppet: Puppet broken on db1178.eqiad.wmnet - https://phabricator.wikimedia.org/T392627#10775314 (10elukey) ` elukey@db1178:~$ sudo zgrep -c "SSL_read: sslv3 alert certificate unknown" /var/log/puppet.log* /var/log/puppet.log:0 /var/log/puppet.log.1:0 /var/log/puppet.log.2.gz:0 /var/log/puppet.log.3.gz:0 /... [07:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [07:06:38] (03PS2) 10Filippo Giunchedi: puppetdb: add tunable for maximum-pool-size [puppet] - 10https://gerrit.wikimedia.org/r/1139481 [07:07:33] (03CR) 10Filippo Giunchedi: puppetdb: add tunable for maximum-pool-size (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1139481 (owner: 10Filippo Giunchedi) [07:08:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1033 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P75586 and previous config saved to /var/cache/conftool/dbconfig/20250429-070822-root.json [07:10:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P75587 and previous config saved to /var/cache/conftool/dbconfig/20250429-071042-root.json [07:11:04] !log Reboot all codfw dbproxy2* hosts T392806 [07:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:31] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy[2005-2008].codfw.wmnet with reason: Maintenance [07:13:42] FIRING: [7x] 
SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:23:27] !log imported debdeploy 0.0.99.14-1+deb13u1 to apt.wikimedia.org/main for trixie-wikimedia T391083 [07:23:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1033 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P75588 and previous config saved to /var/cache/conftool/dbconfig/20250429-072328-root.json [07:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:32] T391083: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083 [07:25:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P75589 and previous config saved to /var/cache/conftool/dbconfig/20250429-072548-root.json [07:29:21] (CR) Btullis: [C:+1] deployment_server: provision dedicated kubeconfigs for airflow PGs [puppet] - https://gerrit.wikimedia.org/r/1139659 (https://phabricator.wikimedia.org/T391348) (owner: Brouberol) [07:30:13] (CR) Btullis: [C:+1] airflow: separate postgresql and airflow helmfiles [deployment-charts] - https://gerrit.wikimedia.org/r/1139657 (https://phabricator.wikimedia.org/T391348) (owner: Brouberol) [07:31:49] (PS1) Elukey: profile::pyrra::filesystem::slos: fix citoid's latency bucket [puppet] - https://gerrit.wikimedia.org/r/1139774 (https://phabricator.wikimedia.org/T391852) [07:32:16] Hi all, I'm planning to run a couple of maintenance scripts to add wikidata support for nupwiki (as per T390715). 
Let me know if that will disrupt anyone's deployment [07:32:16] T390715: Add Wikidata support for nupwiki - https://phabricator.wikimedia.org/T390715 [07:33:02] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet [07:34:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet [07:37:28] (CR) Elukey: [C:+2] profile::pyrra::filesystem::slos: fix citoid's latency bucket [puppet] - https://gerrit.wikimedia.org/r/1139774 (https://phabricator.wikimedia.org/T391852) (owner: Elukey) [07:38:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1033 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P75590 and previous config saved to /var/cache/conftool/dbconfig/20250429-073833-root.json [07:39:44] (PS1) Slyngshede: Modern fronted [software/bitu] - https://gerrit.wikimedia.org/r/1139776 (https://phabricator.wikimedia.org/T391443) [07:40:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P75591 and previous config saved to /var/cache/conftool/dbconfig/20250429-074053-root.json [07:42:21] (CR) Brouberol: [C:+2] deployment_server: provision dedicated kubeconfigs for airflow PGs [puppet] - https://gerrit.wikimedia.org/r/1139659 (https://phabricator.wikimedia.org/T391348) (owner: Brouberol) [07:44:06] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 47, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:44:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet [07:44:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet [07:46:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and 
cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [07:48:42] FIRING: [7x] ProbeDown: Service ganeti7001:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:49:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:50:08] !log copied wmf-certificates 1~20230906-1 from bookworm-wikimedia to trixie-wikimedia T391083 [07:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:13] T391083: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083 [07:50:25] (PS1) Arnaudb: gerrit: drop X-Forwarded-For received from clients [puppet] - https://gerrit.wikimedia.org/r/1139778 (https://phabricator.wikimedia.org/T388791) [07:50:55] (CR) Arnaudb: [C:+1] gerrit: drop X-Forwarded-For received from clients [puppet] - https://gerrit.wikimedia.org/r/1139778 (https://phabricator.wikimedia.org/T388791) (owner: Arnaudb) [07:51:08] (CR) Arnaudb: [C:+2] gerrit: drop X-Forwarded-For received from clients [puppet] - https://gerrit.wikimedia.org/r/1139778 (https://phabricator.wikimedia.org/T388791) (owner: Arnaudb) [07:51:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [07:53:06] !log copied cadvisor 0.44.0+ds1-1~wmf1 from bookworm-wikimedia to 
trixie-wikimedia T391083 [07:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:23] (CR) Jelto: [C:+1] "lgtm, similar to the patch discussed in Gerrit" [puppet] - https://gerrit.wikimedia.org/r/1139778 (https://phabricator.wikimedia.org/T388791) (owner: Arnaudb) [07:53:28] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 208, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:53:28] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 114, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:53:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1033 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P75593 and previous config saved to /var/cache/conftool/dbconfig/20250429-075339-root.json [07:53:42] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:54:19] (PS1) Klausman: admin/data.yaml: Add dr0ptp4kt (Adam Baso) to users of ml-lab100x [puppet] - https://gerrit.wikimedia.org/r/1139779 [07:54:50] As mentioned by @joelyrookewmde, we are about to run the maintenance script to add wikidata support for nupwiki [07:54:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:56:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 50%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P75594 and previous config saved to /var/cache/conftool/dbconfig/20250429-075600-root.json [07:56:14] !log suzannewood@mwmaint1002:~$ foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https [07:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:13] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7004.magru.wmnet [07:57:52] PROBLEM - Dell PowerEdge RAID Controller on db2176 is CRITICAL: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [07:57:53] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on db2176 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T392876 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [07:57:58] ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2176 - https://phabricator.wikimedia.org/T392876 (ops-monitoring-bot) NEW [07:59:16] ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2176 - https://phabricator.wikimedia.org/T392876#10775406 (Marostegui) p:Triage→Medium This is a normal s1 slave - can we get a new disk for it? 
[08:00:04] hashar and dduvall: Deploy window MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T0800) [08:01:19] o/ [08:01:25] I am running the train for group0 [08:01:37] (PS1) TrainBranchBot: group0 to 1.44.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/1139780 (https://phabricator.wikimedia.org/T386222) [08:01:38] (CR) TrainBranchBot: [C:+2] group0 to 1.44.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/1139780 (https://phabricator.wikimedia.org/T386222) (owner: TrainBranchBot) [08:01:49] jmm@cumin2002 drain-node (PID 2021306) is awaiting input [08:01:50] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073) (owner: Fabfur) [08:02:27] (Merged) jenkins-bot: group0 to 1.44.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/1139780 (https://phabricator.wikimedia.org/T386222) (owner: TrainBranchBot) [08:03:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7004.magru.wmnet [08:03:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [08:06:44] (CR) Fabfur: cache,haproxy: allowed methods check and set response headers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073) (owner: Fabfur) [08:08:21] (CR) Brouberol: [C:+2] airflow: separate postgresql and airflow helmfiles [deployment-charts] - https://gerrit.wikimedia.org/r/1139657 (https://phabricator.wikimedia.org/T391348) (owner: Brouberol) [08:08:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1033 (re)pooling @ 60%: Repooling', 
diff saved to https://phabricator.wikimedia.org/P75595 and previous config saved to /var/cache/conftool/dbconfig/20250429-080844-root.json [08:11:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7004.magru.wmnet [08:11:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P75596 and previous config saved to /var/cache/conftool/dbconfig/20250429-081106-root.json [08:11:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7004.magru.wmnet [08:12:09] !log klausman@cumin1002 START - Cookbook sre.hosts.reboot-single for host ml-lab1001.eqiad.wmnet [08:13:42] FIRING: [7x] ProbeDown: Service ganeti7004:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:15:45] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1139481 (owner: Filippo Giunchedi) [08:16:09] (CR) Jelto: [C:+1] "change looks good to me, thanks. 
Commit message is a bit off" [deployment-charts] - https://gerrit.wikimedia.org/r/1138459 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth) [08:17:29] !log hashar@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.27 refs T386222 [08:17:30] (CR) Fabfur: [C:+2] cache,haproxy: allowed methods check and set response headers [puppet] - https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073) (owner: Fabfur) [08:17:33] T386222: 1.44.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T386222 [08:18:13] (CR) Fabfur: [C:+2] "Done" [puppet] - https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073) (owner: Fabfur) [08:18:42] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [08:19:19] !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-lab1001.eqiad.wmnet [08:19:28] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2034.codfw.wmnet [08:19:29] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2034.codfw.wmnet [08:19:32] (PS1) Majavah: P:wmcs: toolsdb_replica_cnf: Remove HTTPS redirect [puppet] - https://gerrit.wikimedia.org/r/1139781 [08:20:35] (PS1) Majavah: P:wmcs::proxy::static: Bind on IPv6 [puppet] - https://gerrit.wikimedia.org/r/1139782 (https://phabricator.wikimedia.org/T392826) [08:21:57] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:22:17] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[08:22:38] marostegui@cumin1002 clone (PID 4106512) is awaiting input [08:23:28] (PS2) Majavah: P:wmcs::proxy::static: Bind on IPv6 [puppet] - https://gerrit.wikimedia.org/r/1139782 (https://phabricator.wikimedia.org/T392826) [08:23:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1033 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P75597 and previous config saved to /var/cache/conftool/dbconfig/20250429-082349-root.json [08:24:47] (PS1) Slyngshede: Upgrade to version 7.1.6 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1139783 [08:24:51] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 8 hosts with reason: Maintenance [08:25:57] !log rolling restart haproxykafka on A:cp to apply new configuration https://gerrit.wikimedia.org/r/c/operations/puppet/+/1136679 (T382571) [08:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:01] T382571: [HAProxy migration] HAProxy and VarnishKafka should produce compatible datasets - https://phabricator.wikimedia.org/T382571 [08:26:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P75598 and previous config saved to /var/cache/conftool/dbconfig/20250429-082611-root.json [08:28:01] !log installing wget security updates [08:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:59] !log marostegui@cumin1002 START - Cookbook sre.mysql.pool db1188 slowly with 10 steps - Pool db1188.eqiad.wmnet in after cloning [08:36:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet [08:36:48] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs7002.magru.wmnet} and A:liberica [08:37:09] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs7002.magru.wmnet} and 
A:liberica [08:38:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1033 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P75600 and previous config saved to /var/cache/conftool/dbconfig/20250429-083855-root.json [08:39:32] PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:39:42] !log bounce prometheus-statsd-exporter on stat1011 - T389344 [08:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:46] T389344: analytics/wmde/scripts Graphite to Prometheus migration - https://phabricator.wikimedia.org/T389344 [08:40:08] PROBLEM - LDAP -writable server- on seaborgium is CRITICAL: Could not bind to the LDAP server https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [08:40:25] BGP alert is me [08:40:40] seaborgium.. moritzm ^^ [08:41:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P75601 and previous config saved to /var/cache/conftool/dbconfig/20250429-084116-root.json [08:42:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet [08:42:08] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:42:08] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:42:08] RECOVERY - LDAP -writable server- on seaborgium is OK: LDAP OK - 0.009 seconds response time https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [08:42:31] SRE, Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10775532 (MoritzMuehlenhoff) [08:43:06] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:43:08] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:43:20] (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM. I haven't checked the syntax though." [puppet] - https://gerrit.wikimedia.org/r/1139782 (https://phabricator.wikimedia.org/T392826) (owner: Majavah) [08:43:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet [08:43:44] (CR) Majavah: [C:+2] P:wmcs::proxy::static: Bind on IPv6 [puppet] - https://gerrit.wikimedia.org/r/1139782 (https://phabricator.wikimedia.org/T392826) (owner: Majavah) [08:45:27] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs7002.magru.wmnet [08:46:38] (CR) Sergio Gimeno: [C:-1] "please review target wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1136986 (https://phabricator.wikimedia.org/T341599) (owner: Cyndywikime) [08:48:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet [08:49:32] RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:51:24] (PS1) Majavah: P:wmcs::proxy::static: Fix syntax for binding on both families [puppet] - https://gerrit.wikimedia.org/r/1139786 [08:52:05] (CR) Arturo 
Borrero Gonzalez: [C:+1] "LGTM. I haven't checked the syntax myself though." [puppet] - https://gerrit.wikimedia.org/r/1139786 (owner: Majavah) [08:52:19] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs7002.magru.wmnet [08:54:05] (CR) Majavah: [C:+2] P:wmcs::proxy::static: Fix syntax for binding on both families [puppet] - https://gerrit.wikimedia.org/r/1139786 (owner: Majavah) [08:57:36] vgutierrez: thanks for the pointer, slapd gets restarted automatically every few weeks, this was just unfortunate timing, otherwise this doesn't trigger [08:57:42] (CR) Michael Große: [C:+1] "I think this is ok for now." [puppet] - https://gerrit.wikimedia.org/r/1139515 (https://phabricator.wikimedia.org/T392834) (owner: Gergő Tisza) [08:59:11] (CR) Muehlenhoff: [C:+1] "Looks good" [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1139783 (owner: Slyngshede) [09:00:55] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:01:56] (PS1) Majavah: P:wmcs::proxy::static: Fix listening on IPv4 [puppet] - https://gerrit.wikimedia.org/r/1139791 [09:04:04] (PS3) Zoe: Set flow boards readonly on fiwikimedia and gomwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1139517 (https://phabricator.wikimedia.org/T380909) [09:04:14] (CR) Majavah: [C:+2] P:wmcs::proxy::static: Fix listening on IPv4 [puppet] - https://gerrit.wikimedia.org/r/1139791 (owner: Majavah) [09:09:37] (CR) Vgutierrez: "this needs to be in sync with the racking plan" [puppet] - https://gerrit.wikimedia.org/r/1139559 (https://phabricator.wikimedia.org/T392851) (owner: BCornwall) [09:10:41] (CR) Slyngshede: [V:+2 C:+2] Upgrade to version 7.1.6 
[software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1139783 (owner: Slyngshede) [09:10:44] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs7001.magru.wmnet} and A:liberica [09:11:06] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs7001.magru.wmnet} and A:liberica [09:13:20] PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:18:22] The populateSitesTable.php script we were running seems to have stopped; it succeeded for tswiktionary but did not proceed from ttwiki onwards [09:18:29] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs7001.magru.wmnet [09:19:01] (PS1) Slyngshede: Update Debian changelog - 7.1.6 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1139794 [09:19:14] (CR) Slyngshede: [V:+2 C:+2] Update Debian changelog - 7.1.6 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1139794 (owner: Slyngshede) [09:21:50] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs7001.magru.wmnet [09:22:20] RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:24:29] (PS1) Marostegui: wmnet: Failover m2-master [dns] - https://gerrit.wikimedia.org/r/1139795 (https://phabricator.wikimedia.org/T392806) [09:25:20] (CR) Marostegui: "@fceratto@wikimedia.org please confirm dbproxy1023 is the active one and dbproxy1025 has the same puppet config so I can failover to dbpro" [dns] - https://gerrit.wikimedia.org/r/1139795 (https://phabricator.wikimedia.org/T392806) (owner: Marostegui) [09:25:29] SRE, Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - 
https://phabricator.wikimedia.org/T391852#10775662 (elukey) All right I think both request and latency SLOs are now looking good, way better than before. After a chat with Reuven I realized that we'l... [09:29:53] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs5006.eqsin.wmnet} and A:liberica [09:30:20] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs5006.eqsin.wmnet} and A:liberica [09:31:13] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs5006.eqsin.wmnet [09:31:20] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:34:45] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs5006.eqsin.wmnet [09:35:28] FIRING: KeyholderUnarmed: 1 unarmed Keyholder key(s) on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:37:52] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs5005.eqsin.wmnet} and A:liberica [09:39:29] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs5005.eqsin.wmnet} and A:liberica [09:40:20] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:40:20] (PS1) Jelto: gerrit: require user for gitiles access [puppet] - https://gerrit.wikimedia.org/r/1139798 (https://phabricator.wikimedia.org/T392467) [09:41:29] (PS5) Cyndywikime: Growth-Beta: Configure higher Impact Module edit limits for pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1136986 (https://phabricator.wikimedia.org/T341599) [09:41:52] !log re-arming keyholder in acmechief and acmechief-test instances [09:41:55] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:05] !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet [09:44:19] !log tappof@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet [09:44:51] !log Ran fixStuckGlobalRename.php for T392873 — job (re)started OK [09:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:55] T392873: Unblock stuck global rename of Ikan - https://phabricator.wikimedia.org/T392873 [09:45:18] !log uploading haproxykafka 0.3.7 to reprepro (T387454) [09:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:22] T387454: Add HAproxy termination field to webrequest - https://phabricator.wikimedia.org/T387454 [09:45:28] RESOLVED: KeyholderUnarmed: 1 unarmed Keyholder key(s) on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:46:14] !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet [09:46:17] !log tappof@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet [09:46:35] !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet [09:47:10] (PS1) Federico Ceratto: sre.mysql.clone: speed up pooling in [cookbooks] - https://gerrit.wikimedia.org/r/1139799 (https://phabricator.wikimedia.org/T392883) [09:47:10] (CR) Federico Ceratto: "Tiny change, just a speedup as discussed." 
[cookbooks] - https://gerrit.wikimedia.org/r/1139799 (https://phabricator.wikimedia.org/T392883) (owner: Federico Ceratto) [09:47:13] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs5005.eqsin.wmnet [09:47:30] SRE-tools, Infrastructure-Foundations, Spicerack, Traffic: Spicerack's Icinga module should provide a way to skip specific services in sub-optimal but desired state - https://phabricator.wikimedia.org/T392848#10775722 (elukey) We discussed the options on IRC, to summarize: 1) The DNS cookbook co... [09:48:20] (CR) Federico Ceratto: [C:+2] sre.mysql.clone: fix warnings/tests [cookbooks] - https://gerrit.wikimedia.org/r/1137285 (owner: Volans) [09:48:30] (CR) Marostegui: [C:+1] sre.mysql.clone: speed up pooling in [cookbooks] - https://gerrit.wikimedia.org/r/1139799 (https://phabricator.wikimedia.org/T392883) (owner: Federico Ceratto) [09:50:23] !log tappof@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet [09:50:34] (CR) Arnaudb: [C:+2] gerrit: prevent crawling of some URLs [puppet] - https://gerrit.wikimedia.org/r/1138331 (https://phabricator.wikimedia.org/T392669) (owner: Hashar) [09:50:45] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs5005.eqsin.wmnet [09:51:41] SRE-tools, Infrastructure-Foundations, Spicerack, Traffic: Spicerack's Icinga module should provide a way to skip specific services in sub-optimal but desired state - https://phabricator.wikimedia.org/T392848#10775738 (elukey) From https://icinga.com/docs/icinga-2/latest/doc/24-appendix/ it seems... 
[09:56:27] (CR) Btullis: [C:+1] admin/data.yaml: Add dr0ptp4kt (Adam Baso) to users of ml-lab100x [puppet] - https://gerrit.wikimedia.org/r/1139779 (owner: Klausman) [09:58:54] (CR) Ilias Sarantopoulos: [C:+1] admin/data.yaml: Add dr0ptp4kt (Adam Baso) to users of ml-lab100x [puppet] - https://gerrit.wikimedia.org/r/1139779 (owner: Klausman) [09:58:55] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs5004.eqsin.wmnet} and A:liberica [09:59:10] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs5004.eqsin.wmnet} and A:liberica [09:59:22] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T1000) [10:00:55] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:02:55] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:11:34] (PS1) Btullis: Update clouddumps contactgroups to reflect shared ownership [puppet] - https://gerrit.wikimedia.org/r/1139804 [10:12:39] (PS2) Btullis: Update clouddumps contactgroups to reflect shared ownership [puppet] - https://gerrit.wikimedia.org/r/1139804 [10:14:22] (CR) Btullis: [V:+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5384/console" [puppet] - 
https://gerrit.wikimedia.org/r/1139804 (owner: Btullis) [10:14:57] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs5004.eqsin.wmnet [10:18:29] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs5004.eqsin.wmnet [10:22:59] (PS1) Hashar: gerrit: split Gerrit and Gitiles proxy pools [puppet] - https://gerrit.wikimedia.org/r/1139806 (https://phabricator.wikimedia.org/T392467) [10:22:59] (PS1) Hashar: gerrit: lower connections to Gitiles from 25 to 4 [puppet] - https://gerrit.wikimedia.org/r/1139807 (https://phabricator.wikimedia.org/T392467) [10:25:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet [10:27:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet [10:27:54] (CR) Hashar: "For reference: ProxyPass doc https://httpd.apache.org/docs/2.4/mod/mod_proxy.html#proxypass" [puppet] - https://gerrit.wikimedia.org/r/1139806 (https://phabricator.wikimedia.org/T392467) (owner: Hashar) [10:28:04] (CR) Hashar: "For reference: ProxyPass doc https://httpd.apache.org/docs/2.4/mod/mod_proxy.html#proxypass" [puppet] - https://gerrit.wikimedia.org/r/1139807 (https://phabricator.wikimedia.org/T392467) (owner: Hashar) [10:28:58] (PS2) Gergő Tisza: mediawiki: Make refreshLinkRecommendations job less verbose [puppet] - https://gerrit.wikimedia.org/r/1139515 (https://phabricator.wikimedia.org/T392834) [10:29:12] (CR) Ladsgroup: [C:+2] mediawiki: Make refreshLinkRecommendations job less verbose [puppet] - https://gerrit.wikimedia.org/r/1139515 (https://phabricator.wikimedia.org/T392834) (owner: Gergő Tisza) [10:29:15] (CR) Ladsgroup: [V:+2 C:+2] mediawiki: Make refreshLinkRecommendations job less verbose [puppet] - https://gerrit.wikimedia.org/r/1139515 (https://phabricator.wikimedia.org/T392834) (owner: Gergő Tisza) [10:30:07] !log tappof@cumin1002 
START - Cookbook sre.hosts.reboot-single for host grafana1002.eqiad.wmnet [10:31:11] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs4010.ulsfo.wmnet} and A:liberica [10:31:34] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs4010.ulsfo.wmnet} and A:liberica [10:31:58] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs4010.ulsfo.wmnet [10:32:14] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:32:18] (03CR) 10Kamila Součková: [C:03+2] GlobalBlocking: Migrate fixGlobalBlockWhitelist [puppet] - 10https://gerrit.wikimedia.org/r/1139078 (https://phabricator.wikimedia.org/T388542) (owner: 10Kamila Součková) [10:32:53] (03PS1) 10Mvolz: Change citoid config for test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139808 (https://phabricator.wikimedia.org/T361576) [10:33:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet [10:33:42] (03CR) 10CI reject: [V:04-1] Change citoid config for test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139808 (https://phabricator.wikimedia.org/T361576) (owner: 10Mvolz) [10:33:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet [10:34:02] !log tappof@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana1002.eqiad.wmnet [10:34:06] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:35:22] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs4010.ulsfo.wmnet [10:35:27] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:36:08] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 95, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:36:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet [10:37:18] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:37:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet [10:37:44] (03PS1) 10MVernon: Preseed: select manual setup for apus-be[1,2]004 [puppet] - 10https://gerrit.wikimedia.org/r/1139810 (https://phabricator.wikimedia.org/T392844) [10:38:05] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs4009.ulsfo.wmnet} and A:liberica [10:38:16] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs4009.ulsfo.wmnet} and A:liberica [10:39:14] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:39:42] (03PS2) 10Mvolz: Change citoid config for test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139808 (https://phabricator.wikimedia.org/T361576) [10:40:22] !log kamila@deploy1003 helmfile [codfw] START helmfile.d/services/mw-cron: apply [10:40:27] !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply [10:40:36] !log kamila@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [10:40:46] (03CR) 10Hashar: [C:03+1] gerrit: require user for gitiles access [puppet] - 10https://gerrit.wikimedia.org/r/1139798 
(https://phabricator.wikimedia.org/T392467) (owner: 10Jelto) [10:41:02] !log kamila@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [10:41:06] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:41:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:41:35] (03CR) 10Majavah: "I don't think 429 is a good status code for this. What about a 401 (Authentication required) or just a redirect to the login page?" [puppet] - 10https://gerrit.wikimedia.org/r/1139798 (https://phabricator.wikimedia.org/T392467) (owner: 10Jelto) [10:43:19] (03CR) 10Marostegui: [C:03+1] Preseed: select manual setup for apus-be[1,2]004 [puppet] - 10https://gerrit.wikimedia.org/r/1139810 (https://phabricator.wikimedia.org/T392844) (owner: 10MVernon) [10:43:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet [10:44:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet [10:44:06] FIRING: [7x] ProbeDown: Service ganeti4007:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:44:53] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs4009.ulsfo.wmnet [10:46:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:46:34] (03CR) 10MVernon: [C:03+2] Preseed: select manual setup for apus-be[1,2]004 [puppet] - 10https://gerrit.wikimedia.org/r/1139810 (https://phabricator.wikimedia.org/T392844) (owner: 10MVernon) [10:47:17] (03PS1) 10Kamila Součková: CampaignEvents: Migrate aggregateparticipantanswers-testwiki [puppet] - 10https://gerrit.wikimedia.org/r/1139811 (https://phabricator.wikimedia.org/T385867) [10:47:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1188 slowly with 10 steps - Pool db1188.eqiad.wmnet in after cloning [10:47:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1188.eqiad.wmnet onto db1246.eqiad.wmnet [10:48:10] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs4009.ulsfo.wmnet [10:48:21] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet [10:49:06] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 95, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:49:23] (03PS1) 10Marostegui: db1246: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1139812 (https://phabricator.wikimedia.org/T392874) [10:50:16] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1139811 (https://phabricator.wikimedia.org/T385867) (owner: 10Kamila Součková) [10:51:24] jmm@cumin2002 drain-node (PID 2196390) is awaiting input [10:52:09] (03CR) 10Marostegui: [C:03+2] db1246: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1139812 (https://phabricator.wikimedia.org/T392874) (owner: 10Marostegui) [10:52:33] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: 
Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10775898 (10MatthewVernon) a:05MatthewVernon→03None (done, although with manual setup as we don't know how the boss card will present the SSDs to the OS) [10:53:01] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install apus-be2004 - https://phabricator.wikimedia.org/T392845#10775901 (10MatthewVernon) a:05MatthewVernon→03None (done, although with manual setup as we don't know how the boss card will present the SSDs to the OS) [10:53:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P75613 and previous config saved to /var/cache/conftool/dbconfig/20250429-105304-root.json [10:53:39] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs4008.ulsfo.wmnet} and A:liberica [10:54:00] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs4008.ulsfo.wmnet} and A:liberica [10:54:14] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:55:15] 06SRE, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#10775923 (10elukey) And the issue is known: https://github.com/pyrra-dev/pyrra/issues/1465 https://github.com/pyrra-dev/pyrra/issues/1235 [10:55:56] jouncebot: now and next [10:55:56] For the next 0 hour(s) and 4 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T1000) [10:56:06] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:02:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host 
ldap-maint2001.codfw.wmnet [11:02:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet [11:03:37] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs4008.ulsfo.wmnet [11:06:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet [11:06:54] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs4008.ulsfo.wmnet [11:07:06] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 95, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:08:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P75614 and previous config saved to /var/cache/conftool/dbconfig/20250429-110809-root.json [11:08:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet [11:09:02] 10SRE-swift-storage, 06Commons, 10Thumbor: Error: 429, Too Many Requests - https://phabricator.wikimedia.org/T392348#10775982 (10MatthewVernon) That last URL is an `archive` URL, which I wouldn't generally expect to work (they're for deleted-by-admin objects). 
[11:09:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet [11:09:31] (03PS2) 10Jelto: gerrit: require user for gitiles access [puppet] - 10https://gerrit.wikimedia.org/r/1139798 (https://phabricator.wikimedia.org/T392467) [11:09:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet [11:09:53] (03CR) 10Jelto: "good point, I changed the response to 401 in patchset 2" [puppet] - 10https://gerrit.wikimedia.org/r/1139798 (https://phabricator.wikimedia.org/T392467) (owner: 10Jelto) [11:10:18] !log bounce prometheus-statsd-exporter on stat1011 - T389344 [11:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:23] T389344: analytics/wmde/scripts Graphite to Prometheus migration - https://phabricator.wikimedia.org/T389344 [11:11:08] (03PS1) 10Majavah: P:mariadb: packages_client: Default to 10.6 on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1139817 (https://phabricator.wikimedia.org/T380073) [11:12:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet [11:13:42] FIRING: [7x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:14:29] (03CR) 10Marostegui: [C:03+1] P:mariadb: packages_client: Default to 10.6 on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1139817 (https://phabricator.wikimedia.org/T380073) (owner: 10Majavah) [11:14:53] (03CR) 10Majavah: [C:03+2] P:mariadb: packages_client: Default to 10.6 on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1139817 (https://phabricator.wikimedia.org/T380073) (owner: 10Majavah) [11:16:00] !log jmm@cumin2002 START - Cookbook 
sre.hosts.reboot-single for host krb1002.eqiad.wmnet [11:16:49] (03PS3) 10AOkoth: miscweb: update os-reports image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138459 (https://phabricator.wikimedia.org/T350794) [11:16:56] PROBLEM - ganeti-wconfd running on ganeti4005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [11:17:37] (03CR) 10AOkoth: "Ack. I've updated it. I started this change with a whole different idea in mind." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138459 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [11:21:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb1002.eqiad.wmnet [11:21:27] (03CR) 10Arnaudb: gerrit: lower connections to Gitiles from 25 to 4 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1139807 (https://phabricator.wikimedia.org/T392467) (owner: 10Hashar) [11:23:10] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:23:10] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:23:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P75615 and previous config saved to /var/cache/conftool/dbconfig/20250429-112314-root.json [11:24:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org [11:26:06] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:26:08] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:28:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org [11:28:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org [11:30:13] (03CR) 10Marostegui: [C:03+2] wmnet: Failover m2-master [dns] - 10https://gerrit.wikimedia.org/r/1139795 (https://phabricator.wikimedia.org/T392806) (owner: 10Marostegui) [11:30:20] !log marostegui@dns1006 START - running authdns-update [11:30:38] !log Failover m2 master from dbproxy1023 to dbproxy1025 [11:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:02] (03CR) 10Arnaudb: [C:03+1] "minor question, otherwise lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1139806 (https://phabricator.wikimedia.org/T392467) (owner: 10Hashar) [11:32:50] !log marostegui@dns1006 END - running authdns-update [11:32:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org [11:33:16] (03PS1) 10Majavah: hieradata: Add new eqiad1 proxies [puppet] - 10https://gerrit.wikimedia.org/r/1139818 (https://phabricator.wikimedia.org/T379175) [11:34:30] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti 
node ganeti4005.ulsfo.wmnet [11:35:55] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for madalina - https://phabricator.wikimedia.org/T392893 (10Madalina) 03NEW [11:36:09] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:36:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet [11:37:07] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:37:55] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:38:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P75616 and previous config saved to /var/cache/conftool/dbconfig/20250429-113820-root.json [11:38:55] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:39:27] !log installing curl security updates [11:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:39] (03CR) 10Arnaudb: [C:03+1] gerrit: require user for gitiles access [puppet] - 10https://gerrit.wikimedia.org/r/1139798 (https://phabricator.wikimedia.org/T392467) (owner: 10Jelto) [11:42:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet [11:42:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node 
ganeti4005.ulsfo.wmnet [11:43:42] FIRING: [7x] ProbeDown: Service ganeti4005:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:44:37] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10776072 (10MoritzMuehlenhoff) [11:49:26] !log suzannewood@deploy1003:~$ foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https [11:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P75617 and previous config saved to /var/cache/conftool/dbconfig/20250429-115325-root.json [11:53:30] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host titan2001.codfw.wmnet [11:53:42] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:56:20] (03PS1) 10Majavah: P:wmcs: novaproxy: Add separate keepalived_peers variable [puppet] - 10https://gerrit.wikimedia.org/r/1139821 (https://phabricator.wikimedia.org/T379175) [11:57:05] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5385/co" [puppet] - 10https://gerrit.wikimedia.org/r/1139821 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah) [11:58:00] (03PS2) 10Majavah: P:wmcs: novaproxy: Add separate keepalived_peers variable [puppet] - 10https://gerrit.wikimedia.org/r/1139821 (https://phabricator.wikimedia.org/T379175) [11:58:48] (03CR) 10Majavah: [V:03+1] 
"PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5386/co" [puppet] - 10https://gerrit.wikimedia.org/r/1139821 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah) [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T1200) [12:01:21] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan2001.codfw.wmnet [12:01:44] (03PS1) 10Slyngshede: Permissions: Add comments from permission managers [software/bitu] - 10https://gerrit.wikimedia.org/r/1139823 (https://phabricator.wikimedia.org/T392682) [12:23:26] (03PS1) 10Majavah: P:wmcs::novaproxy: Fix keepalived_peers type [puppet] - 10https://gerrit.wikimedia.org/r/1139838 [12:23:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P75619 and previous config saved to /var/cache/conftool/dbconfig/20250429-122335-root.json [12:23:55] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:24:15] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:24:29] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:25:35] (03CR) 10Majavah: [C:03+2] P:wmcs::novaproxy: Fix keepalived_peers type [puppet] - 10https://gerrit.wikimedia.org/r/1139838 (owner: 10Majavah) [12:25:47] !log Finished populateSitesTable for nupwiki (https://phabricator.wikimedia.org/T390715) [12:25:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:11] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan1001.eqiad.wmnet [12:28:13] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet [12:28:42] FIRING: [10x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:29:15] PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:29:55] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:30:55] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:30:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet [12:31:15] RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:31:39] (03PS1) 10Btullis: Create an SSH private key in the mediawiki-dumps-legacy namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139840 (https://phabricator.wikimedia.org/T390738) [12:31:59] PROBLEM - Host aux-k8s-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100% [12:32:35] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:32:37] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Connect - Anycast 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:32:54] (03PS1) 10Majavah: P:wmcs::novaproxy: Fix keepalived peer list definition [puppet] - 10https://gerrit.wikimedia.org/r/1139841 [12:33:13] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "<3" [software/bitu] - 10https://gerrit.wikimedia.org/r/1139823 (https://phabricator.wikimedia.org/T392682) (owner: 10Slyngshede) [12:34:19] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:34:39] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:34:53] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host titan1002.eqiad.wmnet [12:35:35] (03CR) 10Majavah: [C:03+2] P:wmcs::novaproxy: Fix keepalived peer list definition [puppet] - 10https://gerrit.wikimedia.org/r/1139841 (owner: 10Majavah) [12:35:39] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:36:10] FIRING: [4x] BFDdown: BFD session down between cr1-eqiad and 10.64.48.95 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:36:15] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1139823 (https://phabricator.wikimedia.org/T392682) (owner: 10Slyngshede) [12:36:19] RECOVERY - BFD status on cr1-eqiad is OK: UP: 21 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:36:25] RECOVERY - Host aux-k8s-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 30.47 ms [12:37:05] (03PS1) 10Lucas Werkmeister (WMDE): Remove config for renaming WikibaseEntitySchema propertyType [extensions/EntitySchema] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1139842 (https://phabricator.wikimedia.org/T371196) [12:37:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for 
deployment in the [Tuesday, April 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/EntitySchema] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1139842 (https://phabricator.wikimedia.org/T371196) (owner: 10Lucas Werkmeister (WMDE)) [12:37:22] 10ops-magru, 06DC-Ops, 10Observability-Metrics, 13Patch-For-Review, 10SRE Observability (FY2024/2025-Q3): missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10776256 (10tappof) Actually, they're defined in Puppet like this: ` # drmrs, single phase PDUs facilities::monitor_pdu_1phase... [12:37:37] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:37:37] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:37:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134693 (https://phabricator.wikimedia.org/T371196) (owner: 10Lucas Werkmeister (WMDE)) [12:37:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet [12:37:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet [12:37:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet [12:38:09] (03PS7) 10Tiziano Fogli: pdu_config_netbox: also fetch older PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1135022 (https://phabricator.wikimedia.org/T387231) [12:38:42] FIRING: [9x] ProbeDown: Service ganeti2023:1811 has failed probes 
(tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:38:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P75620 and previous config saved to /var/cache/conftool/dbconfig/20250429-123840-root.json [12:39:39] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:40:39] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:41:10] FIRING: [8x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.20 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:42:06] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan1002.eqiad.wmnet [12:43:25] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:43:42] FIRING: [9x] ProbeDown: Service ganeti2023:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:43:46] jmm@cumin2002 drain-node (PID 2314348) is awaiting input [12:45:23] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:45:23] PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:45:56] jouncebot: now and next [12:45:56] For the next 0 hour(s) and 14 minute(s): Mobileapps/RESTBase/Wikifeeds 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T1200) [12:46:09] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet [12:46:10] FIRING: [10x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.20 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:46:23] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:46:23] RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:46:26] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet [12:46:54] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet [12:47:13] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet [12:48:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet [12:48:25] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:49:10] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: [spicerack] python-kafka does not support python 3.12, there's a fix but there has not been any releases since 2020 - https://phabricator.wikimedia.org/T354410#10776315 (10elukey) 05Open→03Resolved a:03elukey I am tentati... 
[12:50:23] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:50:23] PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:51:10] RESOLVED: [10x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.20 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:51:45] (03PS8) 10Tiziano Fogli: pdu_config_netbox: also fetch older PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1135022 (https://phabricator.wikimedia.org/T387231) [12:52:23] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:52:23] RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:53:42] FIRING: JobUnavailable: Reduced availability for job pushgateway in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:53:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P75621 and previous config saved to /var/cache/conftool/dbconfig/20250429-125347-root.json [12:54:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet [12:55:15] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet [12:55:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet [12:55:51] PROBLEM - BGP status on asw1-by27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, 
AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:55:51] PROBLEM - BFD status on asw1-by27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:57:08] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet [12:57:51] RECOVERY - BGP status on asw1-by27-esams.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:57:51] RECOVERY - BFD status on asw1-by27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:57:53] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet [12:57:54] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135022 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [12:58:24] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet [12:58:42] RESOLVED: JobUnavailable: Reduced availability for job pushgateway in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:58:51] FIRING: [15x] ProbeDown: Service ganeti2024:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T1300). Please do the needful. [13:00:05] Daimona, zip, and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:12] o/ [13:00:21] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs6003.drmrs.wmnet} and A:liberica [13:00:23] my patches are optional btw, if there’s no time I can do them later [13:00:29] PROBLEM - BFD status on asw1-bw27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:00:34] o/ [13:00:43] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs6003.drmrs.wmnet} and A:liberica [13:00:43] is it not 13:00 GMT [13:01:11] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs6003.drmrs.wmnet [13:01:15] oh I see [13:01:15] PROBLEM - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:01:24] I mis-calendared that, but yes, I'm around! [13:01:30] ok ^^ [13:01:48] I think technically it’s 13:00 GMT but Greenwich is not currently in GMT? 
or some nonsense like that [13:02:04] I think I had the idea that this happened at 14:00GMT [13:02:15] RECOVERY - BGP status on asw1-bw27-esams.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:02:19] well, sometimes it does [13:02:24] it’s tied to the san francisco time zone [13:02:29] RECOVERY - BFD status on asw1-bw27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:02:32] so the UTC time jumps around as the US go in and out of daylight savings time [13:02:49] * Daimona is triggered by people talking about time zones and DST [13:03:15] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:03:45] I’ll do the changes by Daimona and zip together, should be harmless [13:03:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138405 (https://phabricator.wikimedia.org/T392240) (owner: 10Daimona Eaytoy) [13:03:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139517 (https://phabricator.wikimedia.org/T380909) (owner: 10Zoe) [13:03:49] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs6003.drmrs.wmnet [13:04:15] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:04:49] spiderpig go brrrrrr [13:05:13] :D [13:06:09] PROBLEM - BFD status on asw1-b3-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:06:21] PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - 
Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:06:28] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet [13:06:52] (03Merged) 10jenkins-bot: Enable the CampaignEvents extension on 43 more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138405 (https://phabricator.wikimedia.org/T392240) (owner: 10Daimona Eaytoy) [13:06:55] (03Merged) 10jenkins-bot: Set flow boards readonly on fiwikimedia and gomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139517 (https://phabricator.wikimedia.org/T380909) (owner: 10Zoe) [13:07:09] RECOVERY - BFD status on asw1-b3-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:07:21] RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:07:46] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1138405|Enable the CampaignEvents extension on 43 more wikis (T392240)]], [[gerrit:1139517|Set flow boards readonly on fiwikimedia and gomwiki (T380909)]] [13:07:52] T392240: Release CampaignEvents extension to multiple ESEAP & SA wikis - https://phabricator.wikimedia.org/T392240 [13:07:52] T380909: [Config] Set Flow to read-only at all *Phase 2b* wikis - https://phabricator.wikimedia.org/T380909 [13:08:08] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs6002.drmrs.wmnet} and A:liberica [13:08:31] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs6002.drmrs.wmnet} and A:liberica [13:10:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet [13:10:53] !log fab@deploy1003 Started deploy [airflow-dags/research@414def7]: (no justification provided) [13:11:09] PROBLEM - BFD status on 
asw1-b4-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:11:15] PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:11:31] !log fab@deploy1003 Finished deploy [airflow-dags/research@414def7]: (no justification provided) (duration: 00m 40s) [13:11:33] PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:12:09] RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:12:24] !log reprepro include bookworm-wikimedia dnsdist_1.8.2-1+wmf12u2_amd64.changes [13:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:33] RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:13:04] (03CR) 10CDanis: [C:03+1] Fastnetmon: bump threshold_pps to 1.75M [puppet] - 10https://gerrit.wikimedia.org/r/1139503 (owner: 10Ayounsi) [13:14:26] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:14:28] !log lucaswerkmeister-wmde@deploy1003 daimona, zoe, lucaswerkmeister-wmde: Backport for [[gerrit:1138405|Enable the CampaignEvents extension on 43 more wikis (T392240)]], [[gerrit:1139517|Set flow boards readonly on fiwikimedia and gomwiki (T380909)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:14:31] (03PS1) 10Ssingh: Revert "wikidough: add healthcheck override for doh1001 and doh2002" [puppet] - 10https://gerrit.wikimedia.org/r/1139849 [13:14:33] T392240: Release 
CampaignEvents extension to multiple ESEAP & SA wikis - https://phabricator.wikimedia.org/T392240 [13:14:34] T380909: [Config] Set Flow to read-only at all *Phase 2b* wikis - https://phabricator.wikimedia.org/T380909 [13:14:43] Daimona, zip: please test on WikimediaDebug :) [13:14:53] tested [13:14:54] looking good [13:14:57] yay [13:15:02] Daimona: I expect you don’t have to test all the wikis ;) [13:15:17] (03CR) 10Ssingh: "Merging since it's a revert." [puppet] - 10https://gerrit.wikimedia.org/r/1139849 (owner: 10Ssingh) [13:15:18] (03CR) 10Ssingh: [C:03+2] Revert "wikidough: add healthcheck override for doh1001 and doh2002" [puppet] - 10https://gerrit.wikimedia.org/r/1139849 (owner: 10Ssingh) [13:15:21] (03PS1) 10Muehlenhoff: Add krb1002 to the list of KDCs presented to Kerberos clients [puppet] - 10https://gerrit.wikimedia.org/r/1139850 (https://phabricator.wikimedia.org/T390863) [13:16:09] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:16:11] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:16:11] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:16:14] ^ expected, reboots [13:16:49] (03CR) 10Ssingh: [C:03+2] Revert "P:auth: temporarily skip returning a WARN on check_authdns_state" [puppet] - 10https://gerrit.wikimedia.org/r/1139529 (owner: 10Ssingh) [13:17:11] RECOVERY - BFD status on cr3-ulsfo is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:17:11] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:17:16] !log force agent run on A:dnsbox [13:17:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host ganeti2025.codfw.wmnet [13:17:17] (03PS1) 10Elukey: k8s: rename V1beta1Eviction to support future upgrades [software/spicerack] - 10https://gerrit.wikimedia.org/r/1139851 (https://phabricator.wikimedia.org/T390857) [13:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet [13:17:34] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet [13:18:07] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 95, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:18:10] FIRING: [4x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:18:34] Lucas_WMDE: looks good, thanks! [13:18:42] !log lucaswerkmeister-wmde@deploy1003 daimona, zoe, lucaswerkmeister-wmde: Continuing with sync [13:18:42] FIRING: [7x] ProbeDown: Service ganeti2025:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:18:44] great, thanks! 
[13:18:58] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1139850 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff) [13:19:00] (03CR) 10Abijeet Patro: [C:03+1] Catalog ContentTranslation tables [puppet] - 10https://gerrit.wikimedia.org/r/1135730 (https://phabricator.wikimedia.org/T386094) (owner: 10Nik Gkountas) [13:19:26] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:19:46] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs6002.drmrs.wmnet [13:20:45] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating - jhancock@cumin2002" [13:20:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating - jhancock@cumin2002" [13:20:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:21:09] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:21:11] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:21:11] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:21:13] RECOVERY - Wikidough DoH Check -IPv4- on doh2002 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 0.132 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [13:22:09] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 95, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status 
[13:22:11] RECOVERY - BFD status on cr3-ulsfo is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:22:11] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:22:23] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs6002.drmrs.wmnet [13:22:46] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:22:58] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum [13:23:09] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:23:10] FIRING: [8x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.7 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:23:15] RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:23:57] !log importing haproxykafka 0.3.8 in bullseye-wikimedia (https://gitlab.wikimedia.org/repos/sre/haproxykafka/-/merge_requests/83) [13:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:03] (03PS1) 10Filippo Giunchedi: thanos: enable auto memlimit [puppet] - 10https://gerrit.wikimedia.org/r/1139852 (https://phabricator.wikimedia.org/T383966) [13:24:06] !log disable puppet on A:durum to progressively roll out CR 1139542 [13:24:07] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:24:09] PROBLEM - Hadoop NodeManager on an-worker1155 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet [13:25:01] PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:25:26] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1138405|Enable the CampaignEvents extension on 43 more wikis (T392240)]], [[gerrit:1139517|Set flow boards readonly on fiwikimedia and gomwiki (T380909)]] (duration: 17m 39s) [13:25:31] T392240: Release CampaignEvents extension to multiple ESEAP & SA wikis - https://phabricator.wikimedia.org/T392240 [13:25:31] T380909: [Config] Set Flow to read-only at all *Phase 2b* wikis - https://phabricator.wikimedia.org/T380909 [13:26:42] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [13:28:10] RESOLVED: [8x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.7 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:28:50] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): cirrussearch2078 (R440 Config D, Row/Rack B2) unable to PXE boot - https://phabricator.wikimedia.org/T392644#10776482 (10Papaul) p:05Triage→03Medium [13:29:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:29:33] 10ops-codfw, 06SRE, 06DC-Ops, 
10Data-Platform-SRE (2025-04-12 - 2025-05-02): cirrussearch2078 (R440 Config D, Row/Rack B2) unable to PXE boot - https://phabricator.wikimedia.org/T392644#10776486 (10Papaul) I will take a look at it when I am on site. Thank you [13:29:46] (03PS14) 10Majavah: dynamicproxy: Provision AAAA records [puppet] - 10https://gerrit.wikimedia.org/r/1088338 (https://phabricator.wikimedia.org/T379175) [13:29:46] (03PS1) 10Majavah: keepalived: Fix IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/1139857 (https://phabricator.wikimedia.org/T379175) [13:30:22] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs6001.drmrs.wmnet} and A:liberica [13:30:34] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs6001.drmrs.wmnet} and A:liberica [13:30:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/EntitySchema] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1139842 (https://phabricator.wikimedia.org/T371196) (owner: 10Lucas Werkmeister (WMDE)) [13:31:24] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2047 [13:31:28] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2048 [13:31:33] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5390/console" [puppet] - 10https://gerrit.wikimedia.org/r/1139857 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah) [13:31:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2047 [13:31:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2048 [13:31:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
ganeti2026.codfw.wmnet [13:31:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet [13:32:23] !log updated haproxykafka on cp1112 to test version 0.3.8 [13:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:43] (03Merged) 10jenkins-bot: Remove config for renaming WikibaseEntitySchema propertyType [extensions/EntitySchema] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1139842 (https://phabricator.wikimedia.org/T371196) (owner: 10Lucas Werkmeister (WMDE)) [13:33:07] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1139842|Remove config for renaming WikibaseEntitySchema propertyType (T371196)]] [13:33:10] (03CR) 10Ssingh: [V:03+1] "Thanks for the review :)" [puppet] - 10https://gerrit.wikimedia.org/r/1139542 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [13:33:10] (03CR) 10Ssingh: [V:03+1 C:03+2] hiera: durum: set do_ech true for all durum hosts [puppet] - 10https://gerrit.wikimedia.org/r/1139542 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [13:33:12] T371196: The EntitySchema type URI is missing from the Wikibase ontology - https://phabricator.wikimedia.org/T371196 [13:33:15] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:33:42] FIRING: [8x] ProbeDown: Service ganeti2025:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:34:04] (03CR) 10Btullis: [C:03+1] "Nice, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1139850 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff) [13:36:47] !log depooling cp1112 to test new haproxykafka version behavior (T387454) [13:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:52] T387454: Add HAproxy termination field to webrequest - https://phabricator.wikimedia.org/T387454 [13:37:01] RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:38:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet [13:38:09] RECOVERY - Hadoop NodeManager on an-worker1155 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:38:17] PROBLEM - Bird Internet Routing Daemon on durum2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:38:25] (03CR) 10Majavah: [V:03+1 C:03+2] keepalived: Fix IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/1139857 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah) [13:39:17] RECOVERY - Bird Internet Routing Daemon on durum2002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:39:29] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Backport for [[gerrit:1139842|Remove config for renaming WikibaseEntitySchema propertyType (T371196)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:39:33] T371196: The EntitySchema type URI is missing from the Wikibase ontology - 
https://phabricator.wikimedia.org/T371196 [13:39:34] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Continuing with sync [13:39:42] https://www.wikidata.org/wiki/Special:EntityData/P12861.ttl still looks good on WikimediaDebug [13:39:47] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs6001.drmrs.wmnet [13:40:11] (03CR) 10Xcollazo: [C:03+1] Enable the dumpsgen user to use an rsync server over ssh from dse-k8s-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1139835 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis) [13:40:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet [13:41:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet [13:42:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:42:24] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs6001.drmrs.wmnet [13:42:26] PROBLEM - Host kubestagemaster2005 is DOWN: PING CRITICAL - Packet loss = 100% [13:42:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:43:15] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:43:40] ping, quick review https://gerrit.wikimedia.org/r/c/operations/puppet/+/1139434 if possible? 
[13:43:42] FIRING: [9x] ProbeDown: Service ganeti2026:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:44:45] !log [correcting] cp1112 has NOT been depooled (T387454) [13:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:50] T387454: Add HAproxy termination field to webrequest - https://phabricator.wikimedia.org/T387454 [13:45:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet [13:45:39] RECOVERY - Host kubestagemaster2005 is UP: PING OK - Packet loss = 0%, RTA = 30.55 ms [13:46:12] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1139842|Remove config for renaming WikibaseEntitySchema propertyType (T371196)]] (duration: 13m 04s) [13:46:16] T371196: The EntitySchema type URI is missing from the Wikibase ontology - https://phabricator.wikimedia.org/T371196 [13:46:57] FIRING: KubernetesCalicoDown: kubestagemaster2005.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2005.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:47:20] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs3010.esams.wmnet} and A:liberica [13:47:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet [13:47:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet [13:47:36] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5391/co" 
[puppet] - 10https://gerrit.wikimedia.org/r/1139852 (https://phabricator.wikimedia.org/T383966) (owner: 10Filippo Giunchedi) [13:47:41] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs3010.esams.wmnet} and A:liberica [13:47:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:47:56] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs3010.esams.wmnet [13:48:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:48:36] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus1006.eqiad.wmnet [13:48:42] FIRING: [9x] ProbeDown: Service ganeti2027:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:48:43] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus2006.codfw.wmnet [13:48:48] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus2008.codfw.wmnet [13:48:53] (03CR) 10Jelto: [C:03+2] gerrit: require user for gitiles access [puppet] - 10https://gerrit.wikimedia.org/r/1139798 (https://phabricator.wikimedia.org/T392467) (owner: 10Jelto) [13:48:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host cirrussearch2078.codfw.wmnet [13:49:07] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus1008.eqiad.wmnet [13:49:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet [13:49:15] jouncebot: next [13:49:15] In 1 hour(s) and 10 minute(s): SRE Collaboration Services office hours 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T1500) [13:49:20] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Unused in wmf.25 and wmf.27:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134693 (https://phabricator.wikimedia.org/T371196) (owner: 10Lucas Werkmeister (WMDE)) [13:49:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134693 (https://phabricator.wikimedia.org/T371196) (owner: 10Lucas Werkmeister (WMDE)) [13:49:38] I might slightly overrun the window depending on how long ^ takes [13:50:15] PROBLEM - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:51:13] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs3010.esams.wmnet [13:51:15] RECOVERY - BGP status on asw1-bw27-esams.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:51:37] (03Merged) 10jenkins-bot: Remove unused EntitySchema config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134693 (https://phabricator.wikimedia.org/T371196) (owner: 10Lucas Werkmeister (WMDE)) [13:51:57] !log installing libcap2 security updates [13:51:57] RESOLVED: KubernetesCalicoDown: kubestagemaster2005.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2005.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:02] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1134693|Remove unused EntitySchema config (T371196)]] [13:52:03] 
pt1979@cumin2002 dhcp (PID 2388488) is awaiting input [13:52:06] T371196: The EntitySchema type URI is missing from the Wikibase ontology - https://phabricator.wikimedia.org/T371196 [13:53:21] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs3009.esams.wmnet} and A:liberica [13:53:43] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs3009.esams.wmnet} and A:liberica [13:54:37] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:54:37] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:55:16] jmm@cumin2002 drain-node (PID 2388691) is awaiting input [13:55:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host cirrussearch2078.codfw.wmnet [13:55:50] !log upgrading haproxykafka on A:cp (T387454) [13:55:51] PROBLEM - BGP status on asw1-by27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:55:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet [13:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:55] T387454: Add HAproxy termination field to webrequest - https://phabricator.wikimedia.org/T387454 [13:56:08] Amir1: if you have a minute, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1139434 [13:56:14] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1006.eqiad.wmnet [13:57:01] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2008.codfw.wmnet [13:57:09] !log filippo@cumin1002 END (PASS) - Cookbook
sre.hosts.reboot-single (exit_code=0) for host prometheus1008.eqiad.wmnet [13:58:25] (03PS1) 10Elukey: admin_ng: Update Knative on ml-serve-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139865 (https://phabricator.wikimedia.org/T369493) [13:58:29] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Backport for [[gerrit:1134693|Remove unused EntitySchema config (T371196)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:58:34] T371196: The EntitySchema type URI is missing from the Wikibase ontology - https://phabricator.wikimedia.org/T371196 [13:58:37] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:58:37] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:58:42] FIRING: [15x] ProbeDown: Service ganeti2027:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:58:47] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2176 - https://phabricator.wikimedia.org/T392876#10776579 (10Jhancock.wm) @Marostegui we caught this one right before it went out of warranty. I put in for a new drive with dell. should be here tomorrow. But I have on hands from decommed servers if yo... 
[13:58:53] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Continuing with sync [13:58:55] still works [13:59:24] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2078.codfw.wmnet with OS bullseye [13:59:24] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2176 - https://phabricator.wikimedia.org/T392876#10776580 (10Marostegui) It is fine to wait till tomorrow - no worries!. [13:59:29] (03CR) 10CI reject: [V:04-1] admin_ng: Update Knative on ml-serve-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139865 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [14:00:05] (03PS2) 10Elukey: admin_ng: Update Knative on ml-serve-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139865 (https://phabricator.wikimedia.org/T369493) [14:00:17] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2176 - https://phabricator.wikimedia.org/T392876#10776582 (10Jhancock.wm) can do. Dell Service Request: 209215325 [14:00:40] !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet [14:01:08] (03CR) 10CI reject: [V:04-1] admin_ng: Update Knative on ml-serve-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139865 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [14:01:09] !log haproxykafka upgraded and restarted on A:cp (T387454) [14:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:14] T387454: Add HAproxy termination field to webrequest - https://phabricator.wikimedia.org/T387454 [14:01:47] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10776590 (10MoritzMuehlenhoff) [14:02:24] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2047.codfw.wmnet with OS bookworm [14:02:28] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs3009.esams.wmnet [14:02:30] 10ops-codfw, 06SRE, 
06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10776592 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm [14:02:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet [14:02:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet [14:02:46] OpenURI::HTTPError: 401 Unauthorized - not great from CI :( [14:03:09] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:03:09] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:03:11] (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139865 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [14:03:25] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:03:42] FIRING: [16x] ProbeDown: Service ganeti2027:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:03:44] (03PS3) 10Elukey: admin_ng: Update Knative on ml-serve-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139865 (https://phabricator.wikimedia.org/T369493) [14:03:45] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2006.codfw.wmnet [14:04:06] FIRING: [18x] ProbeDown: Service ganeti2027:1811 has failed probes (tcp_ganeti_noded_ip4) -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:05:04] (03CR) 10CI reject: [V:04-1] admin_ng: Update Knative on ml-serve-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139865 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [14:05:24] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): cirrussearch2078 (R440 Config D, Row/Rack B2) unable to PXE boot - https://phabricator.wikimedia.org/T392644#10776602 (10bking) 05Open→03Resolved Per IRC conversation with @Papaul , he was able to get PXE booting to work w... [14:05:43] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Possible frdb2004 hardware failure. - https://phabricator.wikimedia.org/T392579#10776610 (10Jgreen) [14:05:45] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs3009.esams.wmnet [14:05:45] PROBLEM - Host lvs3009 is DOWN: PING CRITICAL - Packet loss = 100% [14:05:51] RECOVERY - BGP status on asw1-by27-esams.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:05:59] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1134693|Remove unused EntitySchema config (T371196)]] (duration: 13m 57s) [14:06:04] T371196: The EntitySchema type URI is missing from the Wikibase ontology - https://phabricator.wikimedia.org/T371196 [14:06:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:06:32] !log UTC afternoon backport+config window done [14:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:37] 
RECOVERY - Host lvs3009 is UP: PING OK - Packet loss = 0%, RTA = 80.22 ms [14:07:09] RECOVERY - BFD status on cr2-codfw is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:07:09] RECOVERY - BFD status on cr1-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:07:58] !log tappof@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet [14:08:27] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:08:37] (03CR) 10Slyngshede: [C:03+2] Permissions: Add comments from permission managers [software/bitu] - 10https://gerrit.wikimedia.org/r/1139823 (https://phabricator.wikimedia.org/T392682) (owner: 10Slyngshede) [14:08:42] FIRING: [17x] ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:10:23] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:10:23] PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:11:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:11:23] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:11:23] RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 
AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:11:27] (03PS9) 10Tiziano Fogli: pdu_config_netbox: also fetch older PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1135022 (https://phabricator.wikimedia.org/T387866) [14:11:57] (03Merged) 10jenkins-bot: Permissions: Add comments from permission managers [software/bitu] - 10https://gerrit.wikimedia.org/r/1139823 (https://phabricator.wikimedia.org/T392682) (owner: 10Slyngshede) [14:12:21] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs3008.esams.wmnet} and A:liberica [14:12:42] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs3008.esams.wmnet} and A:liberica [14:13:31] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2078.codfw.wmnet with reason: host reimage [14:13:35] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Possible frdb2004 hardware failure. - https://phabricator.wikimedia.org/T392579#10776664 (10Papaul) a:03Jhancock.wm @Jhancock.wm when you have a minute, can you please check this host? Also, can you please upgrade the CPLD.
Thank you [14:14:34] (03PS2) 10AOkoth: wmnet: change active aphlict host [dns] - 10https://gerrit.wikimedia.org/r/1139546 (https://phabricator.wikimedia.org/T392128) [14:15:15] PROBLEM - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:15:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1153:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1153 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:15:47] !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host gitlab-runner1003.eqiad.wmnet [14:16:06] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: eno1 on gitlab-runner1003:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T392585#10776676 (10ops-monitoring-bot) Host rebooted by jelto@cumin1002 with reason: eno1 has the wrong speed [14:16:27] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2078.codfw.wmnet with reason: host reimage [14:18:07] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5392/co" [puppet] - 10https://gerrit.wikimedia.org/r/1139852 (https://phabricator.wikimedia.org/T383966) (owner: 10Filippo Giunchedi) [14:19:20] jouncebot: now and next [14:19:20] No deployments scheduled for the next 0 hour(s) and 40 minute(s) [14:19:35] alright I'll reboot a bunch of prometheus hosts in pops [14:19:41] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs3008.esams.wmnet [14:19:45] !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host centrallog1002.eqiad.wmnet [14:20:00] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for 
host prometheus3003.esams.wmnet [14:20:22] (03PS4) 10Máté Szabó: Unify IPInfo access levels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081370 (https://phabricator.wikimedia.org/T375086) [14:20:36] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:20:38] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:22:18] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:22:18] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:22:23] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner1003.eqiad.wmnet [14:22:38] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:22:58] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs3008.esams.wmnet [14:23:14] RECOVERY - BGP status on asw1-bw27-esams.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:24:10] FIRING: [2x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.86 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:26:11] (03CR) 10Ladsgroup: [C:04-1] Catalog ContentTranslation tables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135730 (https://phabricator.wikimedia.org/T386094) (owner: 10Nik Gkountas) [14:26:18] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:26:18] RECOVERY - BFD status on cr1-eqiad is OK: UP: 21 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:26:38] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:26:56] (03CR) 10Awight: [C:03+1] "ping: would be fantastic to have this reenabled now." [puppet] - 10https://gerrit.wikimedia.org/r/1139434 (owner: 10Awight) [14:27:13] !log tappof@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog1002.eqiad.wmnet [14:27:40] (03CR) 10Herron: [C:03+1] thanos: enable auto memlimit [puppet] - 10https://gerrit.wikimedia.org/r/1139852 (https://phabricator.wikimedia.org/T383966) (owner: 10Filippo Giunchedi) [14:28:42] FIRING: [8x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:29:03] (03CR) 10Ladsgroup: "Hi, I've got your ping multiple times now. Adding back an ssh key is much less straightforward than removing it. I need to confirm the identity" [puppet] - 10https://gerrit.wikimedia.org/r/1139434 (owner: 10Awight) [14:29:06] (03CR) 10Elukey: [C:03+1] Add krb1002 to the list of KDCs presented to Kerberos clients [puppet] - 10https://gerrit.wikimedia.org/r/1139850 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff) [14:29:10] RESOLVED: [2x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.86 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:30:13] (03CR) 10Elukey: "Ok to merge?"
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1135402 (https://phabricator.wikimedia.org/T391457) (owner: 10Elukey) [14:30:23] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2106 to cirrussearch2106 [14:30:45] !log bking@cumin2002 START - Cookbook sre.dns.netbox [14:32:56] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus3003.esams.wmnet [14:32:57] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs1016.eqiad.wmnet [14:35:27] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:35:29] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2078.codfw.wmnet with OS bullseye [14:35:57] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus4002.ulsfo.wmnet [14:36:19] bking@cumin2002 rename (PID 2431696) is awaiting input [14:36:38] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus5002.eqsin.wmnet [14:38:33] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1016.eqiad.wmnet [14:38:48] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus3003.esams.wmnet [14:39:09] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus7001.magru.wmnet [14:39:11] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus6002.drmrs.wmnet [14:40:13] 10ops-codfw, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908 (10RobH) 03NEW [14:40:29] 10ops-codfw, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-be200[6-9] - 
https://phabricator.wikimedia.org/T392908#10776761 (10RobH) [14:40:32] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2106 to cirrussearch2106 - bking@cumin2002" [14:41:06] 10ops-codfw, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10776762 (10RobH) a:03MatthewVernon @matthewvernon, Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet u... [14:41:55] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs1015.eqiad.wmnet [14:42:17] 10ops-eqiad, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909 (10RobH) 03NEW [14:42:21] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus4002.ulsfo.wmnet [14:42:30] (03PS1) 10Ssingh: P:durum and hiera: update health check path [puppet] - 10https://gerrit.wikimedia.org/r/1139873 [14:42:35] 10ops-eqiad, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10776789 (10RobH) [14:42:43] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus5002.eqsin.wmnet [14:42:56] 10ops-eqiad, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10776791 (10RobH) a:03MatthewVernon [14:43:07] 10ops-eqiad, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10776794 (10RobH) @MatthewVernon, Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-... 
[14:43:34] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5393/co" [puppet] - 10https://gerrit.wikimedia.org/r/1139873 (owner: 10Ssingh) [14:43:37] bking@cumin2002 rename (PID 2431696) is awaiting input [14:43:42] FIRING: [9x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:43:42] (03PS3) 10Ssingh: wikimedia-dns.org: add TYPE65 records for check.wikimedia-dns.org [dns] - 10https://gerrit.wikimedia.org/r/1137021 (https://phabricator.wikimedia.org/T205378) [14:45:13] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus7001.magru.wmnet [14:45:14] (03CR) 10Ssingh: [V:03+1] "sukhe@durum2002:~$ /usr/lib/nagios/plugins/check_http -H yesdoh.check.wikimedia-dns.org --ssl --sni -I 185.71.138.140 -u /check -t 1 && /u" [puppet] - 10https://gerrit.wikimedia.org/r/1139873 (owner: 10Ssingh) [14:45:14] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus6002.drmrs.wmnet [14:46:14] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: eno1 on gitlab-runner1003:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T392585#10776805 (10Jelto) 05Open→03Resolved This issue resolved after a reboot. The alert is gone. I'll resolve the task optimistically. [14:46:28] (03CR) 10Elukey: "Ack I'll try! 
At the moment I have trouble cherry-picking, I see some conflicts with run_ci_locally.sh :(" [puppet] - 10https://gerrit.wikimedia.org/r/1135115 (owner: 10JHathaway) [14:47:07] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2106 to cirrussearch2106 - bking@cumin2002" [14:47:08] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:47:08] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2106 [14:47:19] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2106 [14:47:28] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: "A non-identical file already exists" - Cannot undelete [[File:Hawkmoth (Meganoton nyctiphanes) (8688240817).jpg]] - https://phabricator.wikimedia.org/T392658#10776816 (10MatthewVernon) This image exists in both swift clusters, dating back to 2021... [14:47:31] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1015.eqiad.wmnet [14:48:00] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2106 to cirrussearch2106 [14:49:15] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2106.codfw.wmnet on all recursors [14:49:18] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2106.codfw.wmnet on all recursors [14:50:12] (03PS1) 10Jelto: Revert "gerrit: require user for gitiles access" [puppet] - 10https://gerrit.wikimedia.org/r/1139874 (https://phabricator.wikimedia.org/T392467) [14:50:24] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: disk failure (sdb) on coludcephmon1004 - https://phabricator.wikimedia.org/T392458#10776834 (10Jclark-ctr) Confirmed: Service Request 209219050 was successfully submitted. 
[14:50:30] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs1014.eqiad.wmnet [14:50:36] (03CR) 10BCornwall: [C:03+1] wikimedia-dns.org: add TYPE65 records for check.wikimedia-dns.org [dns] - 10https://gerrit.wikimedia.org/r/1137021 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [14:51:28] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T392428#10776838 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [14:52:42] (03CR) 10Jelto: [C:03+2] Revert "gerrit: require user for gitiles access" [puppet] - 10https://gerrit.wikimedia.org/r/1139874 (https://phabricator.wikimedia.org/T392467) (owner: 10Jelto) [14:52:58] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T392427#10776845 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [14:53:20] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2106.codfw.wmnet with OS bullseye [14:53:20] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance [14:53:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T392806)', diff saved to https://phabricator.wikimedia.org/P75622 and previous config saved to /var/cache/conftool/dbconfig/20250429-145327-fceratto.json [14:53:30] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10776852 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [14:53:32] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2106 [14:54:38] (03PS1) 10Majavah: keepalived: failover: Select unicast source v6 more reliably [puppet] - 10https://gerrit.wikimedia.org/r/1139877 (https://phabricator.wikimedia.org/T379175) [14:54:39] !log bking@cumin2002 START - 
Cookbook sre.dns.netbox [14:55:10] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for madalina - https://phabricator.wikimedia.org/T392893#10776857 (10tappof) [14:55:11] (03CR) 10Bking: [C:03+1] Create an SSH private key in the mediawiki-dumps-legacy namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139840 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis) [14:55:14] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:56:17] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1014.eqiad.wmnet [14:57:10] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:57:27] (03CR) 10Bking: [C:03+1] Enable the dumpsgen user to use an rsync server over ssh from dse-k8s-eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1139835 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis) [14:57:46] (03PS4) 10Ssingh: wikimedia-dns.org: add TYPE65 records for check.wikimedia-dns.org [dns] - 10https://gerrit.wikimedia.org/r/1137021 (https://phabricator.wikimedia.org/T205378) [14:57:52] (03CR) 10Brouberol: [C:03+1] Enable the dumpsgen user to use an rsync server over ssh from dse-k8s-eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1139835 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis) [14:58:15] (03CR) 10Brouberol: [C:03+1] Create an SSH private key in the mediawiki-dumps-legacy namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139840 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis) [14:58:31] (03CR) 10Majavah: [V:03+1 C:03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5394/console" 
[puppet] - 10https://gerrit.wikimedia.org/r/1139857 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah) [14:58:49] (03CR) 10JHathaway: [C:03+1] puppetdb: add tunable for maximum-pool-size [puppet] - 10https://gerrit.wikimedia.org/r/1139481 (owner: 10Filippo Giunchedi) [14:58:55] (03CR) 10Ssingh: "Generated with:" [dns] - 10https://gerrit.wikimedia.org/r/1137021 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [15:00:04] jelto, arnoldokoth, and mutante: That opportune time for a SRE Collaboration Services office hours deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T1500). [15:00:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T392806)', diff saved to https://phabricator.wikimedia.org/P75623 and previous config saved to /var/cache/conftool/dbconfig/20250429-150011-fceratto.json [15:00:12] bking@cumin2002 reimage (PID 2453344) is awaiting input [15:02:50] (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139865 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [15:03:52] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet [15:05:45] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2106 - bking@cumin2002" [15:05:49] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1139835 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis) [15:05:50] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2106 - bking@cumin2002" [15:05:51] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:05:51] !log bking@cumin2002 START - Cookbook 
sre.dns.wipe-cache cirrussearch2106.codfw.wmnet 88.48.192.10.in-addr.arpa 8.8.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:05:55] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2106.codfw.wmnet 88.48.192.10.in-addr.arpa 8.8.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:05:55] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2106 [15:06:19] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2106 [15:06:19] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2106 [15:07:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet [15:10:44] (03PS4) 10Elukey: admin_ng: Update Knative on ml-serve-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139865 (https://phabricator.wikimedia.org/T369493) [15:10:51] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5395/co" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1139852 (https://phabricator.wikimedia.org/T383966) (owner: 10Filippo Giunchedi) [15:11:13] (03CR) 10Majavah: [V:03+1 C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1139852 (https://phabricator.wikimedia.org/T383966) (owner: 10Filippo Giunchedi) [15:11:47] (03CR) 10Dreamy Jazz: Unify IPInfo access levels (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081370 
(https://phabricator.wikimedia.org/T375086) (owner: 10Máté Szabó) [15:13:42] FIRING: [7x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:15:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P75624 and previous config saved to /var/cache/conftool/dbconfig/20250429-151518-fceratto.json [15:15:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet [15:15:47] FIRING: [2x] ProbeDown: Service ml-serve-ctrl2001:6443 has failed probes (http_ml_serve_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ml-serve-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:15:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet [15:15:58] !incidents [15:15:58] 6068 (UNACKED) [2x] ProbeDown sre (ml-serve-ctrl2001:6443 probes/custom codfw) [15:16:00] !ack 6068 [15:16:01] 6068 (ACKED) [2x] ProbeDown sre (ml-serve-ctrl2001:6443 probes/custom codfw) [15:16:22] klausman: is this you? [15:16:37] (sorry, going by SAL and a possibly related change for ml-lab1001?) 
[15:18:24] elukey: maybe you as well :) I am really not sure what to do here [15:18:42] FIRING: [7x] ProbeDown: Service ganeti2029:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:19:07] the host seems up though [15:19:07] sukhe: o/ in theory no, I see that the kube-apiserver reloaded a while ago, I think it is due to a TLS cert reload [15:19:20] but I bumped vcores and memory to prevent this :D [15:19:30] elukey: hmm I see but a cert reload can cause a probe failure? [15:19:30] (not now, some days ago) [15:19:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet [15:20:13] weirdly, a resolve has not come in [15:20:38] sukhe: yes I know it is sad, but the kube-apiserver needs to be restarted and in the ML case it may be busy in doing multiple things while booting, not replying to health checks [15:20:47] RESOLVED: [2x] ProbeDown: Service ml-serve-ctrl2001:6443 has failed probes (http_ml_serve_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ml-serve-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:20:50] ah no worries, was trying to understand it [15:20:53] ok resolve came in :) [15:20:56] thanks elukey <3 [15:21:31] np! It is weird that from https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=ml-serve-ctrl2001&var-datasource=thanos&var-cluster=ml_serve I don't see the server under pressure [15:21:46] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2108 to cirrussearch2108 [15:21:57] !log bking@cumin2002 START - Cookbook sre.dns.netbox [15:22:29] yeah nothing else stands out on the server itself as well, except a smallish spike on network utilization? 
[15:22:33] but surely that can't be it [15:22:58] (03CR) 10Brouberol: [C:03+1] Enable the dumpsgen user to use an rsync server over ssh from dse-k8s-eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1139835 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis) [15:23:00] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2106.codfw.wmnet with reason: host reimage [15:23:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet [15:24:37] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10776944 (10MoritzMuehlenhoff) [15:26:13] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2108 to cirrussearch2108 - bking@cumin2002" [15:26:35] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1139877 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah) [15:26:36] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2106.codfw.wmnet with reason: host reimage [15:26:39] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2108 to cirrussearch2108 - bking@cumin2002" [15:26:39] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:26:40] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2108 [15:27:07] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2108 [15:27:47] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2108 to cirrussearch2108 [15:28:29] sukhe: I think I can confirm, kube-publish-sa-cert.service 
ran 15 mins ago sigh [15:28:33] timing matches perfectly [15:28:35] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2108.codfw.wmnet with OS bullseye [15:28:38] ok thanks elukey [15:28:46] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2108 [15:28:47] that's good to know at least, that there is a cause [15:29:24] !log bking@cumin2002 START - Cookbook sre.dns.netbox [15:30:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P75625 and previous config saved to /var/cache/conftool/dbconfig/20250429-153026-fceratto.json [15:30:30] PROBLEM - Hadoop NodeManager on an-worker1201 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:31:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [15:32:39] (03PS1) 10David Caro: dcaro: add yubikey ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1139887 [15:33:30] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2108 - bking@cumin2002" [15:33:36] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2108 - bking@cumin2002" [15:33:36] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:33:37] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache 
cirrussearch2108.codfw.wmnet 90.48.192.10.in-addr.arpa 0.9.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:33:40] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2108.codfw.wmnet 90.48.192.10.in-addr.arpa 0.9.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:33:42] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2108 [15:34:07] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2108 [15:34:07] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2108 [15:35:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1153:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1153 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:36:44] RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [15:37:17] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for madalina - https://phabricator.wikimedia.org/T392893#10777021 (10tappof) [15:37:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:37:55] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for madalina - 
https://phabricator.wikimedia.org/T392893#10777022 (10tappof) [15:37:58] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10777023 (10elukey) Today John helped me test the hot-swap behavior, and everything seems working way more nicely. 1) John swapped one... [15:38:08] (03PS1) 10Ebernhardson: Revert "Revert "Update opensearch-madvise call for version 0.2"" [puppet] - 10https://gerrit.wikimedia.org/r/1139888 (https://phabricator.wikimedia.org/T390592) [15:38:24] (03PS2) 10Ebernhardson: Revert^2 "Update opensearch-madvise call for version 0.2" [puppet] - 10https://gerrit.wikimedia.org/r/1139888 (https://phabricator.wikimedia.org/T390592) [15:38:47] (03CR) 10CI reject: [V:04-1] Revert^2 "Update opensearch-madvise call for version 0.2" [puppet] - 10https://gerrit.wikimedia.org/r/1139888 (https://phabricator.wikimedia.org/T390592) (owner: 10Ebernhardson) [15:41:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138921 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [15:42:47] (03Merged) 10jenkins-bot: missing.php: Simplify code to reduce abstraction and duplication [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138921 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [15:43:13] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1138921|missing.php: Simplify code to reduce abstraction and duplication (T113114)]] [15:43:18] T113114: Make all wiki-facing error pages consistent - https://phabricator.wikimedia.org/T113114 [15:43:47] (03PS3) 10Btullis: Enable the dumpsgen user to use an rsync server over ssh from dse-k8s-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1139835 (https://phabricator.wikimedia.org/T390738) [15:44:37] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5396/co" [puppet] - 10https://gerrit.wikimedia.org/r/1139835 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis) [15:45:11] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM. Confirmed identity via videocall." [puppet] - 10https://gerrit.wikimedia.org/r/1139887 (owner: 10David Caro) [15:45:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T392806)', diff saved to https://phabricator.wikimedia.org/P75626 and previous config saved to /var/cache/conftool/dbconfig/20250429-154533-fceratto.json [15:45:52] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [15:46:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2154 (T392806)', diff saved to https://phabricator.wikimedia.org/P75627 and previous config saved to /var/cache/conftool/dbconfig/20250429-154559-fceratto.json [15:48:16] (03PS1) 10BCornwall: slo_template: update SLO dates to current window [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1139891 [15:49:55] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1138921|missing.php: Simplify code to reduce abstraction and duplication (T113114)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:49:59] T113114: Make all wiki-facing error pages consistent - https://phabricator.wikimedia.org/T113114 [15:50:28] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for madalina - https://phabricator.wikimedia.org/T392893#10777049 (10tappof) [15:50:30] RECOVERY - Hadoop NodeManager on an-worker1201 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:50:54] !log bking@cumin2002 START - Cookbook
sre.hosts.downtime for 2:00:00 on cirrussearch2108.codfw.wmnet with reason: host reimage [15:51:35] (03CR) 10Ssingh: [C:03+2] wikimedia-dns.org: add TYPE65 records for check.wikimedia-dns.org [dns] - 10https://gerrit.wikimedia.org/r/1137021 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [15:51:42] !log sukhe@dns1004 START - running authdns-update [15:52:01] (03CR) 10Herron: [C:03+1] slo_template: update SLO dates to current window [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1139891 (owner: 10BCornwall) [15:53:42] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:53:55] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for madalina - https://phabricator.wikimedia.org/T392893#10777064 (10tappof) [15:53:57] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2108.codfw.wmnet with reason: host reimage [15:54:10] !log sukhe@dns1004 END - running authdns-update [15:54:11] (03PS4) 10Btullis: Enable the dumpsgen user to use an rsync server over ssh from dse-k8s-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1139835 (https://phabricator.wikimedia.org/T390738) [15:54:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T392806)', diff saved to https://phabricator.wikimedia.org/P75628 and previous config saved to /var/cache/conftool/dbconfig/20250429-155419-fceratto.json [15:54:32] !log krinkle@deploy1003 krinkle: Continuing with sync [15:54:47] !log ebernhardson@deploy1003 Started deploy [airflow-dags/search@5bff61a]: Update airflow-search with simplified mjolnir dag [15:54:55] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for madalina - 
https://phabricator.wikimedia.org/T392893#10777075 (10tappof) [15:55:01] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5397/co" [puppet] - 10https://gerrit.wikimedia.org/r/1139835 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis) [15:55:11] (03PS5) 10Btullis: Enable the dumpsgen user to use an rsync server over ssh from dse-k8s-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1139835 (https://phabricator.wikimedia.org/T390738) [15:55:12] !log ebernhardson@deploy1003 Finished deploy [airflow-dags/search@5bff61a]: Update airflow-search with simplified mjolnir dag (duration: 00m 25s) [15:55:19] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2106.codfw.wmnet with OS bullseye [15:55:58] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5398/co" [puppet] - 10https://gerrit.wikimedia.org/r/1139835 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis) [15:57:28] (03CR) 10Ilias Sarantopoulos: [C:03+1] admin_ng: Update Knative on ml-serve-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139865 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [15:58:56] (03PS6) 10Btullis: Enable the dumpsgen user to use an rsync server over ssh from dse-k8s-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1139835 (https://phabricator.wikimedia.org/T390738) [15:59:42] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5399/co" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1139835 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis) [16:00:05] jhathaway and rzl: #bothumor My software never has bugs. It just develops random features. 
Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:43] (03PS2) 10Krinkle: missing.php: Redesign to match current error pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138922 (https://phabricator.wikimedia.org/T113114) [16:00:50] (03CR) 10CI reject: [V:04-1] missing.php: Redesign to match current error pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138922 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [16:01:11] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1138921|missing.php: Simplify code to reduce abstraction and duplication (T113114)]] (duration: 17m 57s) [16:01:16] T113114: Make all wiki-facing error pages consistent - https://phabricator.wikimedia.org/T113114 [16:01:26] (03PS3) 10Krinkle: missing.php: Redesign to match current error pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138922 (https://phabricator.wikimedia.org/T113114) [16:01:48] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Possible frdb2004 hardware failure. - https://phabricator.wikimedia.org/T392579#10777103 (10Jhancock.wm) @Jgreen reseated all the connections to the backplane. server came up. I checked the firmware version of the CPLD and it is current (1.0.7). lemme... 
[16:04:22] (03CR) 10Btullis: [C:03+2] Enable the dumpsgen user to use an rsync server over ssh from dse-k8s-eqiad (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1139835 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis) [16:04:49] (03CR) 10Btullis: [C:03+2] Enable the dumpsgen user to use an rsync server over ssh from dse-k8s-eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1139835 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis) [16:05:00] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2176 - https://phabricator.wikimedia.org/T392876#10777118 (10Jhancock.wm) and of course they decide to give me trouble. had to resubmit it. I'll let you know when the new drive is here and been replaced. [16:05:32] (03CR) 10David Caro: [C:03+2] dcaro: add yubikey ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1139887 (owner: 10David Caro) [16:06:06] (03PS1) 10JHathaway: ferm: ignore hidden staged files created by confd [puppet] - 10https://gerrit.wikimedia.org/r/1139893 [16:06:51] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1139893 (owner: 10JHathaway) [16:09:07] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2176 - https://phabricator.wikimedia.org/T392876#10777149 (10Marostegui) Thank you! [16:09:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P75629 and previous config saved to /var/cache/conftool/dbconfig/20250429-160925-fceratto.json [16:09:33] RECOVERY - Host ms-be1060 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [16:11:07] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10777168 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr @MatthewVernon I reseated the PCI RAID card and updated the BIO... 
[16:11:27] PROBLEM - Host ms-be1060 is DOWN: PING CRITICAL - Packet loss = 100% [16:12:54] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10777194 (10Jclark-ctr) 05Resolved→03Open [16:14:10] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2108.codfw.wmnet with OS bullseye [16:14:12] (03CR) 10BCornwall: [C:03+1] "The commit message isn't mentioning why you're removing the templating for durum's domain/ip addresses, so I'm a little confused about tha" [puppet] - 10https://gerrit.wikimedia.org/r/1139873 (owner: 10Ssingh) [16:14:46] (03CR) 10BCornwall: [V:03+2 C:03+2] slo_template: update SLO dates to current window [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1139891 (owner: 10BCornwall) [16:16:29] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:16:43] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:16:57] (03CR) 10Btullis: [C:03+2] Create an SSH private key in the mediawiki-dumps-legacy namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139840 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis) [16:17:29] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install db1258 - https://phabricator.wikimedia.org/T392493#10777280 (10Jhancock.wm) [16:18:42] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [16:18:49] (03Merged) 10jenkins-bot: Create an SSH private key in 
the mediawiki-dumps-legacy namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139840 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis) [16:18:52] FIRING: [8x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:19:37] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 4.451 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:20:19] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53800 bytes in 0.125 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:22:19] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:22:33] (03CR) 10Ssingh: [V:03+1] "Yes sorry that's on me. I will clarify it in the commit that fixes it." 
[puppet] - 10https://gerrit.wikimedia.org/r/1139873 (owner: 10Ssingh) [16:23:03] (03CR) 10Dzahn: gerrit: have different motd banners on active/passive servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1137840 (https://phabricator.wikimedia.org/T392212) (owner: 10Dzahn) [16:23:17] (03CR) 10Kamila Součková: [C:03+1] k8s: rename V1beta1Eviction to support future upgrades [software/spicerack] - 10https://gerrit.wikimedia.org/r/1139851 (https://phabricator.wikimedia.org/T390857) (owner: 10Elukey) [16:23:19] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:24:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P75630 and previous config saved to /var/cache/conftool/dbconfig/20250429-162432-fceratto.json [16:24:39] (03CR) 10Dzahn: "thanks. this is to fix a warning I got on running 'puppet lint'." [puppet] - 10https://gerrit.wikimedia.org/r/1137842 (owner: 10Dzahn) [16:27:01] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10777417 (10Jclark-ctr) Error came back reopened ticket [16:27:19] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:28:15] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:28:15] (03PS2) 10Kimberly Sarabia: Stream registration for article summaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129958 (https://phabricator.wikimedia.org/T389097) [16:29:11] (03PS3) 10Jdlrobson: Stream registration for article summaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129958 (https://phabricator.wikimedia.org/T389097) (owner: 10Kimberly Sarabia) [16:29:55] (03CR) 10Dzahn: "I was about to upload this and then saw it was already done. taavi, that suggestion was right." [puppet] - 10https://gerrit.wikimedia.org/r/1138995 (https://phabricator.wikimedia.org/T382309) (owner: 10Jelto) [16:31:35] (03PS2) 10Dzahn: microsites/backup: remove rt-static backup fileset [puppet] - 10https://gerrit.wikimedia.org/r/1137486 (https://phabricator.wikimedia.org/T385777) [16:31:54] (03CR) 10CI reject: [V:04-1] microsites/backup: remove rt-static backup fileset [puppet] - 10https://gerrit.wikimedia.org/r/1137486 (https://phabricator.wikimedia.org/T385777) (owner: 10Dzahn) [16:32:49] (03CR) 10Majavah: [C:03+2] keepalived: failover: Select unicast source v6 more reliably [puppet] - 10https://gerrit.wikimedia.org/r/1139877 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah) [16:33:04] (03PS3) 10Dzahn: microsites/backup: remove rt-static backup fileset [puppet] - 10https://gerrit.wikimedia.org/r/1137486 (https://phabricator.wikimedia.org/T385777) [16:33:23] (03PS4) 10Dzahn: microsites/backup: remove rt-static backup fileset [puppet] - 10https://gerrit.wikimedia.org/r/1137486 (https://phabricator.wikimedia.org/T385777) [16:35:33] (03CR) 10CI reject: [V:04-1] microsites/backup: remove rt-static backup fileset [puppet] - 10https://gerrit.wikimedia.org/r/1137486 
(https://phabricator.wikimedia.org/T385777) (owner: 10Dzahn) [16:39:21] (03PS5) 10Dzahn: microsites/backup: remove rt-static backup fileset [puppet] - 10https://gerrit.wikimedia.org/r/1137486 (https://phabricator.wikimedia.org/T385777) [16:39:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T392806)', diff saved to https://phabricator.wikimedia.org/P75631 and previous config saved to /var/cache/conftool/dbconfig/20250429-163939-fceratto.json [16:39:59] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance [16:40:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2162 (T392806)', diff saved to https://phabricator.wikimedia.org/P75632 and previous config saved to /var/cache/conftool/dbconfig/20250429-164005-fceratto.json [16:46:08] (03CR) 10Dzahn: [C:03+2] microsites/backup: remove rt-static backup fileset [puppet] - 10https://gerrit.wikimedia.org/r/1137486 (https://phabricator.wikimedia.org/T385777) (owner: 10Dzahn) [16:47:30] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10777501 (10Jclark-ctr) @wiki_willy @RobH looks like this raid card has failed Can we get a new one ordered? [16:48:46] (03PS1) 10Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1139901 [16:49:15] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10777503 (10Dzahn) out of curiosity: are we replacing this hardware anyways since it's almost 5 years old? 
[16:49:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T392806)', diff saved to https://phabricator.wikimedia.org/P75633 and previous config saved to /var/cache/conftool/dbconfig/20250429-164927-fceratto.json [16:49:33] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10777504 (10RobH) Notes: * System warranty ended on October 27, 2023 (3 years after purchase) * 5 year life projection says this sho... [16:50:02] (03PS1) 10Majavah: keepalived: failover: Fix hiera key path [puppet] - 10https://gerrit.wikimedia.org/r/1139902 [16:50:48] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5400/console" [puppet] - 10https://gerrit.wikimedia.org/r/1139902 (owner: 10Majavah) [16:51:02] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10777512 (10RobH) >>! In T392796#10777500, @Jclark-ctr wrote: > @wiki_willy @RobH looks like this raid card has failed Can we get... [16:51:28] (03CR) 10Majavah: [V:03+1 C:03+2] keepalived: failover: Fix hiera key path [puppet] - 10https://gerrit.wikimedia.org/r/1139902 (owner: 10Majavah) [16:52:46] (03PS3) 10AOkoth: wmnet: change active aphlict host [dns] - 10https://gerrit.wikimedia.org/r/1139546 (https://phabricator.wikimedia.org/T392128) [16:53:35] (03CR) 10Hashar: "The cow arts are not Apache2 licensed, they are licensed under `COWSAY`. 
The license is shipped by the Debian package `/usr/share/doc/cows" [puppet] - 10https://gerrit.wikimedia.org/r/1137840 (https://phabricator.wikimedia.org/T392212) (owner: 10Dzahn) [16:55:11] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10777538 (10RobH) [16:56:17] (03PS6) 10BCornwall: cdn: Unify ats/haproxy/varnish upgrade cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1129882 [16:56:29] (03CR) 10BCornwall: cdn: Unify ats/haproxy/varnish upgrade cookbooks (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1129882 (owner: 10BCornwall) [16:56:43] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10777542 (10wiki_willy) @Jclark-ctr - it looks like we refreshed ms-be105[1-9] towards the end of last year via T371389. Can you ch... [16:56:51] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:58:21] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10777558 (10wiki_willy) Sorry, nevermind....it looks like they're HPs >>! In T392796#10777542, @wiki_willy wrote: > @Jclark-ctr - i... 
[16:58:47] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:59:03] (03CR) 10AOkoth: [C:03+2] wmnet: change active aphlict host [dns] - 10https://gerrit.wikimedia.org/r/1139546 (https://phabricator.wikimedia.org/T392128) (owner: 10AOkoth) [16:59:23] (03PS15) 10Majavah: dynamicproxy: Provision AAAA records [puppet] - 10https://gerrit.wikimedia.org/r/1088338 (https://phabricator.wikimedia.org/T379175) [16:59:26] (03CR) 10AOkoth: [C:03+2] aphlict: ensure absent on active host [puppet] - 10https://gerrit.wikimedia.org/r/1139534 (https://phabricator.wikimedia.org/T392128) (owner: 10AOkoth) [16:59:48] (03CR) 10Majavah: "Weirdly enough this is up next." [puppet] - 10https://gerrit.wikimedia.org/r/1088338 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah) [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T1700) [17:00:09] !log aokoth@dns1004 START - running authdns-update [17:02:41] !log aokoth@dns1004 END - running authdns-update [17:03:49] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:04:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P75634 and previous config saved to /var/cache/conftool/dbconfig/20250429-170436-fceratto.json [17:04:49] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:08:26] (03PS1) 10Btullis: mediawiki-dumps-legacy: Fix helmfile secrets path [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139903 (https://phabricator.wikimedia.org/T390738) [17:10:12] !log sudo cumin 'A:durum' 'disable-puppet "rolling out CR 1139873"' [17:10:15] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:46] (03CR) 10Ssingh: [V:03+1 C:03+2] P:durum and hiera: update health check path [puppet] - 10https://gerrit.wikimedia.org/r/1139873 (owner: 10Ssingh) [17:12:18] (03CR) 10Btullis: [C:03+2] mediawiki-dumps-legacy: Fix helmfile secrets path [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139903 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis) [17:14:35] (03Merged) 10jenkins-bot: mediawiki-dumps-legacy: Fix helmfile secrets path [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139903 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis) [17:16:12] !log sudo cumin -b1 -s30 'A:durum and not P{durum2002*}' 'run-puppet-agent --enable "rolling out CR 1139873"' [17:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P75635 and previous config saved to /var/cache/conftool/dbconfig/20250429-171943-fceratto.json [17:22:46] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:24:19] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:24:51] (03CR) 10Ssingh: [C:03+1] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1139901 (owner: 10Ncmonitor) [17:25:15] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:25:21] (03CR) 10BCornwall: [C:03+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1139901 (owner: 10Ncmonitor) [17:27:09] !log aokoth@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on aphlict1002.eqiad.wmnet with reason: Bookworm Re-image [17:28:33] !log aokoth@cumin1002 START - Cookbook sre.hosts.reimage for host aphlict1002.eqiad.wmnet with OS bookworm [17:28:42] FIRING: [9x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:33:55] (03PS1) 10Btullis: mediawiki-dumps-legacy: Add private values files to resources deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139906 (https://phabricator.wikimedia.org/T390738) [17:34:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T392806)', diff saved to https://phabricator.wikimedia.org/P75636 and previous config saved to /var/cache/conftool/dbconfig/20250429-173450-fceratto.json [17:35:11] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [17:35:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2163 (T392806)', diff saved to https://phabricator.wikimedia.org/P75637 and previous config saved to 
/var/cache/conftool/dbconfig/20250429-173517-fceratto.json [17:36:05] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for madalina - https://phabricator.wikimedia.org/T392893#10777736 (10Ahoelzl) Approved. [17:36:35] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to analytics-privatedata-users for madalina - https://phabricator.wikimedia.org/T392893#10777738 (10Ahoelzl) [17:37:38] (03CR) 10Andrea Denisse: [C:03+1] "LGTM! I'm just curious about how was the memlimit_ratio value defined." [puppet] - 10https://gerrit.wikimedia.org/r/1139852 (https://phabricator.wikimedia.org/T383966) (owner: 10Filippo Giunchedi) [17:37:43] !log aokoth@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on aphlict1002.eqiad.wmnet with reason: host reimage [17:40:49] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aphlict1002.eqiad.wmnet with reason: host reimage [17:41:58] (03PS1) 10Majavah: keepalived: failover: Skip searching v6 addresses on v4-only hosts [puppet] - 10https://gerrit.wikimedia.org/r/1139909 [17:43:31] (03Abandoned) 10AOkoth: miscweb: update values-os-reports env config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115944 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [17:44:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T392806)', diff saved to https://phabricator.wikimedia.org/P75638 and previous config saved to /var/cache/conftool/dbconfig/20250429-174438-fceratto.json [17:44:52] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5401/console" [puppet] - 10https://gerrit.wikimedia.org/r/1139909 (owner: 10Majavah) [17:46:19] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5402/console" [puppet] - 10https://gerrit.wikimedia.org/r/1139909 (owner: 10Majavah) [17:47:18] (03CR) 10Majavah: [V:03+1 C:03+2] keepalived: failover: Skip searching v6 addresses on v4-only hosts [puppet] - 10https://gerrit.wikimedia.org/r/1139909 (owner: 10Majavah) [17:53:19] (03PS1) 10Ssingh: P:durum: hiera: log only ech_status [puppet] - 10https://gerrit.wikimedia.org/r/1139911 (https://phabricator.wikimedia.org/T205378) [17:53:59] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5403/co" [puppet] - 10https://gerrit.wikimedia.org/r/1139911 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [17:54:59] (03PS2) 10Ssingh: P:durum: hiera: log only ech_status [puppet] - 10https://gerrit.wikimedia.org/r/1139911 (https://phabricator.wikimedia.org/T205378) [17:55:39] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5404/co" [puppet] - 10https://gerrit.wikimedia.org/r/1139911 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [17:56:47] (03CR) 10Ssingh: [V:03+1] "@bcornwall@wikimedia.org: Hopefully this should clear up the confusion I created with the earlier commit and the intent. 
Let me know if yo" [puppet] - 10https://gerrit.wikimedia.org/r/1139911 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [17:58:42] FIRING: [10x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:59:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:59:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P75639 and previous config saved to /var/cache/conftool/dbconfig/20250429-175946-fceratto.json [18:00:05] hashar and dduvall: Deploy window MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T1800) [18:01:42] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aphlict1002.eqiad.wmnet with OS bookworm [18:02:35] (03CR) 10Ssingh: "I am abandoning this for now. The Gitlab project is working fine so I will stick with it. 
For the CDN deployment, the changes should be up" [debs/nginx-ech] - 10https://gerrit.wikimedia.org/r/1135733 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [18:02:41] (03Abandoned) 10Ssingh: Release 1.22.1-9+deb12u1+ech1 [debs/nginx-ech] - 10https://gerrit.wikimedia.org/r/1135733 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [18:06:30] (03PS1) 10GergesShamon: Change Arabic Wikipedia tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139912 (https://phabricator.wikimedia.org/T392858) [18:11:23] (03PS2) 10GergesShamon: Change Arabic Wikipedia tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139912 (https://phabricator.wikimedia.org/T392858) [18:14:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P75640 and previous config saved to /var/cache/conftool/dbconfig/20250429-181453-fceratto.json [18:16:42] (03CR) 10BCornwall: [C:03+1] P:durum: hiera: log only ech_status [puppet] - 10https://gerrit.wikimedia.org/r/1139911 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [18:17:19] (03CR) 10Ssingh: [V:03+1 C:03+2] P:durum: hiera: log only ech_status [puppet] - 10https://gerrit.wikimedia.org/r/1139911 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [18:17:30] (03PS3) 10GergesShamon: Change Arabic Wikipedia tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139912 (https://phabricator.wikimedia.org/T392858) [18:18:22] (03PS4) 10GergesShamon: Change Arabic Wikipedia tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139912 (https://phabricator.wikimedia.org/T392858) [18:24:40] (03PS5) 10GergesShamon: Change Arabic Wikipedia tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139912 (https://phabricator.wikimedia.org/T392858) [18:27:23] (03PS6) 10GergesShamon: Change Arabic Wikipedia tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139912 
(https://phabricator.wikimedia.org/T392858) [18:30:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T392806)', diff saved to https://phabricator.wikimedia.org/P75641 and previous config saved to /var/cache/conftool/dbconfig/20250429-183000-fceratto.json [18:30:22] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [18:30:37] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [18:30:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2164 (T392806)', diff saved to https://phabricator.wikimedia.org/P75642 and previous config saved to /var/cache/conftool/dbconfig/20250429-183044-fceratto.json [18:31:22] (03PS8) 10Dzahn: gerrit: have different motd banners on active/passive servers [puppet] - 10https://gerrit.wikimedia.org/r/1137840 (https://phabricator.wikimedia.org/T392212) [18:31:42] (03CR) 10Dzahn: gerrit: have different motd banners on active/passive servers (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1137840 (https://phabricator.wikimedia.org/T392212) (owner: 10Dzahn) [18:31:48] (03CR) 10CI reject: [V:04-1] Change Arabic Wikipedia tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139912 (https://phabricator.wikimedia.org/T392858) (owner: 10GergesShamon) [18:33:22] (03PS9) 10Dzahn: gerrit: have different motd banners on active/passive servers [puppet] - 10https://gerrit.wikimedia.org/r/1137840 (https://phabricator.wikimedia.org/T392212) [18:33:43] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:33:45] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:34:27] (03PS3) 10Dzahn: gerrit: replace legacy fact with modern fact [puppet] - 10https://gerrit.wikimedia.org/r/1137842 [18:35:04] (03PS1) 10Ssingh: P:durum: use /health instead of /check [puppet] - 10https://gerrit.wikimedia.org/r/1139919 [18:35:27] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:35:44] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5406/co" [puppet] - 10https://gerrit.wikimedia.org/r/1139919 (owner: 10Ssingh) [18:36:26] (03CR) 10Ssingh: [V:03+1] "Updating path from I64107416fdeaffbdc00c6b5481d12494d4ccfe0d." [puppet] - 10https://gerrit.wikimedia.org/r/1139919 (owner: 10Ssingh) [18:37:23] (03CR) 10Ssingh: [V:03+1 C:03+2] P:durum: use /health instead of /check [puppet] - 10https://gerrit.wikimedia.org/r/1139919 (owner: 10Ssingh) [18:37:53] (03PS4) 10Dzahn: gerrit: replace legacy fact with modern fact [puppet] - 10https://gerrit.wikimedia.org/r/1137842 [18:37:58] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1137842/5405/" [puppet] - 10https://gerrit.wikimedia.org/r/1137842 (owner: 10Dzahn) [18:39:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T392806)', diff saved to https://phabricator.wikimedia.org/P75643 and previous config saved to /var/cache/conftool/dbconfig/20250429-183913-fceratto.json [18:47:21] (03CR) 10GergesShamon: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139912 (https://phabricator.wikimedia.org/T392858) (owner: 10GergesShamon) [18:48:43] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, 
AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:48:45] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:53:39] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10778059 (10Jhancock.wm) @Papaul can you take a look at this one. 2047 is installed on 2048 and 2048 is installed on 2047. not sure where the swap happened. i checke... [18:54:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P75644 and previous config saved to /var/cache/conftool/dbconfig/20250429-185421-fceratto.json [18:58:07] !log fab@deploy1003 Started deploy [airflow-dags/research@414def7]: (no justification provided) [18:58:44] !log fab@deploy1003 Finished deploy [airflow-dags/research@414def7]: (no justification provided) (duration: 00m 50s) [19:01:04] (03PS7) 10GergesShamon: Change Arabic Wikipedia tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139912 (https://phabricator.wikimedia.org/T392858) [19:04:14] (03CR) 10CI reject: [V:04-1] Change Arabic Wikipedia tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139912 (https://phabricator.wikimedia.org/T392858) (owner: 10GergesShamon) [19:04:44] (03CR) 10Dzahn: [C:03+1] gerrit: add ports to hackathon nftables rule [puppet] - 10https://gerrit.wikimedia.org/r/1138995 (https://phabricator.wikimedia.org/T382309) (owner: 10Jelto) [19:06:10] (03PS8) 10GergesShamon: Change Arabic Wikipedia tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139912 (https://phabricator.wikimedia.org/T392858) [19:06:21] 06SRE, 06serviceops-radar: Cannot connect to MariaDB server from mwmaint1002 - https://phabricator.wikimedia.org/T392846#10778104 (10Dzahn) 
05Open→03Resolved a:03Dzahn Well, I would say this is resolved. Just needed more disk space. And follow-ups can be done over there on the linked task. [19:09:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P75645 and previous config saved to /var/cache/conftool/dbconfig/20250429-190927-fceratto.json [19:16:19] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:17:15] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:18:42] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:24:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T392806)', diff saved to https://phabricator.wikimedia.org/P75646 and previous config saved to /var/cache/conftool/dbconfig/20250429-192434-fceratto.json [19:24:55] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance [19:25:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2165 (T392806)', diff saved to https://phabricator.wikimedia.org/P75647 and previous config saved to /var/cache/conftool/dbconfig/20250429-192501-fceratto.json [19:33:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T392806)', diff saved to https://phabricator.wikimedia.org/P75648 and previous config saved to /var/cache/conftool/dbconfig/20250429-193316-fceratto.json [19:48:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance 
db2165', diff saved to https://phabricator.wikimedia.org/P75649 and previous config saved to /var/cache/conftool/dbconfig/20250429-194824-fceratto.json [19:53:42] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. [20:02:24] !log disabling Puppet on grafana2001 - T384841 [20:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:30] T384841: Upgrade to Grafana 11 - https://phabricator.wikimedia.org/T384841 [20:03:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P75651 and previous config saved to /var/cache/conftool/dbconfig/20250429-200331-fceratto.json [20:05:00] (03CR) 10Kimberly Sarabia: Stream registration for article summaries (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129958 (https://phabricator.wikimedia.org/T389097) (owner: 10Kimberly Sarabia) [20:09:23] (03PS3) 10Dzahn: miscweb: remove static-rt profile from legacy miscweb role [puppet] - 10https://gerrit.wikimedia.org/r/1137484 (https://phabricator.wikimedia.org/T385777) [20:14:35] (03CR) 10Dzahn: [C:03+2] miscweb: remove static-rt profile from legacy miscweb role [puppet] - 10https://gerrit.wikimedia.org/r/1137484 (https://phabricator.wikimedia.org/T385777) (owner: 10Dzahn) [20:17:21] (03CR) 10Scott French: [C:03+1] mw:maintenance:updatequerypages: move all deadendpages jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1139432 
(https://phabricator.wikimedia.org/T388534) (owner: 10Hnowlan) [20:17:32] (03CR) 10Scott French: [C:03+1] mw::maintenance: migrate a single updatequerypages_ancientpages shard [puppet] - 10https://gerrit.wikimedia.org/r/1139437 (https://phabricator.wikimedia.org/T388534) (owner: 10Hnowlan) [20:18:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T392806)', diff saved to https://phabricator.wikimedia.org/P75652 and previous config saved to /var/cache/conftool/dbconfig/20250429-201838-fceratto.json [20:18:42] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [20:18:58] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [20:19:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2166 (T392806)', diff saved to https://phabricator.wikimedia.org/P75653 and previous config saved to /var/cache/conftool/dbconfig/20250429-201905-fceratto.json [20:25:36] (03PS1) 10Dwisehaupt: monitoring: Fix check_puppetrun for failures on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1139930 (https://phabricator.wikimedia.org/T392961) [20:27:48] (03CR) 10Alexandros Kosiaris: [C:03+2] Allow releng to resume train related systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/1130947 (https://phabricator.wikimedia.org/T387823) (owner: 10Hashar) [20:28:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T392806)', diff saved to https://phabricator.wikimedia.org/P75654 and previous config saved to /var/cache/conftool/dbconfig/20250429-202827-fceratto.json [20:31:55] (03PS3) 10Scott French: 
P:mediawiki::maintenance::purge_loginnotify: migrate to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1139923 (https://phabricator.wikimedia.org/T388536) [20:35:58] (03PS1) 10GergesShamon: Change Arabic Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139932 (https://phabricator.wikimedia.org/T392858) [20:43:04] (03CR) 10CI reject: [V:04-1] Change Arabic Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139932 (https://phabricator.wikimedia.org/T392858) (owner: 10GergesShamon) [20:43:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P75655 and previous config saved to /var/cache/conftool/dbconfig/20250429-204334-fceratto.json [20:47:07] (03PS2) 10GergesShamon: Change Arabic Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139932 (https://phabricator.wikimedia.org/T392858) [20:55:34] (03CR) 10CI reject: [V:04-1] Change Arabic Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139932 (https://phabricator.wikimedia.org/T392858) (owner: 10GergesShamon) [20:58:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P75656 and previous config saved to /var/cache/conftool/dbconfig/20250429-205841-fceratto.json [21:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T2100) [21:03:57] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10778426 (10Jclark-ctr) @RobH this is a 740xd2 we have not had any of these decom yet [21:04:48] (03CR) 10Ryan Kemper: [C:03+2] fix inconsequential typos [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137356 (owner: 10Ryan Kemper) [21:05:01] (03CR) 10Ebernhardson: [C:03+1] wdqs-internal: remove disc 
records [dns] - 10https://gerrit.wikimedia.org/r/1136740 (https://phabricator.wikimedia.org/T376151) (owner: 10Ryan Kemper) [21:06:33] (03PS3) 10Ryan Kemper: wdqs-internal: remove disc records [dns] - 10https://gerrit.wikimedia.org/r/1136740 (https://phabricator.wikimedia.org/T376151) [21:12:46] (03PS4) 10Ryan Kemper: wdqs-internal: remove disc records [dns] - 10https://gerrit.wikimedia.org/r/1136740 (https://phabricator.wikimedia.org/T376151) [21:12:47] (03PS1) 10Ryan Kemper: wdqs-internal: remove lvs VIP [dns] - 10https://gerrit.wikimedia.org/r/1139936 (https://phabricator.wikimedia.org/T376151) [21:13:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T392806)', diff saved to https://phabricator.wikimedia.org/P75657 and previous config saved to /var/cache/conftool/dbconfig/20250429-211349-fceratto.json [21:14:08] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [21:14:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2167 (T392806)', diff saved to https://phabricator.wikimedia.org/P75658 and previous config saved to /var/cache/conftool/dbconfig/20250429-211415-fceratto.json [21:19:19] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS13030/IPv6: Connect - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:22:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T392806)', diff saved to https://phabricator.wikimedia.org/P75659 and previous config saved to /var/cache/conftool/dbconfig/20250429-212235-fceratto.json [21:22:46] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:23:39] FIRING: TransitBGPDown: Transit BGP session down 
between cr1-drmrs and Hurricane Electric (2001:7f8:36::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr1-drmrs:9804&var-bgp_group=Transit6&var-bgp_neighbor=Hurricane Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [21:28:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [21:31:19] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS13030/IPv6: Idle - Init7, AS6939/IPv6: Idle - HE, AS13030/IPv4: Idle - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:31:37] (03PS2) 10Ryan Kemper: query-legacy-full: set cluster in hiera [puppet] - 10https://gerrit.wikimedia.org/r/1139537 (https://phabricator.wikimedia.org/T384422) [21:31:48] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1139537 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper) [21:33:15] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 58, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:34:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/0 (Peering: DE-CIX (DXDB:NAS:173434 MAC filter) {#D0067}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:35:24] (03PS3) 10Ryan Kemper: query-legacy-full: set cluster in hiera [puppet] - 
10https://gerrit.wikimedia.org/r/1139537 (https://phabricator.wikimedia.org/T384422) [21:35:32] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1139537 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper) [21:35:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [21:37:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P75660 and previous config saved to /var/cache/conftool/dbconfig/20250429-213743-fceratto.json [21:38:15] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 59, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:39:26] (03CR) 10Ryan Kemper: [C:03+2] query-legacy-full: set cluster in hiera [puppet] - 10https://gerrit.wikimedia.org/r/1139537 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper) [21:39:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/0 (Peering: DE-CIX (DXDB:NAS:173434 MAC filter) {#D0067}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:40:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [21:49:44] (03PS7) 10BCornwall: varnish: Replace X-IS-ALT-DOMAIN with variable [puppet] - 10https://gerrit.wikimedia.org/r/1068085 
(https://phabricator.wikimedia.org/T373550) [21:52:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P75661 and previous config saved to /var/cache/conftool/dbconfig/20250429-215250-fceratto.json [21:55:37] RECOVERY - Host ms-be1060 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [21:58:42] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10778576 (10BCornwall) [21:58:52] FIRING: [10x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:58:58] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10778577 (10Jclark-ctr) Removed the BBU from the RAID card. After letting the server sit for 10 minutes without the BBU, I reinstall... 
[21:59:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:00:26] !log import ncmonitor 1.3.5 to bookworm-wikimedia [22:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:42] PROBLEM - Host ms-be1060 is DOWN: PING CRITICAL - Packet loss = 100% [22:07:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T392806)', diff saved to https://phabricator.wikimedia.org/P75662 and previous config saved to /var/cache/conftool/dbconfig/20250429-220757-fceratto.json [22:08:17] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance [22:08:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2181 (T392806)', diff saved to https://phabricator.wikimedia.org/P75663 and previous config saved to /var/cache/conftool/dbconfig/20250429-220823-fceratto.json [22:14:35] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [22:16:00] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [22:16:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T392806)', diff saved to https://phabricator.wikimedia.org/P75664 and previous config saved to /var/cache/conftool/dbconfig/20250429-221633-fceratto.json [22:21:22] (03CR) 10Dwisehaupt: "This code has been tested and rolled out for fr-tech. 
It only gets triggered if there is a puppet run failure to parse so may not have bee" [puppet] - https://gerrit.wikimedia.org/r/1139930 (https://phabricator.wikimedia.org/T392961) (owner: Dwisehaupt)
[22:31:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P75665 and previous config saved to /var/cache/conftool/dbconfig/20250429-223140-fceratto.json
[22:32:51] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[22:33:00] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[22:33:41] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[22:33:58] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[22:34:33] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[22:34:38] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[22:34:44] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10778614 (Papaul) @Jhancock.wm you have mismatch on serial number in netbox 91 is ganeti2047 and 90 is ganeti2048
[22:35:27] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:46:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P75666 and previous config saved to /var/cache/conftool/dbconfig/20250429-224647-fceratto.json
[23:01:56] !log fceratto@cumin1002 dbctl commit (dc=all):
'Repooling after maintenance db2181 (T392806)', diff saved to https://phabricator.wikimedia.org/P75667 and previous config saved to /var/cache/conftool/dbconfig/20250429-230155-fceratto.json
[23:02:15] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance
[23:02:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2195 (T392806)', diff saved to https://phabricator.wikimedia.org/P75668 and previous config saved to /var/cache/conftool/dbconfig/20250429-230222-fceratto.json
[23:10:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T392806)', diff saved to https://phabricator.wikimedia.org/P75669 and previous config saved to /var/cache/conftool/dbconfig/20250429-231031-fceratto.json
[23:18:42] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:25:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P75670 and previous config saved to /var/cache/conftool/dbconfig/20250429-232538-fceratto.json
[23:29:06] (CR) Ssingh: varnish: Replace X-IS-ALT-DOMAIN with variable (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1068085 (https://phabricator.wikimedia.org/T373550) (owner: BCornwall)
[23:31:36] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:32:32] PROBLEM - Hadoop NodeManager on analytics1071 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:38:52] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:39:48] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:39:58] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1139952
[23:39:58] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1139952 (owner: TrainBranchBot)
[23:40:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P75671 and previous config saved to /var/cache/conftool/dbconfig/20250429-234045-fceratto.json
[23:41:32] RECOVERY - Hadoop NodeManager on analytics1071 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:42:52] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs.
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:44:48] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:48:54] jouncebot: nowandnext
[23:48:54] No deployments scheduled for the next 6 hour(s) and 11 minute(s)
[23:48:54] In 6 hour(s) and 11 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T0600)
[23:50:24] (CR) Zabe: [C:+2] enwiki and commons: Increase revision-slots cache expiry again [mediawiki-config] - https://gerrit.wikimedia.org/r/1139577 (https://phabricator.wikimedia.org/T183490) (owner: Zabe)
[23:51:16] (Merged) jenkins-bot: enwiki and commons: Increase revision-slots cache expiry again [mediawiki-config] - https://gerrit.wikimedia.org/r/1139577 (https://phabricator.wikimedia.org/T183490) (owner: Zabe)
[23:51:44] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1139952 (owner: TrainBranchBot)
[23:51:53] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1139577|enwiki and commons: Increase revision-slots cache expiry again (T183490)]]
[23:51:58] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490
[23:53:42] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:55:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T392806)', diff saved to https://phabricator.wikimedia.org/P75672 and previous config saved to /var/cache/conftool/dbconfig/20250429-235552-fceratto.json
[23:56:12] !log fceratto@cumin1002 DONE (PASS) -
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance
[23:58:43] !log zabe@deploy1003 zabe: Backport for [[gerrit:1139577|enwiki and commons: Increase revision-slots cache expiry again (T183490)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[23:58:48] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490
[23:58:53] !log zabe@deploy1003 zabe: Continuing with sync
[23:59:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed